master / spark-stemming

Spark MLlib wrapper for the Snowball framework
BSD 2-Clause "Simplified" License

Allow Stemmer to be written to and read from Pipeline #10

Closed by esap120 5 years ago

esap120 commented 6 years ago

Hey there, this pull request allows the Stemmer Transformer to be written out as part of a PipelineModel and read back in as part of one.

For instance, the following example currently fails at runtime:

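import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.feature.Stemmer

// `training` and `test` are assumed to be DataFrames with "text" and "label" columns.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val stemmer = new Stemmer()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("stems")
  .setLanguage("English")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(stemmer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, stemmer, hashingTF, lr))

// Fit the pipeline, then persist it. Without this PR the save cannot succeed,
// because the Stemmer stage is not writable.
val model = pipeline.fit(training)
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// Load the saved pipeline back and apply it to the test data.
val sameModel = PipelineModel.read.load("/tmp/spark-logistic-regression-model")

sameModel.transform(test)
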
jacek-rzrz commented 5 years ago

Is there any chance of merging this?

esap120 commented 5 years ago

@master bump

@jacek-rzrz In the meantime, you can try something similar to what I did, which was to extend the Stemmer class and make your own writable version: https://github.com/esap120/spark-twitter-streaming/blob/master/src/main/scala/WritableStemmer.scala
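
For anyone who can't open the link, here is a rough sketch of what such a subclass could look like. It is not a copy of the linked file. It assumes a Spark version where `DefaultParamsWritable` and `DefaultParamsReadable` are public, and that `Stemmer` can be subclassed and has a constructor taking a `uid: String`. The UID prefix `"writableStemmer"` is just a name picked for illustration.

```scala
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.mllib.feature.Stemmer

// Sketch of a Stemmer subclass that can be saved as a pipeline stage.
// DefaultParamsWritable persists only the params (inputCol, outputCol, language),
// which should be enough if the stemmer holds no other state (an assumption).
class WritableStemmer(override val uid: String)
    extends Stemmer(uid) with DefaultParamsWritable {

  // No-arg constructor that generates a fresh UID; the prefix is arbitrary.
  def this() = this(Identifiable.randomUID("writableStemmer"))
}

// The companion object provides `read` / `load`, which PipelineModel loading
// looks up by class name when it restores each stage.
object WritableStemmer extends DefaultParamsReadable[WritableStemmer]
```

With something like this, swapping `new Stemmer()` for `new WritableStemmer()` in the pipeline above should let `model.write` and `PipelineModel.read.load` go through, as long as the class is on the classpath when loading.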

master commented 5 years ago

Thanks, merged