master / spark-stemming

Spark MLlib wrapper for the Snowball framework
BSD 2-Clause "Simplified" License

Stemming problem #8

Closed ensozos closed 6 years ago

ensozos commented 6 years ago

The sentence "12-Gauge Angle" stems "angle" to "angl", which is correct, but the sentence "Angle brucket" returns "angle" unchanged as the stemmed word.

master commented 6 years ago

Can't reproduce:

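scala> import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.Tokenizer

scala> import org.apache.spark.mllib.feature.Stemmer
import org.apache.spark.mllib.feature.Stemmer

scala> val sentenceDataFrame = spark.createDataFrame(Seq((0, "12-Gauge Angle"), (1, "Angle brucket"))).toDF("id", "sentence")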
sentenceDataFrame: org.apache.spark.sql.DataFrame = [id: int, sentence: string]

scala> sentenceDataFrame.show
+---+--------------+
| id|      sentence|
+---+--------------+
|  0|12-Gauge Angle|
|  1| Angle brucket|
+---+--------------+

scala> val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_73998f8e2c95

scala> val data = tokenizer.transform(sentenceDataFrame).select("words")
data: org.apache.spark.sql.DataFrame = [words: array<string>]

scala> data.show
+-----------------+
|            words|
+-----------------+
|[12-gauge, angle]|
| [angle, brucket]|
+-----------------+

scala> new Stemmer().setInputCol("words").setOutputCol("stemmed").setLanguage("English").transform(data).show(false)
+-----------------+---------------+
|words            |stemmed        |
+-----------------+---------------+
|[12-gauge, angle]|[12-gaug, angl]|
|[angle, brucket] |[angl, brucket]|
+-----------------+---------------+

master commented 6 years ago

Closing as can't reproduce. Feel free to reopen if needed.

ensozos commented 6 years ago

I was on the older version (0.1.1), which is why I was getting the bug. With 0.2.0 it works perfectly! Sorry for the silly mistake, and thank you for your reply.

master commented 6 years ago

No worries :)
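
The symptom reported in this thread, the same token receiving different stems in different rows, can be checked mechanically when comparing library versions. The following is a minimal sketch in plain Scala, not part of spark-stemming: `StemConsistency` and `inconsistentStems` are hypothetical names, and the rows are typed in by hand rather than collected from a DataFrame. The `fixed` rows copy the 0.2.0 output shown above. The `reported` rows follow the original description, where only the stem of "angle" is stated, so the other stems in that data set are assumptions.

```scala
// Hypothetical helper, not part of spark-stemming.
object StemConsistency {
  // Each row pairs a token sequence with its stems, mirroring the
  // "words" and "stemmed" columns in the output above.
  type Row = (Seq[String], Seq[String])

  // Returns every token that was mapped to more than one distinct stem.
  def inconsistentStems(rows: Seq[Row]): Map[String, Set[String]] =
    rows
      .flatMap { case (words, stems) => words.zip(stems) } // pair each token with its stem
      .groupBy { case (word, _) => word }                  // collect all stems seen per token
      .map { case (word, pairs) => word -> pairs.map(_._2).toSet }
      .filter { case (_, stems) => stems.size > 1 }        // keep only conflicting tokens

  // Rows copied from the 0.2.0 output in this thread.
  val fixed: Seq[Row] = Seq(
    (Seq("12-gauge", "angle"), Seq("12-gaug", "angl")),
    (Seq("angle", "brucket"), Seq("angl", "brucket"))
  )

  // Rows as described in the original report; stems other than the
  // one for "angle" are assumed, since the report does not give them.
  val reported: Seq[Row] = Seq(
    (Seq("12-gauge", "angle"), Seq("12-gaug", "angl")),
    (Seq("angle", "brucket"), Seq("angle", "brucket"))
  )

  def main(args: Array[String]): Unit = {
    println(inconsistentStems(fixed))
    println(inconsistentStems(reported))
  }
}
```

An empty result means every token was stemmed the same way wherever it appeared. A non-empty result lists the conflicting stems per token, which is the pattern described in the first comment.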