salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

https://transmogrif.ai

BSD 3-Clause "New" or "Revised" License

TextTokenizer issues #275

Closed razou closed 5 years ago

razou commented 5 years ago

Has anybody experienced the following issue with the TextTokenizer transformer?

Environment: Spark 2.3.3 with Zeppelin 0.8.0 on AWS EMR

Edit: I'm using Spark 2.3.2 instead of 2.3.3, and com.salesforce.transmogrifai:transmogrifai-core_2.11:0.5.1

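import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._ // Text and the .toText conversion
import com.salesforce.op.stages.impl.feature.TextTokenizer
// Book is the poster's own record class (not shown in the thread)
val rawTitle = FeatureBuilder.Text[Book].extract(_.rawTitle.toText).asPredictor
val rawTitleTokens = new TextTokenizer().setInput(rawTitle).setAutoDetectLanguage(true).setToLowercase(true).setMinTokenLength(2).getOutput()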

The error I'm getting

import com.salesforce.op.stages.impl.feature.TextTokenizer
rawTitle: com.salesforce.op.features.Feature[com.salesforce.op.features.types.Text] = Feature(name = rawTitle, uid = Text_00000000000e, isResponse = false, originStage = FeatureGeneratorStage_00000000000e, parents = [], distributions = [])
java.lang.NoClassDefFoundError: Could not initialize class com.salesforce.op.stages.impl.feature.TextTokenizer$
  ... 60 elided

Thanks

tovbinm commented 5 years ago

Is there a full stack trace?

There might be an issue with the Spark version. We have tested it with Spark 2.3.2.

razou commented 5 years ago

Yes, that's the full stack trace. Sorry for the mistake: I'm using Spark 2.3.2. Maybe it's something related to Zeppelin. I'll continue to test and investigate.

tovbinm commented 5 years ago

I would need more information to help you. Try directly accessing the TextTokenizer object as follows:

com.salesforce.op.stages.impl.feature.TextTokenizer.tokenize(Text("hello world"))
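A note on the error above: on the JVM, "Could not initialize class X" is reported on every access after X's static initializer has already failed once. The underlying exception is normally attached only to the first failure, as the cause of an ExceptionInInitializerError. Below is a minimal sketch for capturing that root cause; it has to run in a freshly restarted interpreter, before anything else touches the class. InitDiagnostics, probe and rootCause are hypothetical helper names for illustration, not part of TransmogrifAI or Spark.

```scala
import scala.annotation.tailrec

// Hypothetical helper for illustration; not part of TransmogrifAI or Spark.
object InitDiagnostics {

  /** Follows getCause links down to the innermost throwable. */
  @tailrec
  def rootCause(t: Throwable): Throwable = {
    val cause = t.getCause
    if (cause == null || (cause eq t)) t else rootCause(cause)
  }

  /**
   * Evaluates `body` and returns Left(root cause) if it throws.
   * Catches Throwable on purpose: ExceptionInInitializerError and
   * NoClassDefFoundError are LinkageErrors, which scala.util.Try
   * treats as fatal and rethrows.
   */
  def probe[A](body: => A): Either[Throwable, A] =
    try Right(body)
    catch { case t: Throwable => Left(rootCause(t)) }
}
```

With the TransmogrifAI imports in scope, wrapping the call above as InitDiagnostics.probe(TextTokenizer.tokenize(Text("hello world"))) should give Right(result) when the object initializes cleanly, or Left(cause) with the innermost exception when the first initialization attempt fails.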

razou commented 5 years ago

Thanks @tovbinm. It works outside of Zeppelin.

Is there a mailing list where we can ask these kinds of questions instead of opening an issue ticket?

razou commented 5 years ago

Hey @tovbinm, have you tested this with TransmogrifAI 0.5.2? Calling Seq(A, B, C).transmogrify(), where A, B and C are com.salesforce.op.features.Feature instances, fails with:

  at com.salesforce.op.utils.text.LuceneTextAnalyzer$.<init>(LuceneTextAnalyzer.scala:133)
  at com.salesforce.op.utils.text.LuceneTextAnalyzer$.<clinit>(LuceneTextAnalyzer.scala)
  at com.salesforce.op.stages.impl.feature.TextTokenizer$.<init>(TextTokenizer.scala:126)
  at com.salesforce.op.stages.impl.feature.TextTokenizer$.<clinit>(TextTokenizer.scala)
  at com.salesforce.op.stages.impl.feature.TransmogrifierDefaults$class.$init$(Transmogrifier.scala:85)
  at com.salesforce.op.stages.impl.feature.TransmogrifierDefaults$.<init>(Transmogrifier.scala:90)
  at com.salesforce.op.stages.impl.feature.TransmogrifierDefaults$.<clinit>(Transmogrifier.scala)
  at com.salesforce.op.dsl.RichFeaturesCollection$RichAnyFeaturesCollection.transmogrify(RichFeaturesCollection.scala:70)
  ... 60 elided

NB: It works when I use 0.5.1. Thanks

tovbinm commented 5 years ago

Works just fine for me from spark-shell:

$SPARK_HOME/bin/spark-shell --packages com.salesforce.transmogrifai:transmogrifai-core_2.11:0.5.2
...
Spark context available as 'sc' (master = local[*], app id = local-1555341642349).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import com.salesforce.op.features.types._
import com.salesforce.op.features.types._

scala> com.salesforce.op.stages.impl.feature.TextTokenizer.tokenize(Text("hello world"))
res0: com.salesforce.op.stages.impl.feature.TextTokenizer.TextTokenizerResult = TextTokenizerResult(Unknown,List(TextList(hello, world)))

razou commented 5 years ago

I think that the problem is Zeppelin. Thanks
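The thread never pins down what differs inside Zeppelin. One way to compare the two environments is to check which jar each class in the stack trace is loaded from, in both Zeppelin and spark-shell. The sketch below uses only standard JVM reflection; ClassOrigin and locationOf are hypothetical names, and that a classpath difference is the cause at all is an assumption rather than something the thread establishes.

```scala
// Hypothetical helper for illustration; uses only standard JVM reflection.
object ClassOrigin {

  /** Returns the jar or directory a class was loaded from, if the JVM exposes it. */
  def locationOf(cls: Class[_]): Option[String] =
    for {
      domain   <- Option(cls.getProtectionDomain)
      source   <- Option(domain.getCodeSource)
      location <- Option(source.getLocation)
    } yield location.toString

  /**
   * Looks a class up by name without running its static initializer
   * (initialize = false), so a class whose initialization is broken
   * can still be located.
   */
  def locationOf(className: String): Either[Throwable, Option[String]] = {
    val loader = Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    try Right(locationOf(Class.forName(className, false, loader)))
    catch { case t: Throwable => Left(t) }
  }
}
```

For instance, ClassOrigin.locationOf("com.salesforce.op.utils.text.LuceneTextAnalyzer$") can be run in both environments and the results compared; a Left means the class is not visible to that class loader at all.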