optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages when running on spark #72

Open ekapratama93 opened 7 years ago

ekapratama93 commented 7 years ago

I get the error java.lang.NoClassDefFoundError: Could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages when using language-detector in Spark. I'm using the suggested method to load the profiles:

List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
        .withProfiles(languageProfiles)
        .build();

here is some stacktrace :

16/11/30 17:36:05 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,ExceptionFailure(java.lang.NoClassDefFoundError,Could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages,[Ljava.lang.StackTraceElement;@4a5ae036,java.lang.NoClassDefFoundError: Could not initialize class com.optimaize.langdetect.profiles.BuiltInLanguages
    at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:118)
    at com.ebdesk.ph.nlp_sentence.TwitterWord2Vec$1.call(TwitterWord2Vec.java:86)
    at com.ebdesk.ph.nlp_sentence.TwitterWord2Vec$1.call(TwitterWord2Vec.java:1)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1028)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
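For context, the two lines from the report above are normally part of a flow like the following (a sketch based on the project README; the sample text and wrapper class are illustrative, not from this report):

import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;
import java.io.IOException;
import java.util.List;

public class DetectExample {
    public static void main(String[] args) throws IOException {
        // Loading the built-in profiles is what triggers the static
        // initialization of BuiltInLanguages that fails in the report above.
        List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();

        // Build the detector once and reuse it.
        LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                .withProfiles(languageProfiles)
                .build();

        // Wrap the input text and query the detector.
        TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();
        TextObject textObject = textObjectFactory.forText("Ceci est un petit texte en français.");
        Optional<LdLocale> lang = languageDetector.detect(textObject);
        System.out.println(lang.isPresent() ? lang.get().getLanguage() : "unknown");
    }
}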

DanielGSM commented 7 years ago

Some late help for anybody who hits the same problem and finds this issue (as I did), based on everything I've been able to find. I hope it's useful, because I came across quite a few people with this problem but not many complete solutions. Since it was my first time with Spark and I was also struggling with non-serializable tasks, trying to instantiate things in the workers themselves, and more, it wasn't easy to pinpoint this as the real problem; it could have been a lot of things.

Be aware, though, that I'm by no means a Spark expert.

The root cause is a Guava version conflict: Spark puts an older Guava on the executor classpath, and LdLocale.fromString(String), which BuiltInLanguages appears to call during its static initialization, uses Guava's Splitter.splitToList, a method that does not exist in that older Guava (it was only added in Guava 15). So class initialization fails and you get the NoClassDefFoundError above. LdLocale.fromString(String) does something like

[...]
List<String> strings = Splitter.on('-').splitToList(string);
[...]
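To confirm which Guava actually wins on the classpath, a quick generic check (my addition, not from the original comment) is to print which jar the Splitter class was loaded from:

import com.google.common.base.Splitter;

public class WhichGuava {
    public static void main(String[] args) {
        // Prints the jar Splitter came from; run inside the Spark job to see
        // whether Spark's bundled Guava or your own is being loaded.
        System.out.println(Splitter.class.getProtectionDomain()
                .getCodeSource().getLocation());
    }
}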

The code the fork (linked below) uses instead:

//List<String> strings = Splitter.on('-').splitToList(string);
List<String> strings = new ArrayList<String>();
String[] stringParts = string.split("-");
for (String stringpart: stringParts){
    strings.add(stringpart);
}

The repository with the code: https://github.com/netarchivesuite/language-detector/commit/57ba6edda1e59ad25e24c395091805731b9df43a

The Jira issue where I found the repository: https://sbforge.org/jira/browse/WEBDAN-86?focusedCommentId=31306&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-31306
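For completeness, the same workaround can be written more compactly with plain JDK calls; this is my paraphrase of the loop above, not the fork's actual code:

import java.util.Arrays;
import java.util.List;

public class SplitWithoutGuava {
    // Same effect as the loop above, using only the JDK. Note that
    // Arrays.asList returns a fixed-size list; wrap it in
    // new ArrayList<>(...) if the caller needs to mutate it.
    static List<String> split(String string) {
        return Arrays.asList(string.split("-"));
    }
}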

andresviikmaa commented 6 years ago

Adding --conf "spark.executor.userClassPathFirst=true" to spark-submit makes Spark load user jars first, so you can bundle a newer Guava into your Spark job's jar.
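The same setting can also be applied programmatically when building the job (a sketch; the app name is a placeholder, and spark.driver.userClassPathFirst is the analogous flag for the driver side if the detector is also built there):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LangDetectJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("langdetect-job") // placeholder name
                // Prefer classes from the user's jar over Spark's bundled ones
                // (e.g. a newer Guava) on the executors:
                .set("spark.executor.userClassPathFirst", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}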

brad-safetonet commented 6 years ago

I tried the "userClassPathFirst" strategy (Spark 2.3.1), but unfortunately adding that config seemed to bork something else unrelated in Spark. Possibly Spark depends on the behavior of the older version of Guava and going up to Guava 19 makes it blow up? Hard to tell.

brad-safetonet commented 6 years ago

The GitHub fork in the comment by @DanielGSM has a jar file that can be dropped into projects to solve this. Probably the easiest solution.