idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Chinese Wikipedia StackOverflowError #14

Open nick-magnini opened 8 years ago

nick-magnini commented 8 years ago

Running the org.idio.wikipedia.word2vec.Word2VecCorpus class on the Chinese Wikipedia dump throws the following error while creating the word2vec corpus:

```
java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3705)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4160)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    .........
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4144)
    at java.util.regex.Pattern$Slice.match(Pattern.java:3882)
    at java.util.regex.Pattern$Start.match(Pattern.java:3420)
    at java.util.regex.Matcher.search(Matcher.java:1211)
    at java.util.regex.Matcher.find(Matcher.java:604)
    at java.util.regex.Matcher.replaceAll(Matcher.java:914)
    at scala.util.matching.Regex.replaceAllIn(Regex.scala:298)
    at org.idio.wikipedia.word2vec.ArticleCleaner$.cleanStyle(ArticleCleaner.scala:69)
    at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:65)
    at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:56)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
    at java.lang.Thread.run(Thread.java:809)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-01-25 16:25:08 WARN  TaskSetManager:71 - Lost task 57.0 in stage 0.0 (TID 57, localhost): TaskKilled (killed intentionally)
```
dav009 commented 8 years ago

Hm, rather weird; I have definitely not used this on any Asian language.

This is probably related to this issue: http://stackoverflow.com/questions/7509905/java-lang-stackoverflowerror-while-using-a-regex-to-parse-big-strings (java.util.regex matches some quantifiers recursively, one stack frame per repetition, so a long enough match can exhaust the stack). I definitely want to replace the class I added to clean the Wikipedia boilerplate. I assume Chinese can have very long paragraphs with no spaces whatsoever?
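For reference, here is the usual workaround for that class of failure: run the regex work on a thread whose stack size you choose yourself, via `Thread`'s four-argument constructor (`stackSize` is only a hint the JVM may ignore). This is just a minimal sketch; the class name, regex, and input are made up for illustration and are not the actual wiki2vec code.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class BigStackRegex {
    // Runs `body` on a dedicated thread with the requested stack size and
    // returns its result, rethrowing anything the body threw.
    static String withLargeStack(long stackSizeBytes, Supplier<String> body) {
        AtomicReference<String> result = new AtomicReference<>();
        AtomicReference<Throwable> error = new AtomicReference<>();
        Thread t = new Thread(null, () -> {
            try {
                result.set(body.get());
            } catch (Throwable e) { // StackOverflowError is an Error, so catch Throwable
                error.set(e);
            }
        }, "big-stack-regex", stackSizeBytes);
        t.start();
        try {
            t.join(); // join() gives happens-before, so reading the refs is safe
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(ie);
        }
        if (error.get() != null) throw new RuntimeException(error.get());
        return result.get();
    }

    public static void main(String[] args) {
        // A long run with no whitespace, loosely mimicking a Chinese paragraph.
        String text = "字".repeat(200_000);
        String cleaned = withLargeStack(64L * 1024 * 1024,
                () -> text.replaceAll("字+", "")); // illustrative pattern only
        System.out.println(cleaned.length()); // prints 0
    }
}
```

The same effect is achievable globally with `-Xss` on the executor JVMs (e.g. via `spark.executor.extraJavaOptions`), but the per-thread constructor keeps the big stack confined to the cleaning step.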

I will try to address this and the other issues you have mentioned over the weekend.
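One option I'm considering for the rewrite (a sketch under my own assumptions, not current wiki2vec behaviour): bound how much text each regex call sees by cleaning the article in fixed-size chunks, so backtracking depth stays bounded regardless of paragraph length. The pattern here (runs of wiki quote markup) is illustrative, not the real `ArticleCleaner.cleanStyle` regex.

```java
import java.util.regex.Pattern;

public class ChunkedCleaner {
    private static final Pattern QUOTES = Pattern.compile("'{2,}");

    // Applies the cleaning regex per chunk; each match sees at most
    // chunkSize characters. Caveat: markup that straddles a chunk
    // boundary survives uncleaned.
    static String cleanInChunks(String article, int chunkSize) {
        StringBuilder out = new StringBuilder(article.length());
        for (int i = 0; i < article.length(); i += chunkSize) {
            String chunk = article.substring(i, Math.min(i + chunkSize, article.length()));
            out.append(QUOTES.matcher(chunk).replaceAll(""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanInChunks("'''bold''' and ''italic''", 10_000));
        // prints: bold and italic
    }
}
```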

jiesutd commented 6 years ago

Hi, is there any update on this problem? I am also facing a similar issue when dealing with Chinese text:

```
2017-12-22 23:11:31 ERROR Executor:96 - Exception in task 122.0 in stage 0.0 (TID 122)
java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4250)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
2017-12-22 22:55:03 INFO  Executor:59 - Executor is trying to kill task 125.0 in stage 0.0 (TID 125)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-12-22 22:55:03 INFO  Executor:59 - Executor is trying to kill task 126.0 in stage 0.0 (TID 126)
2017-12-22 22:55:03 INFO  Executor:59 - Executor is trying to kill task 123.0 in stage 0.0 (TID 123)
```