Open nick-magnini opened 8 years ago
Uh, rather weird, but I have definitely not used this on any Asian language.
This is probably related to this issue: http://stackoverflow.com/questions/7509905/java-lang-stackoverflowerror-while-using-a-regex-to-parse-big-strings I definitely want to replace the class I added to clean the Wikipedia boilerplate. I assume Chinese can have very long paragraphs with no spaces whatsoever?
I will try to address this and the other issues you have mentioned over the weekend.
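For illustration, here is a minimal Java sketch of the failure mode described in the linked Stack Overflow question. It is not the cleaning class from this project: the class name `RegexStackSketch`, both patterns, and the helper methods are hypothetical and chosen only for the demonstration. It runs a match on a worker thread with a requested stack size (the JVM treats that size as a hint) and reports whether the match completed or overflowed the stack. A grouped alternation under `*` is typically evaluated recursively by `java.util.regex`, so it may overflow on a long paragraph depending on the JVM version and stack size, while a single character class with a possessive quantifier is matched in a loop.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.regex.Pattern;

final class RegexStackSketch {
    // Hypothetical pattern: grouped alternation under '*'.
    // java.util.regex typically recurses once per repetition here.
    public static final Pattern RISKY = Pattern.compile("(?:\\S|\\s)*");

    // Hypothetical rewrite accepting the same strings: one character class
    // with a possessive quantifier, matched iteratively without backtracking.
    public static final Pattern SAFER = Pattern.compile("[\\S\\s]*+");

    // Runs a full match on a worker thread with the requested stack size
    // (a hint to the JVM) and reports the outcome as a string.
    public static String tryMatch(Pattern pattern, String input, long stackBytes) {
        AtomicReference<String> outcome = new AtomicReference<>("NOT_RUN");
        Runnable task = () -> {
            try {
                outcome.set(pattern.matcher(input).matches() ? "MATCH" : "NO_MATCH");
            } catch (StackOverflowError e) {
                outcome.set("STACK_OVERFLOW");
            }
        };
        Thread worker = new Thread(null, task, "regex-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        }
        return outcome.get();
    }

    // Builds a paragraph of the given length with no spaces at all.
    public static String longParagraph(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('\u4e2d'); // U+4E2D, a common CJK character
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = longParagraph(200_000);
        System.out.println("risky: " + tryMatch(RISKY, text, 512 * 1024));
        System.out.println("safer: " + tryMatch(SAFER, text, 512 * 1024));
    }
}
```

If the risky pattern overflows in a given environment, the other common workaround is to give the matching thread a larger stack, which only postpones the limit and does not remove it.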
Hi, is there any update on this problem? I am also facing a similar problem when dealing with Chinese text:
2017-12-22 23:11:31 ERROR Executor:96 - Exception in task 122.0 in stage 0.0 (TID 122)
java.lang.StackOverflowError
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4250)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4263)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
2017-12-22 22:55:03 INFO Executor:59 - Executor is trying to kill task 125.0 in stage 0.0 (TID 125)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-12-22 22:55:03 INFO Executor:59 - Executor is trying to kill task 126.0 in stage 0.0 (TID 126)
2017-12-22 22:55:03 INFO Executor:59 - Executor is trying to kill task 123.0 in stage 0.0 (TID 123)
The Chinese Wikipedia dump triggers this error when creating the word2vec corpus with the org.idio.wikipedia.word2vec.Word2VecCorpus class.