TreeTagger fails in Google Compute Engine with Apache Spark

reckart commented 9 years ago

Original issue 20 created by reckart on 2014-11-25T11:24:59.000Z:

What steps will reproduce the problem?

Create an App with TreeTaggerWrapper
Create a .jar of the project and upload it to a Google Compute Engine instance that runs Apache Spark with some worker instances
Run it on GCE

Output:

        Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 8, spark-worker-1a.c.****.internal): java.lang.NullPointerException:
        org.annolab.tt4j.TreeTaggerWrapper.removeProblematicTokens(TreeTaggerWrapper.java:684)
        org.annolab.tt4j.TreeTaggerWrapper.process(TreeTaggerWrapper.java:557)

When I run my app on my local machine, it works fine. I only get this error message when I run it on Google Compute Engine using Apache Spark. I already enabled performance mode with treetagger.setPerformanceMode(true), but I still get the same error message.

Duplicate: http://stackoverflow.com/questions/27123826/treetaggerwrapper-fails-in-google-compute-engine-with-apache-spark?noredirect=1

reckart commented 9 years ago

Comment #1 originally posted by reckart on 2014-11-25T11:30:55.000Z:

<empty>

reckart commented 9 years ago

Comment #2 originally posted by reckart on 2014-11-25T11:32:56.000Z:

I'd be surprised if the GAE allowed you to run native binaries. Are you sure this is allowed?

reckart commented 9 years ago

Comment #3 originally posted by reckart on 2014-11-25T11:48:21.000Z:

When I try the example from command line (inside GCE) $ echo 'Hello world!' | cmd/tree-tagger-english-utf8 (see: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) it works.

reckart commented 9 years ago

Comment #4 originally posted by reckart on 2014-11-25T11:50:51.000Z:

Can you reproduce the original problem or did you just copy the stackoverflow report here?

If you can reproduce it, what version of TT4J are you using?

reckart commented 9 years ago

Comment #5 originally posted by reckart on 2014-11-25T11:56:21.000Z:

I'm the author of the stackoverflow report =) 1.2.0

reckart commented 9 years ago

Comment #6 originally posted by reckart on 2014-11-25T12:03:58.000Z:

I see :) Great!

The problematic line in 1.2.0 is this one:

        boolean isUnicode = "UTF-8".equals(_model.getEncoding().toUpperCase(Locale.US));

I can see potential for a NPE here, but I wonder why it works locally but not on the GCE.

Do you provide an encoding for the model? Does GCE have a problem with Locale.US?

Are you using tt4j directly or within another framework, e.g. in DKPro Core? If you are using it directly, you might want to give version 1.2.1 a try which offers a way of setting a model without using a model resolver.
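To illustrate the NPE potential in that line, here is a minimal sketch in plain Java. It does not use tt4j; the class and method names are hypothetical, and it only covers the case where the encoding is null (a null _model would fail one step earlier, in the same expression).

```java
import java.util.Locale;

public class EncodingCheck {
    // Same shape as the 1.2.0 line quoted above:
    // throws NullPointerException when encoding is null.
    static boolean isUnicodeUnsafe(String encoding) {
        return "UTF-8".equals(encoding.toUpperCase(Locale.US));
    }

    // Null-safe variant: equalsIgnoreCase returns false for a null argument.
    static boolean isUnicodeSafe(String encoding) {
        return "UTF-8".equalsIgnoreCase(encoding);
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeSafe("utf-8"));
        System.out.println(isUnicodeSafe(null));
        try {
            isUnicodeUnsafe(null);
        } catch (NullPointerException e) {
            System.out.println("NPE for null encoding");
        }
    }
}
```

The null-safe variant only hides the symptom. If the encoding or the model is missing, the real question is still why it was never set on that machine.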

reckart commented 9 years ago

Comment #7 originally posted by reckart on 2014-11-25T12:28:07.000Z:

My local machine is Win8, the GCE has Debian (so I use different treetagger packages). My setup is nearly the same as you provide in your example:

        System.setProperty("treetagger.home", "/home/spark/resources/treetagger");
        try {
            //tt.setModel("c:/treetagger/lib/german-utf8.par"); //local
            tt.setModel("/home/spark/resources/treetagger/lib/german-utf8.par"); //gce
            tt.setPerformanceMode(true);
            tt.setHandler(new TokenHandler() {
                public void token(String token, String pos, String lemma) {
                    output.put(token, lemma.toLowerCase().replace("_", " "));
                }
            });
        }

So I use it directly. I just tried v1.2.1 (with no changes to my source code), and it produces the same error. Should I change my setup? How?

reckart commented 9 years ago

Comment #8 originally posted by reckart on 2014-11-25T13:07:00.000Z:

When loading a model, you should specify an encoding. This can be done in two ways:

1)

treetagger.setModel(modelFile.getPath() + ":" + encoding);

2) (works only with 1.2.1+)

        DefaultModel model = new DefaultModel(
                modelFile.getPath() + ":" + encoding,
                modelFile,
                encoding,
                DefaultModel.DEFAULT_FLUSH_SEQUENCE);
        treetagger.setModel(model);

reckart commented 9 years ago

Comment #9 originally posted by reckart on 2014-11-25T14:47:12.000Z:

I just found out something that I should have tested much earlier: when I run my app on my local machine, I set an option to run it only on that one local machine. When I run it in GCE, I set an option for a "parallel run", meaning the task is submitted to multiple worker instances so that it can be processed in parallel. Now I set the option for "local run" in GCE, and it succeeded!

reckart commented 9 years ago

Comment #10 originally posted by reckart on 2014-11-25T15:54:36.000Z:

OK, sounds like this issue can be closed then :)

reckart commented 9 years ago

Comment #11 originally posted by reckart on 2014-11-25T17:58:27.000Z:

@Steffen: one more question: what does it mean to set the option for "local run", and how do you think that could indirectly trigger the NPE?

reckart commented 9 years ago

Comment #12 originally posted by reckart on 2014-11-26T08:32:38.000Z:

No, it was my fault all along: some time ago I deleted and re-installed my worker instances, but I didn't install TreeTagger on the new worker instances! I forgot about that =/
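For anyone hitting the same thing: a small pre-flight check run on each worker would have surfaced the missing installation directly instead of as an NPE. The following is only a sketch in plain Java; the class and method names are hypothetical, and the bin/tree-tagger and lib/<model> layout under treetagger.home is an assumption based on the standard Linux TreeTagger package.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TreeTaggerPreflight {
    // Returns the expected files that are missing or unusable on this machine.
    static List<Path> missingFiles(Path home, String modelFileName) {
        List<Path> missing = new ArrayList<>();
        Path binary = home.resolve("bin").resolve("tree-tagger"); // assumed Linux layout
        Path model = home.resolve("lib").resolve(modelFileName);
        if (!Files.isExecutable(binary)) {
            missing.add(binary);
        }
        if (!Files.isReadable(model)) {
            missing.add(model);
        }
        return missing;
    }

    public static void main(String[] args) {
        // Falls back to the path used earlier in this thread if the property is unset.
        Path home = Paths.get(System.getProperty(
                "treetagger.home", "/home/spark/resources/treetagger"));
        List<Path> missing = missingFiles(home, "german-utf8.par");
        if (missing.isEmpty()) {
            System.out.println("TreeTagger files found under " + home);
        } else {
            System.out.println("TreeTagger files missing on this node: " + missing);
        }
    }
}
```

In a Spark job, the check would have to run inside the task code that executes on the workers, not only in the driver, because the driver machine may have TreeTagger installed while the workers do not.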
