TreeTagger fails in Google Compute Engine with Apache Spark

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Create an App with TreeTaggerWrapper
2. Create a .jar of the project and upload it to a Google Compute Engine 
instance that runs Apache Spark with some worker instances
3. Run it on GCE

Output:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost 
task 1.3 in stage 1.0 (TID 8, spark-worker-1a.c.********.internal): 
java.lang.NullPointerException:
    org.annolab.tt4j.TreeTaggerWrapper.removeProblematicTokens(TreeTaggerWrapper.java:684)
    org.annolab.tt4j.TreeTaggerWrapper.process(TreeTaggerWrapper.java:557)

When I run my app on my local machine, it works fine.
I only get this error message when I run it on Google Compute Engine using 
Apache Spark.
I already enabled performance mode with treetagger.setPerformanceMode(true)
but still get the same error message.

Duplicate: 
http://stackoverflow.com/questions/27123826/treetaggerwrapper-fails-in-google-compute-engine-with-apache-spark?noredirect=1

Original issue reported on code.google.com by steffen...@web.de on 25 Nov 2014 at 11:24

GoogleCodeExporter commented 9 years ago

Original comment by steffen...@web.de on 25 Nov 2014 at 11:30

Attachments:

stacktrace.txt

GoogleCodeExporter commented 9 years ago

I'd be surprised if the GAE allowed you to run native binaries. Are you sure 
this is allowed?

Original comment by richard.eckart on 25 Nov 2014 at 11:32

GoogleCodeExporter commented 9 years ago

When I try the example from command line (inside GCE)
$ echo 'Hello world!' | cmd/tree-tagger-english-utf8 
(see: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
it works.

Original comment by steffen...@web.de on 25 Nov 2014 at 11:48

GoogleCodeExporter commented 9 years ago

Can you reproduce the original problem or did you just copy the stackoverflow 
report here?

If you can reproduce it, what version of TT4J are you using?

Original comment by richard.eckart on 25 Nov 2014 at 11:50

GoogleCodeExporter commented 9 years ago

I'm the author of the stackoverflow report =)
1.2.0

Original comment by steffen...@web.de on 25 Nov 2014 at 11:56

GoogleCodeExporter commented 9 years ago

I see :) Great!

The problematic line in 1.2.0 is this one:

            boolean isUnicode = "UTF-8".equals(_model.getEncoding().toUpperCase(Locale.US));

I can see potential for an NPE here, but I wonder why it works locally but not 
on the GCE.

Do you provide an encoding for the model?
Does GCE have a problem with Locale.US?

Are you using tt4j directly or within another framework, e.g. in DKPro Core? If 
you are using it directly, you might want to give version 1.2.1 a try which 
offers a way of setting a model without using a model resolver.

Original comment by richard.eckart on 25 Nov 2014 at 12:03
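
To illustrate the NPE potential in the quoted line, here is a minimal standalone sketch. It is not tt4j code: EncodingCheck and both method names are hypothetical, and the model's getEncoding() result is modelled as a plain String parameter. It covers only the case where the encoding is null; in the real line a null _model would fail the same way.

```java
import java.util.Locale;

// Hypothetical helper, not part of tt4j. It only mirrors the shape of the
// expression on the problematic line so the failure mode can be reproduced.
final class EncodingCheck {

    // Same shape as the 1.2.0 line: calling toUpperCase on a null encoding
    // throws a NullPointerException.
    static boolean isUnicodeUnsafe(String encoding) {
        return "UTF-8".equals(encoding.toUpperCase(Locale.US));
    }

    // Null-tolerant variant: a missing encoding is treated as "not UTF-8".
    static boolean isUnicodeSafe(String encoding) {
        return encoding != null && "UTF-8".equals(encoding.toUpperCase(Locale.US));
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeSafe("utf-8"));
        System.out.println(isUnicodeSafe(null));
        try {
            isUnicodeUnsafe(null);
        } catch (NullPointerException e) {
            System.out.println("NPE for a null encoding");
        }
    }
}
```

Whether a missing encoding should silently count as non-Unicode or be reported as a configuration error is a design choice; the sketch picks the former only to keep the example short.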

GoogleCodeExporter commented 9 years ago

My local machine is Win8, the GCE has Debian (so I use different treetagger 
packages). My setup is nearly the same as you provide in your example:

System.setProperty("treetagger.home", "/home/spark/resources/treetagger");
try {
    //tt.setModel("c:/treetagger/lib/german-utf8.par"); //local
    tt.setModel("/home/spark/resources/treetagger/lib/german-utf8.par"); //gce
    tt.setPerformanceMode(true);
    tt.setHandler(new TokenHandler<String>() {
        public void token(String token, String pos, String lemma) {
            output.put(token, lemma.toLowerCase().replace("_", " "));
        }
    });
}

So I use it directly.
I just tried v1.2.1 (with no changes to my source code) and it produces the 
same error. Should I change my setup? How?

Original comment by steffen...@web.de on 25 Nov 2014 at 12:28

GoogleCodeExporter commented 9 years ago

When loading a model, you should specify an encoding. This can be done in two 
ways:

1) 

treetagger.setModel(modelFile.getPath() + ":" + encoding);

2) (works only with 1.2.1+)

DefaultModel model = new DefaultModel(
  modelFile.getPath() + ":" + encoding,
  modelFile, encoding, DefaultModel.DEFAULT_FLUSH_SEQUENCE);
treetagger.setModel(model);

Original comment by richard.eckart on 25 Nov 2014 at 1:07

GoogleCodeExporter commented 9 years ago

I just found out something that I should have tested much earlier:
When I run my app on my local machine, I set an option to run it only on this 
one local machine. When I run it in GCE, I set an option for a "parallel run", 
which means the task is submitted to multiple worker instances so that it can 
be processed in parallel.
Now I set the option for "local run" in GCE, and it succeeded!

Original comment by steffen...@web.de on 25 Nov 2014 at 2:47

GoogleCodeExporter commented 9 years ago

Ok, sounds like this issue can be closed then :)

Original comment by richard.eckart on 25 Nov 2014 at 3:54

Changed state: Invalid

GoogleCodeExporter commented 9 years ago

@Steffen: one more question: what does it mean to set the option for "local 
run", and how do you think that could indirectly trigger the NPE?

Original comment by richard.eckart on 25 Nov 2014 at 5:58

GoogleCodeExporter commented 9 years ago

No, it was my fault all along: some time ago I deleted and re-installed my 
worker instances but didn't install TreeTagger on the new worker instances! I 
forgot about that =/

Original comment by steffen...@web.de on 26 Nov 2014 at 8:32
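
Since the root cause was a missing TreeTagger installation on the workers, a small pre-flight check might have surfaced it with a clearer message than the NPE. The following is a minimal sketch only: TreeTaggerPreflight and check are hypothetical names, not part of tt4j or Spark, and the lib/ subdirectory layout is assumed from the paths quoted in this thread.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check, not part of tt4j or Spark.
final class TreeTaggerPreflight {

    // Returns human-readable problems; an empty list means the home directory
    // and the model file under lib/ were found. It does not check the
    // TreeTagger binary itself.
    static List<String> check(String home, String modelFileName) {
        List<String> problems = new ArrayList<>();
        if (home == null || home.isEmpty()) {
            problems.add("treetagger.home is not set");
            return problems;
        }
        Path root = Paths.get(home);
        if (!Files.isDirectory(root)) {
            problems.add("TreeTagger home is not a directory: " + root);
            return problems;
        }
        // Assumption: models live in <home>/lib, as in the paths used above.
        Path model = root.resolve("lib").resolve(modelFileName);
        if (!Files.isReadable(model)) {
            problems.add("Model file is missing or unreadable: " + model);
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems =
                check(System.getProperty("treetagger.home"), "german-utf8.par");
        if (problems.isEmpty()) {
            System.out.println("TreeTagger layout looks usable on this machine");
        } else {
            problems.forEach(System.err::println);
        }
    }
}
```

To be useful in a cluster, such a check would have to run on each worker (for example at the start of the task that creates the TreeTaggerWrapper), not only on the driver, because the driver's file system says nothing about the workers.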
