joshua-decoder / joshua

Joshua Statistical Machine Translation Toolkit
http://joshua-decoder.org/

message about using `--lm-gen berkeleylm` is incomplete #154

Closed. bethard closed this issue 9 years ago

bethard commented 10 years ago

If you run examples/pipeline/run.sh on OS X, you'll get the following error and message:

* FATAL: /Users/bethard/Downloads/joshua-v5.0/bin/lmplz (for building LMs) does not exist.
  If you are on OS X, you need to use either SRILM (recommended) or BerkeleyLM,
  triggered with '--lm-gen srilm' or '--lm-gen berkeleylm'. If you are on Linux,
  you should run "ant -f $JOSHUA/build.xml kenlm".

I believe the message about --lm-gen is incomplete. When I added only --lm-gen berkeleylm to the run.sh script, I got the following error:

[berkeleylm] rebuilding...
  dep=/Users/bethard/Downloads/joshua-v5.0/examples/pipeline/1/data/train/corpus.en [CHANGED]
  dep=lm.gz [NOT FOUND]
  cmd=java -ea -mx2g -server -cp /Users/bethard/Downloads/joshua-v5.0/lib/berkeleylm.jar edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 5 lm.gz /Users/bethard/Downloads/joshua-v5.0/examples/pipeline/1/data/train/corpus.en
  took 2 seconds (2s)
[compile-kenlm] rebuilding...
  dep=lm.gz [CHANGED]
  dep=lm.kenlm [NOT FOUND]
  cmd=/Users/bethard/Downloads/joshua-v5.0/src/joshua/decoder/ff/lm/kenlm/build_binary lm.gz lm.kenlm
  JOB FAILED (return code 127)
/bin/bash: /Users/bethard/Downloads/joshua-v5.0/src/joshua/decoder/ff/lm/kenlm/build_binary: No such file or directory

I had to add --lm berkeleylm in addition to --lm-gen berkeleylm to get that script to run to completion.

mjpost commented 10 years ago

If build_binary didn't exist, it looks like you needed to run "ant kenlm". build_binary is KenLM's tool for compiling (optionally compressed) ARPA-style LMs into KenLM's compiled format.

bethard commented 10 years ago

This was using the pipeline script, so presumably, if it needs to run "ant kenlm" it should have figured that out itself?

But all I'm really asking for here is a better error message. Instead of "If you are on OS X, you need to use ... BerkeleyLM, triggered with ... '--lm-gen berkeleylm'", it should say something like "If you are on OS X, you need to use ... BerkeleyLM, triggered with ... '--lm-gen berkeleylm --lm berkeleylm'."

Or are you saying that I shouldn't need to supply "--lm berkeleylm" and there's something wrong with the pipeline script?

mjpost commented 10 years ago

--lm-gen determines what code is used to build the ARPA LM file; it defaults to KenLM's "lmplz" tool, which doesn't build on OS X. BerkeleyLM also has a tool, but I don't recommend its use because of the heuristics it uses for smoothing. If lmplz is not available, I suggest SRILM instead.

--lm determines the toolkit used to represent LM state in the decoder. It also defaults to KenLM, which then uses the "build_binary" tool to compile the ARPA LM. That tool should compile on all platforms, so if it is missing, something went wrong with your build (for example, you didn't run "ant kenlm"). BerkeleyLM has its own tool for compiling LMs. KenLM is recommended because it supports left-state minimization, which results in slightly more efficient search. Apart from that, the two are equivalent.
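
To make the split between the two flags concrete, here is a rough Python sketch. It is an illustration only, not code from pipeline.pl: the function name required_tools, the two table names, and the use of "kenlm" as the spelling of the default value are all assumptions. ngram-count is SRILM's standard LM-building tool, but this thread doesn't say how the pipeline invokes it.

```python
# Illustrative model of the two options described above.
# Hypothetical names throughout; this is NOT taken from pipeline.pl.

# --lm-gen: which tool builds the ARPA LM file.
ARPA_BUILDERS = {
    "kenlm": "lmplz",  # default; "kenlm" as the option value is an assumption
    "srilm": "ngram-count",  # SRILM's LM-building tool
    "berkeleylm": "edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText",
}

# --lm: which toolkit represents LM state in the decoder, and therefore
# which tool compiles the ARPA file afterwards.
LM_COMPILERS = {
    "kenlm": "build_binary",  # default
    "berkeleylm": "BerkeleyLM's own compiler (not named in this thread)",
}


def required_tools(lm_gen="kenlm", lm="kenlm"):
    """Return (ARPA builder, LM compiler) for a given pair of option values."""
    return ARPA_BUILDERS[lm_gen], LM_COMPILERS[lm]


if __name__ == "__main__":
    # Changing only --lm-gen still leaves build_binary as a requirement.
    print(required_tools(lm_gen="berkeleylm"))
    # Changing both removes the dependency on KenLM's tools.
    print(required_tools(lm_gen="berkeleylm", lm="berkeleylm"))
```

Under this model, passing only --lm-gen berkeleylm swaps out lmplz but still requires build_binary, which matches the failure in the log above.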

bethard commented 10 years ago

If adding "--lm berkeleylm" is not the preferred solution on OS X and "ant kenlm" is the preferred solution, then somewhere, in one of the error messages I posted above, it should direct you to run "ant kenlm" on OS X. I'm really not particular as to the solution; I just want an error message that gives better guidance for solving the problem.

mjpost commented 10 years ago

Agreed. We'd very happily accept a pull request with a fix.

bethard commented 10 years ago

So presumably if the fix is "ant kenlm", the place that the error message should be added is:

[compile-kenlm] rebuilding...
  dep=lm.gz [CHANGED]
  dep=lm.kenlm [NOT FOUND]
  cmd=/Users/bethard/Downloads/joshua-v5.0/src/joshua/decoder/ff/lm/kenlm/build_binary lm.gz lm.kenlm
  JOB FAILED (return code 127)

Is that right?

If you can point me to roughly where in the code this message is generated, I can probably provide a pull request that improves the error message.

mjpost commented 10 years ago

Are you using the development or packaged version of Joshua? If devel, you would have had to type "ant devel" (which calls "ant kenlm") to compile the support libs. I don't think there's a need to put a warning about that in the pipeline.

bethard commented 10 years ago

I was using the packaged one. Are you saying this is already a non-issue in trunk? If so, feel free to close. If not, feel free to point me to roughly where a fix might belong.

bethard commented 10 years ago

(Sorry, should have been clearer. I'm happy to clone the current repository and create a fix if it's still necessary, and someone can point me in the right direction.)

mjpost commented 10 years ago

The packaged one occasionally fails to build the KenLM libraries, despite that being a dependency of the "all" target. That would be the thing to fix: figure out why typing "ant" or "ant all" sometimes fails to build KenLM. That should just be a matter of a small change to build.xml.

mjpost commented 10 years ago

You just need to figure out why KenLM didn't build, fix it, and issue a pull request from and against the "release" branch. Or were you asking something else?

bethard commented 10 years ago

Something else. Given that it's possible that KenLM isn't built sometimes, I think the error message in pipeline.pl should indicate that the problem might be a failed KenLM build. For example, if instead of:

[compile-kenlm] rebuilding...
  dep=lm.gz [CHANGED]
  dep=lm.kenlm [NOT FOUND]
  cmd=/Users/bethard/Downloads/joshua-v5.0/src/joshua/decoder/ff/lm/kenlm/build_binary lm.gz lm.kenlm
  JOB FAILED (return code 127)

It had said:

[compile-kenlm] rebuilding...
  dep=lm.gz [CHANGED]
  dep=lm.kenlm [NOT FOUND] (KenLM may not have been built. Try running "ant kenlm".)
  cmd=/Users/bethard/Downloads/joshua-v5.0/src/joshua/decoder/ff/lm/kenlm/build_binary lm.gz lm.kenlm
  JOB FAILED (return code 127)

Then the solution would have been more obvious.
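
A check along these lines could produce that kind of hint. This is a minimal Python sketch under stated assumptions, not the pipeline's actual code (pipeline.pl is a Perl script, and the thread doesn't show where the message is generated). The function name missing_tool_hint is hypothetical; the hint text is the wording proposed above.

```python
import os


def missing_tool_hint(cmd_path, return_code):
    """Return an extra hint when a job failed because its executable is absent.

    Hypothetical helper for illustration; not part of Joshua.
    """
    # 127 is the exit status a shell reports when the command is not found.
    if return_code == 127 and not os.path.isfile(cmd_path):
        if os.path.basename(cmd_path) == "build_binary":
            return 'KenLM may not have been built. Try running "ant kenlm".'
        return f"{cmd_path} does not exist."
    # Any other failure (or success) gets no extra hint.
    return None


if __name__ == "__main__":
    hint = missing_tool_hint("/no/such/dir/build_binary", 127)
    if hint:
        print(f"  JOB FAILED (return code 127): {hint}")
```

Keying the hint on the missing executable rather than on the [NOT FOUND] dependency keeps it from firing on an ordinary first build, where lm.kenlm is expected to be absent.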

In general, one of the things we've struggled with in trying to use Joshua is the error messages not giving enough detail to help us figure out what we've done wrong. So this seemed like a place where we could improve the error message.

That's not to say that it wouldn't be useful to dig into any problems with KenLM not building, but my goal here is just to improve error messages.

mjpost commented 9 years ago

The build system has been changed a bit, including fixes for compiling the KenLM utils on OS X. I'm going to mark this as fixed in Joshua 6. If you find that the problem still exists, I'll take a closer look, this time with an eye towards fixing it.