modernmt / modernmt

Neural Adaptive Machine Translation that adapts to context and learns from corrections.
http://www.modernmt.eu/
Apache License 2.0

Training error for en-zh #193

Closed NickRuiz closed 7 years ago

NickRuiz commented 7 years ago

I tried a simple en-zh (English-Mandarin) training run using a small snippet of TED talk training data. Running ./mmt create fails with the error shown below. The /home/interact/mmt/engines/enzh_tmp/models/vocabulary folder is never created, which might be the cause. Attachment: enzh_tmp.zip

$ ./mmt create en zh ~/mt-experiments/data/train/enzh_tmp -e enzh_tmp

=========== TRAINING STARTED ===========

ENGINE:  enzh_tmp
BILINGUAL CORPORA: 1 documents
MONOLINGUAL CORPORA: 0 documents
LANGS:   en > zh

INFO: (1 of 7) TMs clean-up...                                         DONE (in 1s)
INFO: (2 of 7) Corpora preprocessing...                                DONE (in 4s)
ERROR Unexpected exception:
    Command 'java -cp /home/interact/mmt/build/mmt-0.14.jar -Dmmt.home=/home/interact/mmt -Djava.library.path=/home/interact/mmt/build/lib eu.modernmt.cli.TrainingPipelineMain -s en -t zh -v /home/interact/mmt/engines/enzh_tmp/models/vocabulary --output /home/interact/mmt/runtime/enzh_tmp/tmp/training/preprocessed --input /home/interact/mmt/runtime/enzh_tmp/tmp/training/training_corpora --dev /home/interact/mmt/engines/enzh_tmp/data/dev --test /home/interact/mmt/engines/enzh_tmp/data/test' failed with exit code 1
Traceback (most recent call last):
  File "./mmt", line 583, in <module>
    main()
  File "./mmt", line 561, in main
    actions[command](args)
  File "./mmt", line 383, in main_create
    debug=args.debug, steps=args.training_steps, split_trainingset=args.split_corpora)
  File "/home/interact/mmt/cli/engine.py", line 209, in build
    (self._engine.data_path if split_trainingset else None)
  File "/home/interact/mmt/cli/mt/processing.py", line 142, in process
    shell.execute(command, stdin=shell.DEVNULL, stdout=shell.DEVNULL, stderr=shell.DEVNULL)
  File "/home/interact/mmt/cli/libs/shell.py", line 55, in execute
    raise ShellError(str_cmd, returncode, stderr_dump)
cli.libs.shell.ShellError: Command 'java -cp /home/interact/mmt/build/mmt-0.14.jar -Dmmt.home=/home/interact/mmt -Djava.library.path=/home/interact/mmt/build/lib eu.modernmt.cli.TrainingPipelineMain -s en -t zh -v /home/interact/mmt/engines/enzh_tmp/models/vocabulary --output /home/interact/mmt/runtime/enzh_tmp/tmp/training/preprocessed --input /home/interact/mmt/runtime/enzh_tmp/tmp/training/training_corpora --dev /home/interact/mmt/engines/enzh_tmp/data/dev --test /home/interact/mmt/engines/enzh_tmp/data/test' failed with exit code 1

$ java -cp /home/interact/mmt/build/mmt-0.14.jar -Dmmt.home=/home/interact/mmt -Djava.library.path=/home/interact/mmt/build/lib eu.modernmt.cli.TrainingPipelineMain -s en -t zh -v /home/interact/mmt/engines/enzh_tmp/models/vocabulary --output /home/interact/mmt/runtime/enzh_tmp/tmp/training/preprocessed --input /home/interact/mmt/runtime/enzh_tmp/tmp/training/training_corpora --dev /home/interact/mmt/engines/enzh_tmp/data/dev --test /home/interact/mmt/engines/enzh_tmp/data/test
Exception in thread "main" java.lang.IllegalArgumentException: Parameter 'directory' is not a directory
    at org.apache.commons.io.FileUtils.validateListFilesParameters(FileUtils.java:545)
    at org.apache.commons.io.FileUtils.listFiles(FileUtils.java:521)
    at eu.modernmt.model.corpus.Corpora.list(Corpora.java:42)
    at eu.modernmt.cli.TrainingPipelineMain.main(TrainingPipelineMain.java:77)

$ ls ~/mmt/engines/enzh_tmp/models
db

$ ls -a ~/mmt/runtime/enzh_tmp
.  ..  tmp
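
The Java exception only says that some 'directory' parameter is not a directory, so one way to narrow it down is to test every path passed to the command. The following is a minimal sketch, not part of mmt: the check_dirs helper name is made up for illustration, and it assumes nothing beyond a POSIX shell.

```shell
#!/bin/sh
# check_dirs is a hypothetical helper (not part of mmt).
# It prints one line per path, marking whether the path is an existing
# directory, and returns non-zero if at least one path is missing.
check_dirs() {
    missing=0
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "ok       $dir"
        else
            echo "MISSING  $dir"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called with the paths from the failing command, for example check_dirs ~/mmt/engines/enzh_tmp/models/vocabulary ~/mmt/runtime/enzh_tmp/tmp/training/training_corpora ~/mmt/engines/enzh_tmp/data/dev ~/mmt/engines/enzh_tmp/data/test. Some of these paths may be outputs that the pipeline is expected to create itself, so a MISSING line is only a hint about where to look.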

I can train the en-it and it-en examples with no issues.

$ ./mmt create it en examples/data/train -e iten_example

=========== TRAINING STARTED ===========

ENGINE:  iten_example
BILINGUAL CORPORA: 3 documents
MONOLINGUAL CORPORA: 0 documents
LANGS:   it > en

INFO: (1 of 7) TMs clean-up...                                         DONE (in 0s)
INFO: (2 of 7) Corpora preprocessing...                                DONE (in 3s)
INFO: (3 of 7) Context Analyzer training...                            DONE (in 1s)
INFO: (4 of 7) Aligner training...                                     DONE (in 2s)
INFO: (5 of 7) Translation Model training...                           DONE (in 2s)
INFO: (6 of 7) Language Model training...                              DONE (in 2s)
INFO: (7 of 7) Writing config files...                                 DONE (in 0s)

=========== TRAINING SUCCESS ===========

You can now start, stop or check the status of the server with command:
    ./mmt start|stop|status -e iten_example

$ ls ~/mmt/engines/iten_example/models
align  context  db  lm  moses.ini  sapt  vocabulary

NickRuiz commented 7 years ago

I noticed that I had some blank lines, so I removed them with clean-corpus-n.perl:

$ ~/mmt/vendor/moses/scripts/training/clean-corpus-n.perl TED.train en zh TED.train.clean 1 100000

However, the error persists.

NickRuiz commented 7 years ago

I also created a dummy project in which I copied one line from the English side and changed the extension, something like this:

$ head -1 tmp.en > tmp.es

This trains fine. But if I rename the file to the zh extension (mv tmp.es tmp.zh) and try to train with the zh language, the problem occurs. It seems that some step in the pipeline doesn't handle zh.
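
The dummy-project experiment can be scripted so that the only thing that changes between runs is the target language code. This is a rough sketch under stated assumptions: make_dummy_corpus is a hypothetical helper, not an mmt command, and the file and directory names are placeholders.

```shell
#!/bin/sh
# make_dummy_corpus is a hypothetical helper (not part of mmt).
# It builds a one-line "parallel" corpus by copying the first line of an
# English file to both tmp.en and tmp.<target language> in out_dir.
make_dummy_corpus() {
    src_file=$1   # existing English file, e.g. tmp.en
    tgt_lang=$2   # target language code, e.g. es or zh
    out_dir=$3    # directory that will hold the dummy corpus
    mkdir -p "$out_dir"
    head -n 1 "$src_file" > "$out_dir/tmp.en"
    head -n 1 "$src_file" > "$out_dir/tmp.$tgt_lang"
}
```

For example, make_dummy_corpus TED.train.en es dummy_es followed by ./mmt create en es dummy_es -e dummy_es, and then the same two steps with zh in place of es. If only the zh run fails, the corpus content can be ruled out as the cause.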

davidecaroselli commented 7 years ago

Hi @NickRuiz

I've just published a fix in the master branch. Could you confirm that it also solves the issue in your installation?

Thanks!