moses-smt / mosesdecoder

Moses, the machine translation system
http://www.statmt.org/moses
GNU Lesser General Public License v2.1

train-model.perl not working out-of-the-box #112

Closed alvations closed 9 years ago

alvations commented 9 years ago

When I tried the following on Ubuntu 14.10:

# Make a test directory
mkdir test-out-of-box
cd test-out-of-box/

# Get Europarl DE-EN corpus
wget http://opus.lingfil.uu.se/Europarl/wordalign/de-en/de -O Europarl.de-en.de
wget http://opus.lingfil.uu.se/Europarl/wordalign/de-en/en -O Europarl.de-en.en

# Download Out-of-the-box pre-compiled training-tools
wget -r --no-parent http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/training-tools/
mv www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/training-tools/ .
rm training-tools/index*
rm -rf www.statmt.org/

# Download `train-model.perl`
wget http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/scripts/training/train-model.perl
wget http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/scripts/training/LexicalTranslationModel.pm

# Test run to ensure the Perl script recognizes the training tools directory
perl train-model.perl -external-bin-dir training-tools/ -mgiza

It throws the error:

Using SCRIPTS_ROOTDIR: /home/alvas/test-out-of-box
Using multi-thread GIZA
ERROR: Cannot find /home/alvas/test-out-of-box/training-tools/merge_alignment.py at train-model.perl line 285

When I tried the full path:

perl train-model.perl -external-bin-dir /home/alvas/test-out-of-box/training-tools/ -mgiza

It throws the same error.

Any clues to why this happens?

For diagnostics, here's the directory structure:

alvas@ubi:~/test-out-of-box$ ls
Europarl.de-en.de  Europarl.de-en.en  LexicalTranslationModel.pm  training-tools  train-model.perl

alvas@ubi:~/test-out-of-box$ cd training-tools/
alvas@ubi:~/test-out-of-box/training-tools$ ls
d4norm  hmmnorm  merge_alignment.py  mgiza  mkcls  plain2snt  snt2cooc  snt2coocrmp  snt2plain  symal

alvas@ubi:~/test-out-of-box/training-tools$ head merge_alignment.py 
#!/usr/bin/env python
# Author : Qin Gao
# Date   : Dec 31, 2007
# Purpose: Combine multiple alignment files into a single one, the files are
#          prodcuced by MGIZA, which has sentence IDs, and every file is 
#          ordered inside

from __future__ import unicode_literals
import sys
import re

alvations commented 9 years ago

This is interesting.

It seems there are some permission problems when I fetch the training tools through wget:

alvas@ubi:~/test-out-of-box/training-tools$ ls -lah *
-rw-rw-r-- 1 alvas alvas 914K Jan 29 16:16 d4norm
-rw-rw-r-- 1 alvas alvas 919K Jan 29 16:16 hmmnorm
-rw-rw-r-- 1 alvas alvas 2.1K Jan 29 16:16 merge_alignment.py
-rw-rw-r-- 1 alvas alvas 1.1M Jan 29 16:16 mgiza
-rw-rw-r-- 1 alvas alvas 336K Jan 29 16:16 mkcls
-rw-rw-r-- 1 alvas alvas  43K Jan 29 16:16 plain2snt
-rw-rw-r-- 1 alvas alvas  38K Jan 29 16:16 snt2cooc
-rw-rw-r-- 1 alvas alvas  29K Jan 29 16:16 snt2coocrmp
-rw-rw-r-- 1 alvas alvas  33K Jan 29 16:16 snt2plain
-rw-rw-r-- 1 alvas alvas  48K Jan 29 16:16 symal

After I did a chmod 777, it works:

alvas@ubi:~/test-out-of-box/training-tools$ ls
d4norm  hmmnorm  merge_alignment.py  mgiza  mkcls  plain2snt  snt2cooc  snt2coocrmp  snt2plain  symal
alvas@ubi:~/test-out-of-box/training-tools$ chmod 777 *
alvas@ubi:~/test-out-of-box/training-tools$ cd ..
alvas@ubi:~/test-out-of-box$ ls
Europarl.de-en.de  Europarl.de-en.en  LexicalTranslationModel.pm  training-tools  train-model.perl
alvas@ubi:~/test-out-of-box$ perl train-model.perl --external-bin-dir training-tools/ --mgiza
Using SCRIPTS_ROOTDIR: /home/alvas/test-out-of-box
Using multi-thread GIZA
using gzip 
ERROR: use --corpus to specify corpus at train-model.perl line 379.

But is there a safer way to change the permissions? What permissions does train-model.perl need? chmod 777 works, but it's a little unsafe.
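
A note on why this happens, as far as the thread shows: HTTP downloads carry no Unix mode bits, so wget creates the files without the execute bit, which matches the -rw-rw-r-- listing above. The "Cannot find" message going away after chmod suggests the script tests for an executable file rather than for mere existence; that is an inference from the behavior here, not something checked against the script source. A minimal sketch for spotting such files before running the script (list_non_executable is a hypothetical helper name, not part of Moses):

```shell
# Hypothetical helper, not part of Moses.
# List regular files in DIR that the current user cannot execute.
# Usage: list_non_executable DIR
list_non_executable() {
    dir=$1
    for f in "$dir"/*; do
        # Skip directories and unmatched globs; report files lacking execute permission.
        if [ -f "$f" ] && [ ! -x "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_non_executable training-tools against the directory listed above would be expected to print every file in it, since none of them has the execute bit set.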

alvations commented 9 years ago

This is digging into the closet a bit, but train-model.perl seems to be behaving weirdly.

When I ran:

perl train-model.perl --root-dir .  --model-dir model --corpus Europarl.de-en --f en --e de  --external-bin-dir "training-tools" --mgiza --parallel --first-step 1 --last-step 3

mkcls and mgiza complete, but when the script tries to stitch the results together, train-model.perl starts to behave weirdly: it looks for moses/bin/symal instead of $_EXTERNAL_BINDIR/symal.

Using SCRIPTS_ROOTDIR: /home/alvas/test-out-of-box
Using multi-thread GIZA
using gzip 
(1) preparing corpus @ Tue May 19 02:05:17 CEST 2015
Executing: mkdir -p /home/alvas/test-out-of-box/corpus
(1.0) selecting factors @ Tue May 19 02:05:17 CEST 2015
Forking...
(1.1) running mkcls  @ Tue May 19 02:05:17 CEST 2015
/home/alvas/test-out-of-box/training-tools/mkcls -c50 -n2 -p/home/alvas/test-out-of-box/Europarl.de-en.en -V/home/alvas/test-out-of-box/corpus/en.vcb.classes opt
  /home/alvas/test-out-of-box/corpus/en.vcb.classes already in place, reusing
(1.2) creating vcb file /home/alvas/test-out-of-box/corpus/en.vcb @ Tue May 19 02:05:17 CEST 2015
(1.1) running mkcls  @ Tue May 19 02:05:17 CEST 2015
/home/alvas/test-out-of-box/training-tools/mkcls -c50 -n2 -p/home/alvas/test-out-of-box/Europarl.de-en.de -V/home/alvas/test-out-of-box/corpus/de.vcb.classes opt
  /home/alvas/test-out-of-box/corpus/de.vcb.classes already in place, reusing
(1.2) creating vcb file /home/alvas/test-out-of-box/corpus/de.vcb @ Tue May 19 02:05:17 CEST 2015
(1.3) numberizing corpus /home/alvas/test-out-of-box/corpus/en-de-int-train.snt @ Tue May 19 02:05:17 CEST 2015
  /home/alvas/test-out-of-box/corpus/en-de-int-train.snt already in place, reusing
(1.3) numberizing corpus /home/alvas/test-out-of-box/corpus/de-en-int-train.snt @ Tue May 19 02:05:17 CEST 2015
  /home/alvas/test-out-of-box/corpus/de-en-int-train.snt already in place, reusing
Waiting for mkcls processes to finish...
(2) running giza @ Tue May 19 02:05:17 CEST 2015
(2.1a) running snt2cooc de-en @ Tue May 19 02:05:17 CEST 2015

Executing: mkdir -p /home/alvas/test-out-of-box/giza.de-en
/home/alvas/test-out-of-box/training-tools/snt2cooc /home/alvas/test-out-of-box/giza.de-en/de-en.cooc /home/alvas/test-out-of-box/corpus/en.vcb /home/alvas/test-out-of-box/corpus/de.vcb /home/alvas/test-out-of-box/corpus/de-en-int-train.snt
Executing: /home/alvas/test-out-of-box/training-tools/snt2cooc /home/alvas/test-out-of-box/giza.de-en/de-en.cooc /home/alvas/test-out-of-box/corpus/en.vcb /home/alvas/test-out-of-box/corpus/de.vcb /home/alvas/test-out-of-box/corpus/de-en-int-train.snt
(2.1a) running snt2cooc en-de @ Tue May 19 02:05:17 CEST 2015

Executing: mkdir -p /home/alvas/test-out-of-box/giza.en-de
/home/alvas/test-out-of-box/training-tools/snt2cooc /home/alvas/test-out-of-box/giza.en-de/en-de.cooc /home/alvas/test-out-of-box/corpus/de.vcb /home/alvas/test-out-of-box/corpus/en.vcb /home/alvas/test-out-of-box/corpus/en-de-int-train.snt
Executing: /home/alvas/test-out-of-box/training-tools/snt2cooc /home/alvas/test-out-of-box/giza.en-de/en-de.cooc /home/alvas/test-out-of-box/corpus/de.vcb /home/alvas/test-out-of-box/corpus/en.vcb /home/alvas/test-out-of-box/corpus/en-de-int-train.snt
END.
END.
(2.1b) running giza de-en @ Tue May 19 02:05:17 CEST 2015
/home/alvas/test-out-of-box/training-tools/mgiza  -CoocurrenceFile /home/alvas/test-out-of-box/giza.de-en/de-en.cooc -c /home/alvas/test-out-of-box/corpus/de-en-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 4 -nodumps 1 -nsmooth 4 -o /home/alvas/test-out-of-box/giza.de-en/de-en -onlyaldumps 1 -p0 0.999 -s /home/alvas/test-out-of-box/corpus/en.vcb -t /home/alvas/test-out-of-box/corpus/de.vcb
  /home/alvas/test-out-of-box/giza.de-en/de-en.A3.final.gz seems finished, reusing.
Waiting for second GIZA process...
(2.1b) running giza en-de @ Tue May 19 02:05:17 CEST 2015
/home/alvas/test-out-of-box/training-tools/mgiza  -CoocurrenceFile /home/alvas/test-out-of-box/giza.en-de/en-de.cooc -c /home/alvas/test-out-of-box/corpus/en-de-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 4 -nodumps 1 -nsmooth 4 -o /home/alvas/test-out-of-box/giza.en-de/en-de -onlyaldumps 1 -p0 0.999 -s /home/alvas/test-out-of-box/corpus/de.vcb -t /home/alvas/test-out-of-box/corpus/en.vcb
  /home/alvas/test-out-of-box/giza.en-de/en-de.A3.final.gz seems finished, reusing.
(3) generate word alignment @ Tue May 19 02:05:17 CEST 2015
Combining forward and inverted alignment from files:
  /home/alvas/test-out-of-box/giza.en-de/en-de.A3.final.{bz2,gz}
  /home/alvas/test-out-of-box/giza.de-en/de-en.A3.final.{bz2,gz}
Executing: mkdir -p /home/alvas/test-out-of-box/model
Executing: /home/alvas/test-out-of-box/training/giza2bal.pl -d "gzip -cd /home/alvas/test-out-of-box/giza.de-en/de-en.A3.final.gz" -i "gzip -cd /home/alvas/test-out-of-box/giza.en-de/en-de.A3.final.gz" |/home/alvas/test-out-of-box/../bin/symal -alignment="grow" -diagonal="yes" -final="yes" -both="no" > /home/alvas/test-out-of-box/model/aligned.grow-diag-final
sh: 1: /home/alvas/test-out-of-box/training/giza2bal.pl: not found
sh: 1: /home/alvas/test-out-of-box/../bin/symal: not found
Exit code: 127
ERROR: Can't generate symmetrized alignment file

Also, $SCRIPTS_ROOTDIR seems to control where train-model.perl finds the complementary scripts. This is unavoidable unless $SCRIPTS_ROOTDIR is made customizable, but that would lead to a whole lot of other problems.
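
The two paths that fail in the log above are both built from $SCRIPTS_ROOTDIR: training/giza2bal.pl below it and ../bin/symal beside it. A rough pre-flight check, assuming those are the only two helpers needed at this step (the log shows no others; check_moses_layout is a hypothetical name):

```shell
# Hypothetical helper, not part of Moses.
# Report which of the two helper paths seen in the log are missing or not executable.
# Usage: check_moses_layout SCRIPTS_ROOTDIR
check_moses_layout() {
    root=$1
    rc=0
    for p in "$root/training/giza2bal.pl" "$root/../bin/symal"; do
        if [ ! -x "$p" ]; then
            printf 'missing or not executable: %s\n' "$p"
            rc=1
        fi
    done
    return "$rc"
}
```

The function returns non-zero when either path is absent, so it can guard a training run, for example check_moses_layout /home/alvas/test-out-of-box && perl train-model.perl followed by the usual options.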

alvations commented 9 years ago

Solution: Use Moses scripts as they are compiled and installed normally.

Enlightenment: Training scripts don't work out of the box.

For more info: https://github.com/alvations/usaarhat-repo/blob/master/Align-A-Line.md

jtv commented 9 years ago

Use the ‘x’ permission bit on anything that you want to be able to execute. Strictly speaking if it's in your home directory you probably only need that permission for the file's owner (you), but the usual and simple thing is to allow it for all users.

So, to permit execution of a file, do:

chmod a+x $MYFILE
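
Applied to the training-tools directory from earlier in the thread, the owner-only variant of this advice could look like the sketch below. It uses u+x instead of a+x, which is narrower than both a+x and 777; make_owner_executable is a hypothetical helper name.

```shell
# Hypothetical helper, not part of Moses.
# Add the execute bit for the owner only, on every regular file in DIR.
# Usage: make_owner_executable DIR
make_owner_executable() {
    dir=$1
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            # Leave group and other permissions untouched.
            chmod u+x "$f" || return 1
        fi
    done
}
```

For example, make_owner_executable training-tools would turn the -rw-rw-r-- entries shown earlier into -rwxrw-r--.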

alvations commented 9 years ago

@jtv, thanks for the chmod permission solution!!! But the problem that comes after the permissions are fixed is a little harder to resolve, because it's closely tied to the pseudo-static path that train-model.perl tries to use.

goodmami commented 8 years ago

Sorry to jump into a closed thread, but I'm having a similar issue and I'm not sure why this was closed. train-model.perl is failing to find symal because it's looking for "$SCRIPTS_ROOTDIR/../bin/symal" and not "$_EXTERNAL_BINDIR/symal" or even the symal in the Moses bin dir (which for me is not a sibling of $SCRIPTS_ROOTDIR). Here's the offending line: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/train-model.perl#L466

I'm using EMS, and here are the relevant paths from my config file:

moses-src-dir = /NLP_TOOLS/mt_tools/moses/v3.0-release
moses-bin-dir = $moses-src-dir/bin
moses-script-dir = $moses-src-dir/src/scripts
external-bin-dir = /NLP_TOOLS/mt_tools/mgizapp/latest/bin

Note that while the bin-dir is under $moses-src-dir/bin, the script-dir is one level lower ($moses-src-dir/src/scripts). This install is on my university's cluster and I don't have permissions to move things around.

Why does train-model.perl assume the bin-dir is a sibling to the script-dir when it has both the $moses-bin-dir and $external-bin-dir variables available?
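
To make the mismatch concrete, here is a small sketch of the path arithmetic. It assumes $SCRIPTS_ROOTDIR ends up equal to the configured moses-script-dir, which the thread does not confirm, and it is purely lexical (no symlinks resolved). implied_symal_path is a hypothetical name.

```shell
# Hypothetical helper, not part of Moses.
# Print the symal path implied by "$SCRIPTS_ROOTDIR/../bin/symal" for a given
# script directory. Purely lexical: symlinks are not resolved.
# Usage: implied_symal_path SCRIPT_DIR
implied_symal_path() {
    printf '%s/bin/symal\n' "$(dirname "$1")"
}
```

With the config above, implied_symal_path /NLP_TOOLS/mt_tools/moses/v3.0-release/src/scripts prints /NLP_TOOLS/mt_tools/moses/v3.0-release/src/bin/symal, whereas moses-bin-dir points at /NLP_TOOLS/mt_tools/moses/v3.0-release/bin.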

alvations commented 8 years ago

@goodmami Last year, I ended up modifying the paths in train-model.perl to suit my machine. I replaced all the paths to the binaries and to the other Perl scripts with static paths.

The line my $SYMAL = "$SCRIPTS_ROOTDIR/../bin/symal"; assumes that Moses is installed as per the instructions at http://www.statmt.org/moses/?n=Development.GetStarted, so that the Moses directory looks like this:

alvas@ubi:~$ cd mosesdecoder/
alvas@ubi:~/mosesdecoder$ ls
biconcor                defer      mert       OnDiskPt            scripts
bin                     doc        mingw      phrase-extract      search
bjam                    jam-files  mira       previous.sh         symal
BUILD-INSTRUCTIONS.txt  Jamroot    misc       regression-testing  util
contrib                 lib        moses      sample-models       vw
cruise-control          lm         moses-cmd  sample-models.tgz
alvas@ubi:~/mosesdecoder$ cd scripts/
alvas@ubi:~/mosesdecoder/scripts$ ls
analysis     generic  other    regression-testing  tests      Transliteration
ems          Jamfile  README   server              tokenizer
fuzzy-match  OSM      recaser  share               training
alvas@ubi:~/mosesdecoder/scripts$ cd ../bin
alvas@ubi:~/mosesdecoder/bin$ ls
1-1-Extraction        filter                    processLexicalTable
biconcor              fragment                  processPhraseTable
build_binary          generateSequences         project-cache.jam
config.log            kbmira                    prunePhraseTable
consolidate           lexical-reordering-score  query
consolidate-direct    lmbrgrid                  queryLexicalTable
consolidate-reverse   lmplz                     queryOnDiskPt
CreateOnDiskPt        merge-sorted              queryPhraseTable
dump_counts           mert                      relax-parse
evaluator             mira                      score
extract               moses                     sentence-bleu
extract-ghkm          moses_chart               statistics
extract-lex           pcfg-extract              symal
extract-mixed-syntax  pcfg-score                TMining
extractor             phrase-lookup
extract-rules         pro

alvations commented 8 years ago

The TL;DR way would be something like:

cd /path/to/
wget http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/linux-64bit.tgz
tar zxvf linux-64bit.tgz
mv linux-64bit mosesdecoder
chmod a+x -R mosesdecoder

Since the scripts and EMS should not use the source directly, you can do this in the config file:

moses-src-dir = /path/to/mosesdecoder
moses-bin-dir = $moses-src-dir/bin
moses-script-dir = $moses-src-dir/scripts
external-bin-dir = $moses-src-dir/training-tools

goodmami commented 8 years ago

Thanks @alvations. I didn't install it myself, and our sysadmin claims to have followed the normal install. According to the link you provided (http://www.statmt.org/moses/?n=Development.GetStarted) (emphasis added):

--install-scripts=/path/to/scripts copies scripts into a directory. Does not install if missing. No argument defaults to PREFIX/scripts.

Since the directory didn't exist as a sibling to the bindir, I'm guessing he didn't provide the --install-scripts option, which in the installation instructions is under "Popular additional bjam options" and not the "easy setup" heading. Even if the option is used, it's possible to provide a path that isn't the default, in which case the train-model.perl script would still fail because of the directory location assumption.

Anyway, we fixed that problem by symlinking the scripts directory at the expected location, but my original question still stands (emphasis added):

Why does train-model.perl assume the bin-dir is a sibling to the script-dir when it has both the $moses-bin-dir and $external-bin-dir variables available?
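
For anyone else hitting this, the symlink workaround mentioned above could be sketched as follows. The exact link created on the cluster is not shown in this thread; this version assumes a link named scripts placed next to the bin directory, and link_scripts_beside_bin is a hypothetical name.

```shell
# Hypothetical helper, not part of Moses.
# Create a "scripts" symlink next to BIN_DIR that points at the real script directory.
# Pass absolute paths; a relative SCRIPT_DIR would be resolved relative to the link.
# Usage: link_scripts_beside_bin SCRIPT_DIR BIN_DIR
link_scripts_beside_bin() {
    script_dir=$1
    bin_dir=$2
    link="$(dirname "$bin_dir")/scripts"
    # Refuse to overwrite anything that already exists at the link location.
    if [ -e "$link" ] || [ -L "$link" ]; then
        printf 'already exists: %s\n' "$link" >&2
        return 1
    fi
    ln -s "$script_dir" "$link"
}
```

This only papers over the directory assumption; it needs write access to the parent of the bin directory, which on a shared cluster usually means asking the sysadmin.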

I think this hardcoding of the path assumption is a bug. I'd be happy to submit a PR, but I'm not sure what to fix. Maybe I'd just need to change whatever calls train-model.perl to provide the appropriate command-line options, but maybe I'd also need to change train-model.perl to actually use them?

Thanks!