Closed: ghozn closed this issue 5 years ago
Which command does run_gecsmt.py fail to run exactly, i.e. what is the last "Run: ..." command displayed? The script is just a wrapper around a bunch of commands, so I would try running the one that fails separately and debugging it.
I guess it's an issue with truecasing. Maybe your downloaded wiki.blm is corrupted, or you don't have enough RAM to load it? The file should be 22284721487 bytes in size and have an md5sum of 2aca82a57645b3a81865776c49353e27.
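One way to run the check suggested above is sketched here in Python. The expected size and digest are the values quoted in the reply; the function names `file_md5` and `check_download` and the path in the usage note are illustrative assumptions, not part of the repository.

```python
import hashlib
import os

# Values quoted in the reply above.
EXPECTED_SIZE = 22284721487
EXPECTED_MD5 = "2aca82a57645b3a81865776c49353e27"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for a 22 GB file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path, expected_size=EXPECTED_SIZE, expected_md5=EXPECTED_MD5):
    """Return (size_ok, md5_ok); the hash is only computed if the size matches."""
    size_ok = os.path.getsize(path) == expected_size
    md5_ok = size_ok and file_md5(path) == expected_md5
    return size_ok, md5_ok
```

Calling `check_download("wikilm/wiki.blm")` (path assumed relative to the repository root) returns a pair of booleans. Checking the size first is cheap and avoids hashing a file that is obviously truncated.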
It might be helpful to look at differences between your output and the provided outputs.
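A rough way to do that comparison is sketched below, assuming both outputs are plain text with one sentence per line in the same order. The function name `compare_outputs` is hypothetical. Counting case-only differences separately is a guess at what matters here, since the failing step is the recaser.

```python
def compare_outputs(mine, reference):
    """Classify aligned line pairs as identical, case-only differences, or other.

    Both arguments are iterables of lines. Lines beyond the length of the
    shorter input are counted as "unaligned".
    """
    mine = [line.rstrip("\n") for line in mine]
    reference = [line.rstrip("\n") for line in reference]
    counts = {"identical": 0, "case_only": 0, "other": 0,
              "unaligned": abs(len(mine) - len(reference))}
    for own, ref in zip(mine, reference):
        if own == ref:
            counts["identical"] += 1
        elif own.lower() == ref.lower():
            # Same words, different capitalization: points at recasing.
            counts["case_only"] += 1
        else:
            counts["other"] += 1
    return counts
```

Passing two open file objects works as well as passing lists of strings. A large `case_only` count would suggest that the recasing step is where the two outputs diverge.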
Thank you for your reply. The command is:
Found LM: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Found WC: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz
Found sparse features
Run: grep '^S' models/conll14st-test-data/noalt/official-2014.combined.m2 | cut -c3- > /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in
Run: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/m2_tok/detokenize.py < /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in | /Users/admin/fhs/smt-baseline/moses/mosesdecoder-master/scripts/tokenizer/tokenizer.perl -threads 8 | /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/case_graph.perl --lm /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm --decode /Users/admin/fhs/smt-baseline/lazy/lazy-master/bin/decode
Tokenizer Version 1.1
Language: en
Number of threads: 8
Using 8 threads
Creating Graphs
Loading /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Recasing
util/file.cc:138 in std::size_t util::PartialRead(int, void *, std::size_t) threw FDException because `ret < 0'.
Invalid argument in fd 3 while reading 21992807322 bytes File: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Done
Run: mv /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok.nowc
Run: perl /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/anottext.pl -f /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz < /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok.nowc > /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok
^CTraceback (most recent call last):
File "models/run_gecsmt.py", line 192, in
It seems the size of my wiki.blm file is different from yours. It's strange that I can run the script and get a result without wiki.blm being read. I will download the wiki package and try again. Thank you very much!
Hi snukky, I downloaded the wiki package again using Chrome and Thunder, but the file size I got is the same as before, 22.28 GB, so I think file corruption can be ruled out. My machine has 16 GB of RAM.
May I ask what the lazy decoder is used for? Will it affect the final result if I don't use it?
It was used for recasing the output. The evaluation is case-sensitive, so it may impact the results.
In the evaluation tool m2scorer_fork, we can use the parameter ignore_whitespace_casing, which ignores differences in capitalization. Will this have the same effect?
The issue is fixed, thanks!
The scores from m2scorer with --ignore_whitespace_casing shouldn't be directly compared with other results that were obtained using the default settings.
The error was fixed by changing the code in lazy. Thank you for your careful reply.
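The thread does not say what was changed in lazy. One plausible reading of the log, stated here as an assumption, is that a single read of about 22 GB exceeded a per-call size limit on the platform, since some systems reject a read request larger than INT_MAX bytes with "Invalid argument". The sketch below illustrates the general workaround of capping each read and looping. It is written in Python for illustration and is not the actual patch; `read_fully` and `MAX_READ` are hypothetical names.

```python
import os

# INT_MAX; assumed to be the per-call limit on the affected platform.
MAX_READ = 2**31 - 1


def read_fully(fd, nbytes, max_chunk=MAX_READ):
    """Read exactly nbytes from fd, never requesting more than max_chunk per call."""
    parts = []
    remaining = nbytes
    while remaining > 0:
        chunk = os.read(fd, min(remaining, max_chunk))
        if not chunk:
            # os.read returns b"" at end of file.
            raise EOFError("file ended with %d bytes still expected" % remaining)
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)
```

The loop also copes with short reads, where the operating system returns fewer bytes than requested.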
Hi, thank you for open-sourcing your fantastic work. I encountered an error while running the script run_gecsmt.py. The error message is as follows:
util/file.cc:138 in std::size_t util::PartialRead(int, void *, std::size_t) threw FDException because `ret < 0'.
Invalid argument in fd 3 while reading 21992807322 bytes File: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Done
I tried running tokenizer.perl on its own to tokenize the data, and that worked to some extent, but the M2 score I got is far from the result in the paper:
Precision : 0.5617
Recall : 0.2371
F_0.5 : 0.4409
I also ran the evaluation script on the sparse output in the 'output' folder and got:
Precision : 0.5854
Recall : 0.2493
F_0.5 : 0.4610
There is a large difference between my result and yours. What am I doing wrong?
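For reference, the F_0.5 values above follow from the reported precision and recall through the standard F-beta formula, which weights precision more heavily than recall when beta is 0.5. The sketch below is a generic implementation, not code taken from m2scorer. Because it starts from precision and recall already rounded to four decimals, the last digit can differ from the scorer's own output.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# The two score pairs reported in the post.
own_run = f_beta(0.5617, 0.2371)
provided_output = f_beta(0.5854, 0.2493)
```

Both values agree with the reported F_0.5 scores to within rounding, so the gap between the two runs comes from the precision and recall themselves rather than from how the score is computed.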