grammatical / baselines-emnlp2016

Baseline models, training scripts, and instructions for reproducing the results of our state-of-the-art grammatical error correction system from M. Junczys-Dowmunt and R. Grundkiewicz, "Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction", EMNLP 2016.
MIT License

FDException while reading wikilm/wiki.blm #8

Closed. ghozn closed this issue 5 years ago.

ghozn commented 5 years ago

Hi, thank you for open-sourcing your fantastic work. I encountered an error while running the script run_gecsmt.py. The error message is as follows:

```
util/file.cc:138 in std::size_t util::PartialRead(int, void *, std::size_t) threw FDException because `ret < 0'.
Invalid argument in fd 3 while reading 21992807322 bytes
File: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Done
```

I tried running the script tokenizer.perl individually to tokenize the data, and it worked to a certain degree. But the M2 score I got is far from the result in the paper:

```
Precision : 0.5617
Recall    : 0.2371
F_0.5     : 0.4409
```

At the same time, I ran the evaluation script on the sparse output provided in the folder 'output' and got:

```
Precision : 0.5854
Recall    : 0.2493
F_0.5     : 0.4610
```

There is a huge difference between my result and yours. Where is my problem?

snukky commented 5 years ago

Which command does run_gecsmt.py fail to run exactly, i.e. what is the last "Run: ..." command displayed? The script is just a wrapper around a bunch of commands; I would try to run the failing one separately and debug it.

I guess it's an issue with truecasing. Maybe your downloaded wiki.blm is corrupted, or you don't have enough RAM to load it? The file should be 22284721487 bytes and have an md5sum of 2aca82a57645b3a81865776c49353e27.
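
A quick way to check both at once; a minimal sketch in Python (the path is the one from this thread, adjust it to your setup):

```python
# Verify the size and MD5 checksum of the downloaded language model.
import hashlib
import os

path = "/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm"

print("size:", os.path.getsize(path))  # expected: 22284721487

md5 = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        md5.update(chunk)
print("md5:", md5.hexdigest())  # expected: 2aca82a57645b3a81865776c49353e27
```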

It might be helpful to look at differences between your output and the provided outputs.

ghozn commented 5 years ago

Thank you for your reply. Here is the full log:

```
Found LM: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Found WC: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz
Found sparse features
Run: grep '^S' models/conll14st-test-data/noalt/official-2014.combined.m2 | cut -c3- > /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in
Run: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/m2_tok/detokenize.py < /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in | /Users/admin/fhs/smt-baseline/moses/mosesdecoder-master/scripts/tokenizer/tokenizer.perl -threads 8 | /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/case_graph.perl --lm /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm --decode /Users/admin/fhs/smt-baseline/lazy/lazy-master/bin/decode
Tokenizer Version 1.1
Language: en
Number of threads: 8
Using 8 threads
Creating Graphs
Loading /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Recasing
util/file.cc:138 in std::size_t util::PartialRead(int, void *, std::size_t) threw FDException because `ret < 0'.
Invalid argument in fd 3 while reading 21992807322 bytes
File: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Done
Run: mv /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok.nowc
Run: perl /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/anottext.pl -f /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz < /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok.nowc > /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok
^CTraceback (most recent call last):
  File "models/run_gecsmt.py", line 192, in <module>
    main()
  File "models/run_gecsmt.py", line 50, in main
    .format(scripts=args.scripts, wc=WC, pfx=prefix))
KeyboardInterrupt
(fhs) GHOZNFAN-MC0:baselines-emnlp2016-master admin$ python models/run_gecsmt.py -f models/moses.sparse.mert.avg.ini -i models/conll14st-test-data/noalt/official-2014.combined.m2 --moses /Users/admin/fhs/smt-baseline/moses/mosesdecoder-master --scripts /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts -w /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/ --lazy /Users/admin/fhs/smt-baseline/lazy/lazy-master --m2
Found LM: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Found WC: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz
Found sparse features
Run: grep '^S' models/conll14st-test-data/noalt/official-2014.combined.m2 | cut -c3- > /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in
Run: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/m2_tok/detokenize.py < /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in | /Users/admin/fhs/smt-baseline/moses/mosesdecoder-master/scripts/tokenizer/tokenizer.perl -threads 8 | /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/case_graph.perl --lm /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm --decode /Users/admin/fhs/smt-baseline/lazy/lazy-master/bin/decode
Tokenizer Version 1.1
Language: en
Number of threads: 8
Using 8 threads
Creating Graphs
Loading /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Recasing
util/file.cc:138 in std::size_t util::PartialRead(int, void *, std::size_t) threw FDException because `ret < 0'.
Invalid argument in fd 3 while reading 21992807322 bytes
File: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm
Done
Run: mv /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok.nowc
Run: perl /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/train/scripts/anottext.pl -f /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz < /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok.nowc > /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok
Run: /Users/admin/fhs/smt-baseline/moses/mosesdecoder-master/bin/moses -f models/moses.sparse.mert.avg.ini --alignment-output-file /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.out.tok.aln -threads 8 -fd '|' < /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.in.tok > /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.out.tok
Defined parameters (per moses.ini or switch):
	alignment-output-file: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/workdir/official-2014.combined.out.tok.aln
	config: models/moses.sparse.mert.avg.ini
	distortion-limit: 1
	factor-delimiter: |
	feature: CorrectionPattern factor=0 context=1 context-factor=1
	         CorrectionPattern factor=1
	         OpSequenceModel path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/data/osm.kenlm input-factor=0 output-factor=0 support-features=no num-features=1
	         EditOps scores=dis
	         Generation name=Generation0 num-features=0 input-factor=0 output-factor=1 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz
	         UnknownWordPenalty
	         WordPenalty
	         PhrasePenalty
	         PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/data/phrase-table.0-0.gz input-factor=0 output-factor=0
	         KENLM lazyken=0 name=LM0 factor=0 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/data/lm.cor.kenlm order=5
	         KENLM lazyken=0 name=LM1 factor=0 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm order=5
	         KENLM lazyken=0 name=LM2 factor=1 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.wclm.kenlm order=9
	input-factors: 0 1
	mapping: 0 T 0 0 G 0
	search-algorithm: 1
	threads: 8
	weight: OpSequenceModel0= 0.055870166 EditOps0= 0.095882946 0.060318618 0.246521405 UnknownWordPenalty0= 0.000000000 WordPenalty0= 0.035565091 PhrasePenalty0= 0.215996473 TranslationModel0= 0.054458182 0.075304854 0.047920170 -0.002638787 LM0= 0.029492606 LM1= 0.058176788 LM2= 0.021853913
	weight-file: /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/sparse/moses.wiki.sparse
line=CorrectionPattern factor=0 context=1 context-factor=1
Initializing correction pattern feature..
FeatureFunction: CorrectionPattern0 start: 0 end: 18446744073709551615
line=CorrectionPattern factor=1
Initializing correction pattern feature..
FeatureFunction: CorrectionPattern1 start: 0 end: 18446744073709551615
line=OpSequenceModel path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/data/osm.kenlm input-factor=0 output-factor=0 support-features=no num-features=1
FeatureFunction: OpSequenceModel0 start: 0 end: 0
line=EditOps scores=dis
Initializing EditOps feature..
FeatureFunction: EditOps0 start: 1 end: 3
line=Generation name=Generation0 num-features=0 input-factor=0 output-factor=1 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.classes.gz
FeatureFunction: Generation0 start: 4 end: 3
line=UnknownWordPenalty
FeatureFunction: UnknownWordPenalty0 start: 4 end: 4
line=WordPenalty
FeatureFunction: WordPenalty0 start: 5 end: 5
line=PhrasePenalty
FeatureFunction: PhrasePenalty0 start: 6 end: 6
line=PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/data/phrase-table.0-0.gz input-factor=0 output-factor=0
FeatureFunction: TranslationModel0 start: 7 end: 10
line=KENLM lazyken=0 name=LM0 factor=0 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/data/lm.cor.kenlm order=5
FeatureFunction: LM0 start: 11 end: 11
line=KENLM lazyken=0 name=LM1 factor=0 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.blm order=5
FeatureFunction: LM1 start: 12 end: 12
line=KENLM lazyken=0 name=LM2 factor=1 path=/Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/wikilm/wiki.wclm.kenlm order=9
FeatureFunction: LM2 start: 13 end: 13
Loading CorrectionPattern0
Loading CorrectionPattern1
Loading OpSequenceModel0
Loading EditOps0
Loading Generation0
Loading UnknownWordPenalty0
Loading WordPenalty0
Loading PhrasePenalty0
Loading LM0
Loading LM1
Loading LM2
Loading TranslationModel0
Start loading text phrase table. Moses format : [51.891] seconds
Reading /Users/admin/fhs/smt-baseline/baselines-emnlp2016-master/models/data/phrase-table.0-0.gz
```

It seems the size of my wiki.blm file is different from yours. It's strange that I can run the script and get a result even though reading wiki.blm fails. I will download the wiki package and try it again. Thank you very much!

ghozn commented 5 years ago

Hi snukky, I downloaded the wiki package using Chrome and Thunder, but the file size I got is the same as before, 22.28 GB. So I think file corruption can be excluded. And my machine has 16 GB of RAM.

ghozn commented 5 years ago

May I ask what the function of the lazy decoder is? Will it affect the final result if I don't use it?

snukky commented 5 years ago

It was used for recasing the output. The evaluation is case-sensitive, so it may impact the results.
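
For intuition: the pipeline builds a graph of casing alternatives for each sentence (case_graph.perl) and lets the lazy decoder pick the variant the Wikipedia LM scores best. A minimal sketch of the same idea in Python, assuming the kenlm Python module is available and brute-forcing whole-sentence variants instead of searching a lattice (exponential in sentence length, so only for illustration):

```python
# Pick the casing variant with the highest LM score.
# The real pipeline (case_graph.perl + lazy's decoder) searches a
# lattice; this sketch enumerates 2^n whole-sentence variants.
import itertools
import kenlm  # assumes the kenlm Python module is installed

lm = kenlm.Model("wikilm/wiki.blm")  # path from this thread

def best_casing(tokens):
    # For each token, consider a lowercase and a capitalized variant.
    variants = [{t.lower(), t.capitalize()} for t in tokens]
    candidates = (" ".join(c) for c in itertools.product(*variants))
    return max(candidates, key=lambda s: lm.score(s, bos=True, eos=True))

print(best_casing("the eiffel tower is in paris .".split()))
```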

ghozn commented 5 years ago

In the evaluation script m2scorer_fork, we can use the parameter --ignore_whitespace_casing, which ignores differences in capitalization. Will this have the same effect?

ghozn commented 5 years ago

The issue is fixed, thanks!

snukky commented 5 years ago

The scores from m2scorer with --ignore_whitespace_casing shouldn't be directly compared with other results that were obtained using the default settings.
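
If you want to see how much casing alone contributes to your numbers, you can score the same output both ways; a minimal sketch (the output and m2 file names here are placeholders, and m2scorer is a Python 2 script):

```python
# Score one system output with and without --ignore_whitespace_casing
# to isolate the effect of casing on the M2 metric.
import subprocess

for flags in ([], ["--ignore_whitespace_casing"]):
    print("flags:", flags or "(default)")
    subprocess.run(
        ["python2", "m2scorer/scripts/m2scorer.py", *flags,
         "system.out", "official-2014.combined.m2"],  # placeholder paths
        check=True,
    )
```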

ghozn commented 5 years ago

The error is fixed by changing the code in lazy. Thank you for your careful reply.
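
For anyone hitting the same error: my understanding (an assumption, not confirmed in this thread) is that on macOS a single read() call of more than INT_MAX bytes fails with EINVAL, which matches the "Invalid argument ... while reading 21992807322 bytes" message, so the read loop in lazy's bundled kenlm util has to cap each request. A minimal sketch of that kind of chunked read in Python:

```python
# Read a large file in bounded chunks so that no single read()
# exceeds what the OS accepts (macOS rejects reads larger than
# INT_MAX bytes with EINVAL). Illustration only, not the actual patch.
import os

MAX_CHUNK = 1 << 30  # 1 GiB per syscall, safely below INT_MAX

def read_exactly(fd, size):
    parts = []
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, min(remaining, MAX_CHUNK))
        if not chunk:  # unexpected end of file
            raise EOFError("file ended early")
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)

fd = os.open("wikilm/wiki.blm", os.O_RDONLY)
try:
    data = read_exactly(fd, os.fstat(fd).st_size)
finally:
    os.close(fd)
```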