nusnlp / mlconvgec2018

Code and model files for the paper: "A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction" (AAAI-18).
GNU General Public License v3.0

'Levenshtein greater than source size' when training re-ranker #29

Closed: s-lilo closed this issue 4 years ago

s-lilo commented 5 years ago

Hi!

I am trying to train your model, but I keep running into different errors when training the re-ranker. If I use the dev.m2 file (which contains all the original annotations, without BPE applied), training finishes, but I get a lot of "Levenshtein distance is greater than source size" warnings. I did the initial preprocessing of my dataset with my own scripts (mostly for correcting the sentences), so I'm not sure whether the dev.m2 file you used is the same as mine; from the prepare_data.sh script, it seems it might just be the original annotation file. I have also tried using the .src and .trg dev sets, which, unlike the .m2 file, are split with BPE, and the error changes to a segmentation fault (which makes sense, as the n-best outputs have been de-BPE-ized). Just in case, this is what I get when I use the .src file:

```
+ python2.7 /home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py -i /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt -r processed/dev.src -c /DATA/slima_data/models/training//rerank_config.ini --threads 12 --tuning-metric m2 --predictable-seed -o /DATA/slima_data/models/training/ --moses-dir ../../mosesdecoder --no-add-weight
[INFO] [01-10-2019 10:45:29] Arguments:
[INFO] [01-10-2019 10:45:29] alg: mert
[INFO] [01-10-2019 10:45:29] command: /home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py -i /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt -r processed/dev.src -c /DATA/slima_data/models/training//rerank_config.ini --threads 12 --tuning-metric m2 --predictable-seed -o /DATA/slima_data/models/training/ --moses-dir ../../mosesdecoder --no-add-weight
[INFO] [01-10-2019 10:45:29] init_value: 0.05
[INFO] [01-10-2019 10:45:29] input_config: /DATA/slima_data/models/training//rerank_config.ini
[INFO] [01-10-2019 10:45:29] input_nbest: /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt
[INFO] [01-10-2019 10:45:29] metric: m2
[INFO] [01-10-2019 10:45:29] moses_dir: ../../mosesdecoder
[INFO] [01-10-2019 10:45:29] no_add_weight: True
[INFO] [01-10-2019 10:45:29] out_dir: /DATA/slima_data/models/training/
[INFO] [01-10-2019 10:45:29] pred_seed: True
[INFO] [01-10-2019 10:45:29] ref_paths: processed/dev.src
[INFO] [01-10-2019 10:45:29] threads: 12
[INFO] [01-10-2019 10:45:29] Reading weights from config file
[INFO] [01-10-2019 10:45:29] Feature weights: ['F0= 0.5', 'EditOps0= 0.2 0.2 0.2']
[INFO] [01-10-2019 10:45:29] Extracting stats and features
[WARNING] The optional arguments of extractor are not used yet
[INFO] [01-10-2019 10:45:29] Executing command: ../../mosesdecoder/bin/extractor --sctype M2SCORER --scconfig ignore_whitespace_casing:true -r processed/dev.src -n /DATA/slima_data/models/training//augmented.nbest --scfile /DATA/slima_data/models/training//statscore.data --ffile /DATA/slima_data/models/training//features.data
Binary write mode is NOT selected
Scorer type: M2SCORER
name: ignore_whitespace_casing value: true
Segmentation fault (core dumped)
[INFO] [01-10-2019 10:45:29] Running MERT
[INFO] [01-10-2019 10:45:29] Command: ../../mosesdecoder/bin/mert -d 4 -S /DATA/slima_data/models/training//statscore.data -F /DATA/slima_data/models/training//features.data --ifile /DATA/slima_data/models/training//init.opt --threads 12 -r 1 --sctype M2SCORER --scconfig ignore_whitespace_casing:true
shard_size = 0
shard_count = 0
Seeding random numbers with 1
name: ignore_whitespace_casing value: true
Data::m_score_type M2Scorer
Data::Scorer type from Scorer: M2Scorer
Loading Data from: /DATA/slima_data/models/training//statscore.data and /DATA/slima_data/models/training//features.data
loading feature data from /DATA/slima_data/models/training//features.data
loading score data from /DATA/slima_data/models/training//statscore.data
Data loaded : [Wall 0.000461 CPU 0.000457] seconds.
Creating a pool of 12 threads
terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
Aborted (core dumped)
[INFO] [01-10-2019 10:45:29] Optimization complete.
Traceback (most recent call last):
  File "/home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py", line 93, in <module>
    assert os.path.isfile('weights.txt')
AssertionError
```
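(A note for anyone debugging a similar crash: the extractor segfault is consistent with the `-r` reference file not lining up with the n-best hypotheses. Below is a minimal sanity check, assuming the standard Moses n-best format `id ||| hypothesis ||| features ||| score`, where the first field is a 0-based sentence index. The script is illustrative only and is not part of this repository.)

```python
# align_check.py -- illustrative helper, not part of mlconvgec2018.
# Checks that a Moses-format n-best file ("id ||| hypothesis ||| ...")
# covers exactly as many sentences as the reference file passed via -r.
import sys

def check(nbest_path, ref_path):
    max_id = -1
    with open(nbest_path) as f:
        for line in f:
            # The first |||-separated field is the 0-based sentence id.
            max_id = max(max_id, int(line.split("|||", 1)[0].strip()))
    with open(ref_path) as f:
        n_refs = sum(1 for _ in f)
    print("n-best covers {} sentences, reference has {} lines".format(max_id + 1, n_refs))
    return max_id + 1 == n_refs

if __name__ == "__main__":
    sys.exit(0 if check(sys.argv[1], sys.argv[2]) else 1)
```

Note that this only catches count mismatches; a shuffled dev set (the actual problem here, as it turns out below) would still pass, so spot-checking a few sentences against the n-best hypotheses by eye is worthwhile too.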

Any idea what the problem might be? I think the file I used for the -r argument of the re-ranker's train.py script is probably different from the one you used because of the preprocessing. I also trained my own BPE model and my own embeddings. As you can see in the error excerpt, I am currently training with only the edit-operation (eo) features, if that matters.

Thank you in advance, and thank you for making your code open too. It's very helpful.

Edit: never mind, it was my fault: I forgot that I had shuffled the dev sentences during preprocessing. Thank you anyway, and sorry!

Edit 2: I closed this too early because I thought it was just a silly mistake on my part, but the problem persists... should the m2 file not include the annotations?

shamilcm commented 5 years ago

If you are training the re-ranker with the M2 metric, you need to use an M2-format file (the format released with the CoNLL-2013 and CoNLL-2014 test sets). This is the same format as the dev.m2 file prepared by the released scripts (the format is described in the README at https://github.com/nusnlp/m2scorer). The "Levenshtein distance" warnings appear during training because we use an external implementation of M2 within the Moses framework, which skips source/target sentence pairs that are too distant, presumably for efficiency, and prints the warning you saw. Alternatively, we have recently implemented and released another version of the n-best re-ranker (https://github.com/nusnlp/crosentgec/blob/master/tools/nbest-reranker/) with a built-in M2 implementation that does not skip sentence pairs and can still compute M2 efficiently. We noticed that this may also slightly improve results.
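(For reference, an M2 file interleaves each tokenized source sentence, an `S` line, with its annotated edits, `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator_id`, as documented in the m2scorer README linked above. The sentence and edits below are illustrative only:)

```
S This are a example sentence .
A 1 2|||SVA|||is|||REQUIRED|||-NONE-|||0
A 2 3|||ArtOrDet|||an|||REQUIRED|||-NONE-|||0
```

Here the first edit replaces token 1 ("are") with "is", and the second replaces token 2 ("a") with "an"; token offsets are 0-based over the `S` line. A plain .src file with no `A` lines cannot be scored by the M2 scorer, which is why the reference passed via -r must keep the annotations.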