'Levenshtein greater than source size' when training re-ranker

Hi!

I am trying to train your model but I keep running into different errors when trying to train the re-ranker. If I use the dev.m2 file (which contains all the original annotations, without applying BPE), I can finish training but I get a lot of "Levenshtein distance is greater than source size" warnings. I did the initial preprocessing of the dataset I'm using with my own scripts (mostly for correcting the sentences), so I'm not sure whether the dev.m2 file used is the same as mine. I have checked the prepare_data.sh script and it seems it might just be the original annotation file? I have also tried using the .src and .trg dev sets, which are split with BPE unlike the .m2 file, and the error changes to 'segmentation fault' (which makes sense as the nbest outputs have been 'de-bpe-ized'). This is what I get when I try using the src file just in case:

+ python2.7 /home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py -i /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt -r processed/dev.src -c /DATA/slima_data/models/training//rerank_config.ini --threads 12 --tuning-metric m2 --predictable-seed -o /DATA/slima_data/models/training/ --moses-dir ../../mosesdecoder --no-add-weight [INFO] [01-10-2019 10:45:29] Arguments: [INFO] [01-10-2019 10:45:29] alg: mert [INFO] [01-10-2019 10:45:29] command: /home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py -i /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt -r processed/dev.src -c /DATA/slima_data/models/training//rerank_config.ini --threads 12 --tuning-metric m2 --predictable-seed -o /DATA/slima_data/models/training/ --moses-dir ../../mosesdecoder --no-add-weight [INFO] [01-10-2019 10:45:29] init_value: 0.05 [INFO] [01-10-2019 10:45:29] input_config: /DATA/slima_data/models/training//rerank_config.ini [INFO] [01-10-2019 10:45:29] input_nbest: /DATA/slima_data/models/dev.output.tok.nbest.reformat.augmented.txt [INFO] [01-10-2019 10:45:29] metric: m2 [INFO] [01-10-2019 10:45:29] moses_dir: ../../mosesdecoder [INFO] [01-10-2019 10:45:29] no_add_weight: True [INFO] [01-10-2019 10:45:29] out_dir: /DATA/slima_data/models/training/ [INFO] [01-10-2019 10:45:29] pred_seed: True [INFO] [01-10-2019 10:45:29] ref_paths: processed/dev.src [INFO] [01-10-2019 10:45:29] threads: 12 [INFO] [01-10-2019 10:45:29] Reading weights from config file [INFO] [01-10-2019 10:45:29] Feature weights: ['F0= 0.5', 'EditOps0= 0.2 0.2 0.2'] [INFO] [01-10-2019 10:45:29] Extracting stats and features [WARNING] The optional arguments of extractor are not used yet [INFO] [01-10-2019 10:45:29] Executing command: ../../mosesdecoder/bin/extractor --sctype M2SCORER --scconfig ignore_whitespace_casing:true -r processed/dev.src -n /DATA/slima_data/models/training//augmented.nbest --scfile /DATA/slima_data/models/training//statscore.data --ffile /DATA/slima_data/models/training//features.data Binary write mode is NOT selected Scorer type: M2SCORER name: ignore_whitespace_casing value: true Segmentation fault (core dumped) [INFO] [01-10-2019 10:45:29] Running MERT [INFO] [01-10-2019 10:45:29] Command: ../../mosesdecoder/bin/mert -d 4 -S /DATA/slima_data/models/training//statscore.data -F /DATA/slima_data/models/training//features.data --ifile /DATA/slima_data/models/training//init.opt --threads 12 -r 1 --sctype M2SCORER --scconfig ignore_whitespace_casing:true shard_size = 0 shard_count = 0 Seeding random numbers with 1 name: ignore_whitespace_casing value: true Data::m_score_type M2Scorer Data::Scorer type from Scorer: M2Scorer Loading Data from: /DATA/slima_data/models/training//statscore.data and /DATA/slima_data/models/training//features.data loading feature data from /DATA/slima_data/models/training//features.data loading score data from /DATA/slima_data/models/training//statscore.data Data loaded : [Wall 0.000461 CPU 0.000457] seconds. Creating a pool of 12 threads terminate called recursively terminate called after throwing an instance of 'std::runtime_error' Aborted (core dumped) [INFO] [01-10-2019 10:45:29] Optimization complete. Traceback (most recent call last): File "/home/slima/projects/mlconvgec2018/software/nbest-reranker/train.py", line 93, in <module> assert os.path.isfile('weights.txt') AssertionError

Any idea of what the problem might be? I think it's probably that the file I used for the -r argument of the re-ranker train.py script might be different to the one you used because of the preprocessing. I also trained my own BPE model and my own embeddings. As you can see in the error excerpt, right now I'm trying to train using only eo features, if that matters.

Thank you in advance, and thank you for making your code open too. It's very helpful.

Edit: nevermind, it was my fault because I forgot I shuffled the dev sentences when preprocessing them. Thank you anyway and sorry!

Edit 2: closed too early because I thought it was just a stupid mistake but the problem persists... should the m2 file not include the annotations?

nusnlp / mlconvgec2018

'Levenshtein greater than source size' when training re-ranker #29