harvardnlp / seq2seq-attn

Sequence-to-sequence model with LSTM encoder/decoders and attention
http://nlp.seas.harvard.edu/code
MIT License

Preprocess failed for the WMT'14 #92

IdiosyncraticDragon closed this issue 7 years ago

IdiosyncraticDragon commented 7 years ago

This problem puzzled me for a day. I wanted to retrain the pre-trained model after pruning it with prune.py. I downloaded the English<->German parallel data set from http://www.statmt.org/wmt14/translation-task.html, which consists of:

- commoncrawl.de-en.en / commoncrawl.de-en.de
- europarl-v7.de-en.en / europarl-v7.de-en.de
- news-commentary-v10.de-en.en / news-commentary-v10.de-en.de

Then I concatenated commoncrawl.de-en.en, europarl-v7.de-en.en, and news-commentary-v10.de-en.en into train.en, and commoncrawl.de-en.de, europarl-v7.de-en.de, and news-commentary-v10.de-en.de into train.de, so that I could prepare the *.hdf5 files for training. train.en contains 4535522 sentences, as does train.de.
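
The concatenation step was roughly the following (a shell sketch; it assumes all six files sit in the current directory):

```sh
# Concatenate the three corpora in the same order on both sides,
# so that line i of train.en stays aligned with line i of train.de.
cat commoncrawl.de-en.en europarl-v7.de-en.en news-commentary-v10.de-en.en > train.en
cat commoncrawl.de-en.de europarl-v7.de-en.de news-commentary-v10.de-en.de > train.de

# Sanity check: both files should report the same line count.
wc -l train.en train.de
```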

But then I met this error:

```
$> python preprocess.py --srcfile /home/data/wmt/wmt15-de-en/train.en --targetfile /home/data/wmt/wmt15-de-en/train.de --srcvalfile data/src-val.txt --targetvalfile data/targ-val.txt --outputfile data/wmt-deen

First pass through data to get vocab...
Number of sentences in training: 4294257
Number of sentences in valid: 2819
Number of additional features on source side: 1
* source feature 1 of size: 5
Source vocab size: Original = 1718846, Pruned = 50004
Target vocab size: Original = 2913125, Pruned = 50004
Traceback (most recent call last):
  File "preprocess.py", line 524, in <module>
    sys.exit(main(sys.argv[1:]))
  File "preprocess.py", line 521, in main
    get_data(args)
  File "preprocess.py", line 456, in get_data
    max_word_l, max_sent_l, args.chars, args.unkfilter, args.shuffle)
  File "preprocess.py", line 282, in convert
    sources_features[i][sent_id] = np.array(sourcefeatures[i], dtype=int)
TypeError: 'NoneType' object has no attribute '__getitem__'
```

Did I follow the right process to generate the WMT'14 training set? Or are there any uploaded hdf5 files for the pre-trained model? Furthermore, if I generate the hdf5 files successfully, can I use them to retrain the pre-trained model directly, given that the contents of the pre-trained model's *.dict files may differ from the newly generated *.dict files produced by preprocess.py?

guillaumekln commented 7 years ago

Could you check that your data do not contain the word features separator?

https://github.com/harvardnlp/seq2seq-attn#using-additional-input-features
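
As described there, features are appended to each token with the separator sequence -|-, so an annotated source line looks roughly like this (hypothetical tokens and features):

```
the-|-DET man-|-NN sleeps-|-VB
```

If the raw corpus itself happens to contain that character sequence, preprocess.py can wrongly infer an additional source-side feature, which matches the "Number of additional features on source side: 1" line in your log.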

For the second question, you can't retrain a model with a different vocabulary.

IdiosyncraticDragon commented 7 years ago

@guillaumekln Yes, there are some occurrences of the feature separator "|" in the data. What should I do with them? Should I just delete them?

guillaumekln commented 7 years ago

The separator is the sequence -|-. You should check for those instead and remove or replace them.
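
For example, a quick shell sketch to find and clean them (assuming GNU grep and sed; keep a backup before editing in place):

```sh
# Count, per file, the lines containing the literal sequence "-|-"
# ("--" stops option parsing so the pattern is not read as a flag).
grep -c -- '-|-' train.en train.de

# Replace every occurrence with a plain "|" in place; any replacement
# that removes the exact sequence "-|-" would work equally well.
sed -i 's/-|-/|/g' train.en train.de
```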

IdiosyncraticDragon commented 7 years ago

@guillaumekln It works! I am now training the model on the processed data. Thank you very much.