Could you check that your data do not contain the word features separator?
https://github.com/harvardnlp/seq2seq-attn#using-additional-input-features
For the second question, you can't retrain a model with a different vocabulary.
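If your copy of preprocess.py has the preset-vocab options documented in the README (`srcvocabfile`/`targetvocabfile`), you can point them at the pre-trained model's dictionaries instead of building new ones. A rough sketch (the dict paths here are placeholders):

```
# Reuse the pre-trained model's dictionaries instead of generating new ones
python preprocess.py --srcfile /home/data/wmt/wmt15-de-en/train.en \
    --targetfile /home/data/wmt/wmt15-de-en/train.de \
    --srcvalfile data/src-val.txt --targetvalfile data/targ-val.txt \
    --srcvocabfile /path/to/pretrained/src.dict \
    --targetvocabfile /path/to/pretrained/targ.dict \
    --outputfile data/wmt-deen
```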
@guillaumekln Yes, there are some feature separators "|" in the data. What should I do with them? Should I just delete them?
The separator is the sequence `-|-`. You should check for those instead and remove or replace them.
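For example, something along these lines should locate and clean them (file names assumed from your setup):

```
# Show a few of the lines that contain the feature separator
grep -n -e '-|-' train.en train.de | head

# Replace the separator with a plain "|" (or delete it with 's/-|-//g')
sed -i 's/-|-/|/g' train.en train.de
```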
@guillaumekln It works! I am now training the model with the processed data. Thank you very much.
This problem puzzled me for a day. I wanted to retrain the pre-trained model after pruning it with prune.py. I downloaded the English<->German parallel data sets from http://www.statmt.org/wmt14/translation-task.html, namely
commoncrawl.de-en.en, europarl-v7.de-en.en, news-commentary-v10.de-en.en, commoncrawl.de-en.de, europarl-v7.de-en.de, news-commentary-v10.de-en.de
Then I concatenated commoncrawl.de-en.en, europarl-v7.de-en.en, and news-commentary-v10.de-en.en into train.en, and commoncrawl.de-en.de, europarl-v7.de-en.de, and news-commentary-v10.de-en.de into train.de, so that I could prepare the *.hdf5 files for training. train.en contains 4535522 sentences, and so does train.de.
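Concretely, I did something like this (paths simplified):

```
# English side: three corpora concatenated into one source file
cat commoncrawl.de-en.en europarl-v7.de-en.en news-commentary-v10.de-en.en > train.en
# German side: the matching corpora concatenated into one target file
cat commoncrawl.de-en.de europarl-v7.de-en.de news-commentary-v10.de-en.de > train.de
```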
But then I met this error:

```
$> python preprocess.py --srcfile /home/data/wmt/wmt15-de-en/train.en --targetfile /home/data/wmt/wmt15-de-en/train.de --srcvalfile data/src-val.txt --targetvalfile data/targ-val.txt --outputfile data/wmt-deen
First pass through data to get vocab...
Number of sentences in training: 4294257
Number of sentences in valid: 2819
Number of additional features on source side: 1
* source feature 1 of size: 5
Source vocab size: Original = 1718846, Pruned = 50004
Target vocab size: Original = 2913125, Pruned = 50004
Traceback (most recent call last):
  File "preprocess.py", line 524, in <module>
    sys.exit(main(sys.argv[1:]))
  File "preprocess.py", line 521, in main
    get_data(args)
  File "preprocess.py", line 456, in get_data
    max_word_l, max_sent_l, args.chars, args.unkfilter, args.shuffle)
  File "preprocess.py", line 282, in convert
    sources_features[i][sent_id] = np.array(sourcefeatures[i], dtype=int)
TypeError: 'NoneType' object has no attribute '__getitem__'
```
Did I follow the right process to generate the WMT'14 training set? Or are there any uploaded hdf5 files for the pre-trained model? Furthermore, if I generate the hdf5 files successfully, can I use them to retrain the pre-trained model directly, given that the contents of the .dict files for the pre-trained model may differ from those in the *.dict files newly generated by preprocess.py?