lvapeab / nmt-keras

Neural Machine Translation with Keras
http://nmt-keras.readthedocs.io
MIT License
533 stars 130 forks

Data Error? #137

Closed VP007-py closed 2 years ago

VP007-py commented 3 years ago

This is the error log I get after reinstalling and running the model with `python3 main.py`. Any fixes/suggestions regarding `keras_wrapper`?

```
Using Theano backend.
[02/08/2020 11:27:52] <<< Cupy not available. Using numpy. >>>
[02/08/2020 11:27:52] Running training.
[02/08/2020 11:27:52] Building mydata5_enhi dataset
[02/08/2020 11:27:53]   Applying tokenization function: "tokenize_none".
[02/08/2020 11:27:53] Creating vocabulary for data with data_id 'target_text'.
[02/08/2020 11:27:53]    Total: 42963 unique words in 80000 sentences with a total of 466800 words.
[02/08/2020 11:27:53] Creating dictionary of all words
[02/08/2020 11:27:53] Loaded "train" set outputs of data_type "text" with data_id "target_text" and length 80000.
[02/08/2020 11:27:53] Loaded "train" set outputs of type "file-name" with id "raw_target_text".
[02/08/2020 11:27:53]   Applying tokenization function: "tokenize_none".
[02/08/2020 11:27:53] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 1800.
[02/08/2020 11:27:53] Loaded "val" set outputs of type "file-name" with id "raw_target_text".
[02/08/2020 11:27:53]   Applying tokenization function: "tokenize_none".
[02/08/2020 11:27:53] Loaded "test" set outputs of data_type "text" with data_id "target_text" and length 1200.
[02/08/2020 11:27:53] Loaded "test" set outputs of type "file-name" with id "raw_target_text".
[02/08/2020 11:27:53]   Applying tokenization function: "tokenize_none".
[02/08/2020 11:27:53] Creating vocabulary for data with data_id 'source_text'.
[02/08/2020 11:27:54]    Total: 34428 unique words in 80000 sentences with a total of 401199 words.
[02/08/2020 11:27:54] Creating dictionary of all words
Traceback (most recent call last):
  File "main.py", line 49, in <module>
    train_model(parameters, args.dataset)
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/training.py", line 62, in train_model
    dataset = build_dataset(params)
  File "/home/pandramish.vinay/nmt-keras/data_engine/prepare_data.py", line 192, in build_dataset
    bpe_codes=params.get('BPE_CODES_PATH', None))
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras_wrapper/dataset.py", line 1255, in setInput
    add_additional)
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras_wrapper/dataset.py", line 1282, in __setInput
    self.__checkLengthSet(set_name)
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras_wrapper/dataset.py", line 4978, in __checkLengthSet
    lengths.append(len(getattr(self, 'Y_' + set_name)[id_out]))
KeyError: 'raw_target_text'
```
lvapeab commented 3 years ago

Hi Vinay,

Unfortunately, I'm unable to reproduce the error. Please attach your `config.py` file and make sure you are working with the latest version. If you modified `data_engine/prepare_data.py`, please share that as well.

That being said, my guess is that you are calling setInput somewhere in `data_engine/prepare_data.py` when you actually want setRawOutput. However, note that setRawOutput was removed in https://github.com/lvapeab/nmt-keras/commit/4ba94e2b88fc2c098a9ea9f528e0b8ca75ffc32d, as it made no sense to keep it in the dataset for the general use case.
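For reference, this is roughly the pattern `data_engine/prepare_data.py` follows when populating the Dataset (a minimal sketch; the dataset name and file paths below are placeholders, not your actual config):

```python
from keras_wrapper.dataset import Dataset

# Placeholder name and data root.
ds = Dataset('mydata5_enhi', '/path/to/data', silence=False)

# The target side goes through setOutput. The removed setRawOutput used to
# additionally register the untokenized files (the "raw_target_text"
# entries you can see in the log above).
ds.setOutput('training.trg',            # placeholder file
             'train',
             type='text',
             id='target_text',
             tokenization='tokenize_none',
             build_vocabulary=True)

# The source side goes through setInput, never setRawOutput.
ds.setInput('training.src',             # placeholder file
            'train',
            type='text',
            id='source_text',
            tokenization='tokenize_none',
            build_vocabulary=True)
```

If a call ends up registering an output id for one split but not the others, the per-set length check (`__checkLengthSet`) can fail with a KeyError like the one you posted.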

VP007-py commented 3 years ago

Hey, apologies for not updating... the current version works perfectly fine!

VP007-py commented 3 years ago

Once again, I get a similar error for different datasets. I did check the parallel corpora and there are no issues with them.


```
Using TensorFlow backend.
[11/08/2020 19:49:41] Limited tf.compat.v2.summary API due to missing TensorBoard installation.
[11/08/2020 19:49:44] Running training.
[11/08/2020 19:49:44] Building Newdataset_hien dataset
[11/08/2020 19:49:45]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:49:45] Creating vocabulary for data with data_id 'target_text'.
[11/08/2020 19:49:46]    Total: 97033 unique words in 95000 sentences with a total of 1977052 words.
[11/08/2020 19:49:46] Creating dictionary of all words
[11/08/2020 19:49:47] Loaded "train" set outputs of data_type "text-features" with data_id "target_text" and length 95000.
[11/08/2020 19:49:47]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:49:47] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 5000.
[11/08/2020 19:49:47]   Applying tokenization function: "tokenize_none".
[11/08/2020 19:49:47] Loaded "test" set outputs of data_type "text" with data_id "target_text" and length 2500.
[11/08/2020 19:49:47]   Applying tokenization function: "tokenize_none".
Traceback (most recent call last):
  File "main.py", line 51, in <module>
    train_model(parameters, args.dataset)
  File "/home/pandramish.vinay/nmt-keras/nmt_keras/training.py", line 74, in train_model
    dataset = build_dataset(params)
  File "/home/pandramish.vinay/nmt-keras/data_engine/prepare_data.py", line 185, in build_dataset
    bpe_codes=params.get('BPE_CODES_PATH', None))
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras_wrapper/dataset.py", line 1204, in setInput
    use_unk_class=use_unk_class)
  File "/home/pandramish.vinay/.local/lib/python3.5/site-packages/keras_wrapper/dataset.py", line 2097, in preprocessTextFeatures
    '" in order to process the type "text" data. Set "build_vocabulary" to True if you want to use the current data for building the vocabulary.')
Exception: The dataset must include a vocabulary with data_id "source_text" in order to process the type "text" data. Set "build_vocabulary" to True if you want to use the current data for building the vocabulary.
```

lvapeab commented 3 years ago

Did you set `build_vocabulary=True` when building the Dataset object?
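If so, I believe the ordering matters too: the vocabulary for data_id 'source_text' only exists after some split has been processed with `build_vocabulary=True`, so the 'train' split must be registered before 'val' and 'test', which reuse it. A minimal sketch (continuing from a Dataset object `ds`, with placeholder file names):

```python
# Build the 'source_text' vocabulary from the training split first.
ds.setInput('training.src',             # placeholder file
            'train',
            type='text',
            id='source_text',
            tokenization='tokenize_none',
            build_vocabulary=True)      # creates the 'source_text' vocabulary

# val/test are then processed against the existing vocabulary
# (build_vocabulary is left at its default, False).
for split, filename in [('val', 'dev.src'), ('test', 'test.src')]:
    ds.setInput(filename,
                split,
                type='text',
                id='source_text',
                tokenization='tokenize_none')
```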

VP007-py commented 3 years ago

I did enable `build_vocabulary=True` in `ds.setInput` here, and the same error still occurs sometimes.

lvapeab commented 3 years ago

Sometimes it fails... but other times it works? Weird.

Can you share your config.py file?