gongliym / data2text-transformer

Enhanced Transformer Model for Data-to-Text Generation

AssertionError in data loading #3

Open wingedRuslan opened 3 years ago

wingedRuslan commented 3 years ago

Hi,

I was following the steps in the readme to train the model on the ROTOWIRE dataset.

Unfortunately, in the Model Training step I got an AssertionError in the load_para_data() function (see the full log below).

Could you please point out how to fix the error?

Many thanks, Ruslan

INFO - 01/06/21 12:21:36 - 0:00:00 - ============ Data summary ============
INFO - 01/06/21 12:21:36 - 0:00:00 - Loading data from rotowire/train.gtable.pth ...
INFO - 01/06/21 12:21:36 - 0:00:00 - Removed 0 empty sentences.

INFO - 01/06/21 12:21:36 - 0:00:00 - Content-Selection Data -       3398
INFO - 01/06/21 12:21:36 - 0:00:00 - Loading data from rotowire/train.gtable.pth ...
INFO - 01/06/21 12:21:36 - 0:00:00 - Loading data from rotowire/train.summary.pth ...
INFO - 01/06/21 12:21:36 - 0:00:00 - Removed 0 empty sentences.
INFO - 01/06/21 12:21:36 - 0:00:00 - Removed 0 empty sentences.
INFO - 01/06/21 12:21:36 - 0:00:00 - Para Data          -       3398
INFO - 01/06/21 12:21:36 - 0:00:00 - Loading data from rotowire/valid.gtable.pth ...
INFO - 01/06/21 12:21:36 - 0:00:00 - Loading data from rotowire/valid.summary.pth ...
Traceback (most recent call last):
  File "/home/ruslan_yermakov/nlg-ra/reproducibility/data2text/model/train.py", line 229, in <module>
    main(params)
  File "/home/ruslan_yermakov/nlg-ra/reproducibility/data2text/model/train.py", line 162, in main
    data = load_data(params)
  File "/home/ruslan_yermakov/nlg-ra/reproducibility/data2text/model/src/data/loader.py", line 187, in load_data
    dataset = load_para_data(data, params, params.valid_table_path, params.valid_summary_path, 'valid')
  File "/home/ruslan_yermakov/nlg-ra/reproducibility/data2text/model/src/data/loader.py", line 129, in load_para_data
    assert data['source_dico'] == table_data['dico']
AssertionError
KonstantinRothe commented 3 years ago

I was able to fix this for my own dataset. I had to create a file that contains every token of both the training and validation sets (basically merging the train & valid sets for the creation of the Dictionary). I kept the structure of the original files for this new file and then followed the preprocessing steps on it. You should get a 'newfile_vocab' or something similar during preprocessing. I used this newfile_vocab as the input vocab for both the train and valid datasets. After this fix it worked. I have no idea how the author got this to work with the instructions as given.
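The merge step above can be sketched in a few lines of Python. This is a minimal, hedged example, assuming the preprocessing emits XLM-style vocab files with one `token count` pair per line (check your own preprocessing output); the function name `merge_vocabs` and all file names are placeholders, not part of the repo. The idea is that `data['source_dico'] == table_data['dico']` only holds if train and valid were built from the same shared vocabulary:

```python
from collections import Counter
import os, tempfile

def merge_vocabs(paths, out_path):
    """Merge vocab files ("token count" per line, count optional) into one shared vocab."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                # If a count column is present, accumulate it; otherwise count occurrences.
                freq = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 1
                counts[parts[0]] += freq
    with open(out_path, "w", encoding="utf-8") as f:
        for token, freq in counts.most_common():
            f.write(f"{token} {freq}\n")

# Tiny demonstration with made-up vocab files (contents are illustrative only).
tmp = tempfile.mkdtemp()
train_vocab = os.path.join(tmp, "train_vocab")
valid_vocab = os.path.join(tmp, "valid_vocab")
merged = os.path.join(tmp, "merged_vocab")
with open(train_vocab, "w", encoding="utf-8") as f:
    f.write("points 10\nrebounds 4\n")
with open(valid_vocab, "w", encoding="utf-8") as f:
    f.write("points 3\nassists 2\n")
merge_vocabs([train_vocab, valid_vocab], merged)
print(open(merged).read())
```

You would then pass the merged file as the vocab argument when preprocessing both the train and valid splits, so both `.pth` files carry an identical dictionary.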