memray / OpenNMT-kpg-release

Keyphrase Generation
MIT License

Error occurs when running preprocess.py #33

Closed BiggyBing closed 3 years ago

BiggyBing commented 3 years ago

Hi, I encountered the following error when I ran:

python preprocess.py -config config/preprocess/config-preprocess-keyphrase-kp20k.yml

```
wandb: WARNING W&B installed but not logged in. Run wandb login or set the WANDB_API_KEY env variable.
[2021-04-19 04:33:33,536 INFO] Extracting features...
[2021-04-19 04:33:33,537 INFO] number of source features: 0.
[2021-04-19 04:33:33,537 INFO] number of target features: 0.
[2021-04-19 04:33:33,537 INFO] Building Fields object...
[2021-04-19 04:33:33,537 INFO] Building & saving training data...
[2021-04-19 04:33:33,537 INFO] Using existing vocabulary...
[2021-04-19 04:33:35,631 INFO] Building shard 0.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/bingyang/keyphrase/transformer_gan/OpenNMT-kpg/onmt/bin/preprocess.py", line 70, in process_one_shard
    filter_pred=filter_pred
  File "/home/bingyang/keyphrase/transformer_gan/OpenNMT-kpg/onmt/inputters/keyphrase_dataset.py", line 164, in __init__
    self.dataset_type = infer_dataset_type(dirs[0])
  File "/home/bingyang/keyphrase/transformer_gan/OpenNMT-kpg/onmt/inputters/keyphrase_dataset.py", line 60, in infer_dataset_type
    'Accecpted values:' + KP_DATASET_FIELDS.keys()
TypeError: must be str, not dict_keys
"""
```

The above exception was the direct cause of the following exception:

```
Traceback (most recent call last):
  File "preprocess.py", line 6, in <module>
    main()
  File "/home/bingyang/keyphrase/transformer_gan/OpenNMT-kpg/onmt/bin/preprocess.py", line 310, in main
    preprocess(opt)
  File "/home/bingyang/keyphrase/transformer_gan/OpenNMT-kpg/onmt/bin/preprocess.py", line 290, in preprocess
    'train', fields, src_reader, tgt_reader, align_reader, opt)
  File "/home/bingyang/keyphrase/transformer_gan/OpenNMT-kpg/onmt/bin/preprocess.py", line 217, in build_save_dataset
    for sub_counter in p.imap(func, shard_iter):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
TypeError: must be str, not dict_keys
```
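Note that the TypeError in the last frame comes from the error-reporting path itself: infer_dataset_type fails to recognize the data format and tries to raise an informative error, but concatenating a str with a dict_keys view crashes before the intended message is built. A minimal sketch of that failure mode and a possible fix (the dictionary contents below are placeholders, not the repo's actual KP_DATASET_FIELDS):

```python
# Placeholder mapping: the real KP_DATASET_FIELDS maps dataset types to their
# expected JSON fields; the keys/values below are illustrative only.
KP_DATASET_FIELDS = {"scipaper": None, "qa": None}

try:
    # str + dict_keys is a TypeError ("must be str, not dict_keys" on
    # Python 3.6), so this line crashes before the intended message appears.
    message = 'Accecpted values:' + KP_DATASET_FIELDS.keys()
except TypeError as err:
    print(err)

# A possible fix: join the keys into a string first.
message = 'Accepted values: ' + ', '.join(KP_DATASET_FIELDS.keys())
print(message)  # Accepted values: scipaper, qa
```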

I wonder if the processed data (*.pt) for magkp could be provided, as it was for kp20k.

memray commented 3 years ago

Sorry, I wanted to, but the size is too large. Please check out the latest code and use the vocab file here. It should be working now.

memray commented 3 years ago

Also, please use these commands to preprocess magkp :D

python -m kp_data_converter -src_file data/magkp/magkp_training.json -output_path data/magkp/magkp_train -lower -filter -max_src_seq_length 1000 -min_src_seq_length 10 -max_tgt_seq_length 8 -min_tgt_seq_length 1 -shuffle

python preprocess.py -config config/preprocess/config-preprocess-keyphrase-magkp.yml
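For clarity, the first command converts the raw JSON into the parallel .src/.tgt text files that preprocess.py then shards into .pt files. A rough sketch of the conversion idea, assuming one JSON object per line with title/abstract/keywords fields (the field names, separators, and lowercasing details here are assumptions, not the script's exact behavior):

```python
import json

with open('data/magkp/magkp_training.json') as fin, \
     open('data/magkp/magkp_train.src', 'w') as fsrc, \
     open('data/magkp/magkp_train.tgt', 'w') as ftgt:
    for line in fin:
        ex = json.loads(line)
        # Source: title + abstract, lowercased (mirroring the -lower flag).
        fsrc.write((ex['title'] + ' . ' + ex['abstract']).lower() + '\n')
        # Target: keyphrases joined by a separator token.
        ftgt.write(' ; '.join(kp.lower() for kp in ex['keywords']) + '\n')
```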

BiggyBing commented 3 years ago

@memray Thanks for your reply. I think my problem has been addressed by your latest code.

Niyx52094 commented 2 years ago

The command "python preprocess.py -config config/preprocess/config-preprocess-keyphrase-magkp.yml" doesn't seem to work now, because there is no preprocess.py in the repo anymore, is there? The only preprocess.py I can find is in the keyphrase folder, and it doesn't take a -config argument.

memray commented 2 years ago

@Niyx52094 There's no more need for preprocess.py, since preprocessing is done on the fly by the transforms specified in the config file.

I'll share more config files (just realized that they were not there).
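For reference, on-the-fly preprocessing in an OpenNMT-py 2.x-style config looks roughly like the sketch below; the corpus name, paths, and the keyphrase-specific transform name are assumptions, not the repo's actual config:

```yaml
# Illustrative sketch only: transforms are applied on the fly during training.
# "keyphrase" is an assumed transform name; filtertoolong is a standard
# OpenNMT-py transform.
data:
    magkp:
        path_src: data/magkp/magkp_training.json
        path_tgt: data/magkp/magkp_training.json
        transforms: [keyphrase, filtertoolong]
src_seq_length: 1000  # filtertoolong drops examples longer than this
tgt_seq_length: 8
```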

Niyx52094 commented 2 years ago


Great, thank you for your quick response! Actually, I am trying to use CopyRNN (not CNN, Transformer, or BART for now) on another dataset (OpenKP). In this case, I want to build the training.src and training.tgt files, as well as the vocab.json file, in your data format. So, do you mean that after I use kp_data_converter.py to get the .src and .tgt files, and get vocab.json from build_vocab.py (or can I build the vocab myself?), I can then use the yml in the config folder to get my result? Hope to get your answer soon. Thank you!

memray commented 2 years ago

@Niyx52094 Good question. I upgraded the codebase to the latest OpenNMT, so the current pipeline loads data directly from raw JSON files (there's no need to use kp_data_converter.py to generate .src/.tgt files). Yes, the vocab is generated prior to training, so you need to run build_vocab.py. magkp20k.vocab.json can be downloaded here.
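As a rough illustration of what the vocab-building step amounts to (this is not the repo's build_vocab.py; the path, field names, tokenization, and output format are all assumptions):

```python
import json
from collections import Counter

counter = Counter()
with open('data/magkp/magkp_training.json') as fin:  # assumed path/schema
    for line in fin:
        ex = json.loads(line)
        # Naive whitespace tokenization; the real script's tokenizer may differ.
        counter.update((ex['title'] + ' ' + ex['abstract']).lower().split())

# Keep the 50k most frequent tokens (a typical vocab size, chosen arbitrarily).
vocab = [tok for tok, _ in counter.most_common(50000)]
with open('data/magkp/vocab.json', 'w') as fout:
    json.dump(vocab, fout)
```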