Closed jordane95 closed 2 years ago
It's basically the same; for the BART model, I changed the max length and the sep token in the `convert_data.py` script.
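A minimal sketch of the two changes described above (illustrative only, not the exact `convert_data.py` code; the function name and word-level truncation are assumptions — the real script works at the subword level):

```python
# Illustrative sketch of adapting preprocessing for BART:
# a larger max length and BART's own separator token.

MAX_LEN = 1024        # BART supports 1024 positions (BERT-style models use 512)
SEP_TOKEN = "</s>"    # BART's separator (BERT-style models use "[SEP]")

def build_input(context: str, question: str, answer: str,
                max_len: int = MAX_LEN, sep: str = SEP_TOKEN) -> str:
    """Join context/question/answer with the model's sep token and
    truncate to a rough max length (word-level here for simplicity)."""
    joined = f" {sep} ".join([context, question, answer])
    words = joined.split()
    return " ".join(words[:max_len])

print(build_input("A long passage ...", "What happened?", "He left."))
# -> A long passage ... </s> What happened? </s> He left.
```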
I followed your strategy to preprocess the data (except that the max length is still 512, since 1024 exceeds my GPU memory) and to train the BART model, but I can't get the same BLEU score as reported in your README file. Did I miss any details regarding training/evaluation?
- training

```
tfkit-train --savedir ./race_cqa_gen_d_bart/ --train ./processed_data/race_train_updated_cqa_dsep_a_bart.csv --test ./processed_data/race_test_updated_cqa_dsep_a_bart.csv --model seq2seq --config facebook/bart-base --batch 9 --epoch 10 --grad_accum 2 --no_eval
```
- evaluation

```
tfkit-eval --model race_cqa_gen_d_bart/10.pt --valid ./processed_data/race_test_updated_cqa_dall_bart.csv --metric nlg
```
In the generated `10.pt_dataset_processed_data_race_test_updated_cqa_dall_bartcsv_mode_greedy_filtersim_False_score.csv`, the scores are:

```
TASK: nlg , 0
{'Bleu_1': 0.12619165240639885, 'Bleu_2': 0.08175190811168323, 'Bleu_3': 0.05961654844473563, 'Bleu_4': 0.04657402612570478, 'ROUGE_L': 0.24760089629390386, 'CIDEr': 0.5952911696123818}
```
Why is there such a big degradation in performance?
After checking, it is because of the evaluation preprocessing. In the distractor generation case, evaluation needs to compare the output against multiple targets. The previous code used the SEP token to separate the targets, but not all models use SEP as the separator token, so I now use a new token in tfkit. Using BART's sep token mistakenly merges all the targets into one, which yields a much lower score.
You can refer to the fix for this issue here: https://github.com/voidful/BDG/blob/5036753c5fdc459be95f2971af0011483c90259a/data_preprocessing/convert_data_bart.py#L63
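A sketch of the failure mode described above (assumed behavior for illustration, not the exact tfkit code; `[SEP]` here stands in for whatever dedicated separator the fixed script uses):

```python
# Why separating multiple gold distractors with the model's own sep token
# breaks evaluation: BART treats "</s>" as a special token, so it can be
# stripped or normalized away during decoding, collapsing the separate
# reference targets into one long string before metrics are computed.

SEP_CUSTOM = "[SEP]"  # a dedicated separator token (assumption for this sketch)

def split_targets(raw: str, sep: str) -> list[str]:
    """Split a combined target string into individual references."""
    return [t.strip() for t in raw.split(sep) if t.strip()]

# With a separator that survives preprocessing, the references stay distinct:
combined = "he was late[SEP]he missed the bus[SEP]he overslept"
print(split_targets(combined, SEP_CUSTOM))
# -> ['he was late', 'he missed the bus', 'he overslept']

# If the separator was a special token that got removed, everything becomes
# a single merged reference, and BLEU against it scores much lower:
merged = "he was late he missed the bus he overslept"
print(split_targets(merged, SEP_CUSTOM))
# -> ['he was late he missed the bus he overslept']
```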
Using the same code, I get this score at epoch 5:

```
{'Bleu_1': 0.41048632218844094, 'Bleu_2': 0.2612788659326824, 'Bleu_3': 0.18364643470982944, 'Bleu_4': 0.13581344598264192, 'ROUGE_L': 0.3576483681643398, 'CIDEr': 0.6429426557255493}
```
Also, you can add `--likelihood pos` or `--likelihood both` to use the Multi-tasking and Negative Answer Training strategies proposed in our paper, like this:

```
tfkit-train --savedir ./race_cqa_gen_d_bart/ --train ./processed_data/race_train_updated_cqa_dsep_a_bart.csv --test ./processed_data/race_test_updated_cqa_dsep_a_bart.csv --model seq2seq --config facebook/bart-base --batch 9 --epoch 10 --grad_accum 2 --no_eval --likelihood pos
```
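A toy sketch of the idea behind the `--likelihood` flags as I read it from the paper (my own simplification, not tfkit's actual implementation or loss names): alongside the usual log-likelihood on the gold distractor, add an unlikelihood-style term that pushes probability mass away from the correct answer, so the model avoids generating the answer as a distractor.

```python
import math

def nll(token_probs):
    """Negative log-likelihood: reward high probability on these tokens."""
    return -sum(math.log(p) for p in token_probs)

def unlikelihood(token_probs):
    """Penalize high probability on 'negative' tokens (the correct answer)."""
    return -sum(math.log(1.0 - p) for p in token_probs)

# Toy per-token probabilities the model assigns (made-up numbers):
p_distractor = [0.4, 0.5, 0.3]  # tokens of a gold distractor -> want high
p_answer     = [0.6, 0.7]       # tokens of the correct answer -> want low

# "pos"-style training would combine both terms; plain training uses nll only.
loss = nll(p_distractor) + unlikelihood(p_answer)
```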
Thanks. After switching to the new SEP token, I get nearly the same score as yours!
Hi, I'm following your wonderful work on distractor generation. May I know how you preprocessed the RACE dataset for the BART model? In the instructions in the README file, you mention the `race_train_updated_cqa_dsep_a_bart.csv` file, but I can't find the corresponding preprocessing code in your `convert_data.py` script. Is it the same as `race_train_updated_cqa_dsep_a.csv`? Thanks.