voidful / BDG

Code for "A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies."
https://voidful.github.io/DG-Showcase/

How did you preprocess the data for BART? #9

Closed jordane95 closed 2 years ago

jordane95 commented 2 years ago

Hi, I'm following your wonderful work on distractor generation. May I know how you preprocessed the RACE dataset for the BART model? In the instructions in the README file, you mention the race_train_updated_cqa_dsep_a_bart.csv file, but I can't find the corresponding preprocessing code in your convert_data.py script. Is it the same as race_train_updated_cqa_dsep_a.csv? Thanks.

voidful commented 2 years ago

It's basically the same; for the BART model, I changed the max length and the sep token in the convert_data.py script.
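That change can be sketched roughly as follows. The field layout, the `make_input` helper, and the separator string are illustrative assumptions for this sketch, not the repo's actual convert_data.py code:

```python
# Sketch: building a BART-style input row from a RACE example.
# BART uses "</s>" as its separator (BERT uses "[SEP]") and supports
# inputs up to 1024 tokens (vs. 512 for BERT-family models).

MAX_LEN = 1024   # token budget for BART
SEP = "</s>"     # BART's sep token

def make_input(context: str, question: str, answer: str,
               max_chars: int = MAX_LEN * 4) -> str:
    """Concatenate context, question, and answer with the model's sep token."""
    text = f"{context} {SEP} {question} {SEP} {answer}"
    # Rough character-level truncation; real code would truncate by tokens.
    return text[:max_chars]
```

Swapping `SEP` and `MAX_LEN` is the whole difference between the BERT and BART CSV variants.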

jordane95 commented 2 years ago

I followed your strategy to preprocess the data (except that the max length is still 512, since 1024 exceeds my GPU memory) and to train the BART model, but I can't get the same BLEU score as reported in your README file. Did I miss some details regarding training/evaluation?

voidful commented 2 years ago

> I followed your strategy to preprocess the data (except that the max length is still 512, since 1024 exceeds my GPU memory) and to train the BART model, but I can't get the same BLEU score as reported in your README file. Did I miss some details regarding training/evaluation?

  • training
tfkit-train --savedir ./race_cqa_gen_d_bart/ --train ./processed_data/race_train_updated_cqa_dsep_a_bart.csv --test ./processed_data/race_test_updated_cqa_dsep_a_bart.csv --model seq2seq  --config facebook/bart-base --batch 9 --epoch 10 --grad_accum 2 --no_eval
  • evaluation
tfkit-eval --model race_cqa_gen_d_bart/10.pt --valid ./processed_data/race_test_updated_cqa_dall_bart.csv --metric nlg

In the generated 10.pt_dataset_processed_data_race_test_updated_cqa_dall_bartcsv_mode_greedy_filtersim_False_score.csv, the scores are

TASK: nlg , 0
{'Bleu_1': 0.12619165240639885, 'Bleu_2': 0.08175190811168323, 'Bleu_3': 0.05961654844473563, 'Bleu_4': 0.04657402612570478, 'ROUGE_L': 0.24760089629390386, 'CIDEr': 0.5952911696123818}

Why is there such a big degradation in performance?

After checking, it is because of the evaluation preprocessing. In the distractor generation case, each prediction needs to be compared against multiple targets. The previous code used the SEP token to separate the targets, but not all models use SEP as a separator, so I introduced a new token in tfkit. Using BART's sep token mistakenly merges all the targets into one reference, which yields a much lower score.

You can refer to the fix here: https://github.com/voidful/BDG/blob/5036753c5fdc459be95f2971af0011483c90259a/data_preprocessing/convert_data_bart.py#L63
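The failure mode described above can be illustrated with a small sketch. The dedicated separator name (`[D_SEP]`) and the `split_targets` helper are assumptions for illustration, not tfkit's actual token or code:

```python
# Sketch: splitting a multi-target reference string for evaluation.
# If the separator between gold distractors collides with a token the
# model itself emits (e.g. BART's "</s>"), the three references merge
# into one long string and n-gram overlap scores collapse. A token
# that never appears in model output keeps them distinct.

TARGET_SEP = "[D_SEP]"  # assumed dedicated separator

def split_targets(raw: str, sep: str = TARGET_SEP) -> list:
    """Split a concatenated target string into individual distractors."""
    return [t.strip() for t in raw.split(sep) if t.strip()]

merged = "dog [D_SEP] cat [D_SEP] bird"
print(split_targets(merged))  # → ['dog', 'cat', 'bird']
```

With the targets split correctly, BLEU/ROUGE/CIDEr each compare the prediction against three separate references instead of one merged string.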

Using the same code, I get this score at epoch 5: {'Bleu_1': 0.41048632218844094, 'Bleu_2': 0.2612788659326824, 'Bleu_3': 0.18364643470982944, 'Bleu_4': 0.13581344598264192, 'ROUGE_L': 0.3576483681643398, 'CIDEr': 0.6429426557255493}

Also, you can add --likelihood pos or --likelihood both to use the Multi-tasking and Negative Answer Training strategies proposed in our paper:

tfkit-train --savedir ./race_cqa_gen_d_bart/ --train ./processed_data/race_train_updated_cqa_dsep_a_bart.csv --test ./processed_data/race_test_updated_cqa_dsep_a_bart.csv --model seq2seq  --config facebook/bart-base --batch 9 --epoch 10 --grad_accum 2 --no_eval --likelihood pos 
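The intuition behind the negative answer training flag can be sketched with a toy unlikelihood term: besides maximizing the likelihood of the distractor target, the model is penalized for assigning probability to the correct answer's tokens. This `unlikelihood` function is an assumption illustrating the general idea, not tfkit's actual loss implementation:

```python
import math

def unlikelihood(p_answer_tokens):
    """Mean unlikelihood loss -log(1 - p) over the probabilities the
    model assigns to the (correct) answer tokens; the 1e-8 term guards
    against log(0) when a probability is exactly 1."""
    return -sum(math.log(1.0 - p + 1e-8) for p in p_answer_tokens) / len(p_answer_tokens)

# Near-zero probability on answer tokens → near-zero penalty;
# higher probability on answer tokens → growing penalty.
print(unlikelihood([0.01, 0.02]))
print(unlikelihood([0.5, 0.6]))
```

In training, a term like this would be added to the standard cross-entropy loss on the distractor target, steering generation away from simply restating the answer.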
jordane95 commented 2 years ago

Thanks. After adopting the new sep token, I get nearly the same score as yours!