voidful / BDG

Code for "A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies."
https://voidful.github.io/DG-Showcase/

Different results with different tfkit version #10

Open jyw777 opened 2 years ago

jyw777 commented 2 years ago

Hi, I'm trying to reproduce your fantastic results based on the BART model. I used the trained model you provided: https://github.com/voidful/BDG/releases/download/v2.0/BDG_ANPM.pt

When I use tfkit==0.7.0 (suggested by the README), I get this result: {'Bleu_1': 0.4116063603355367, 'Bleu_2': 0.2629480211200134, 'Bleu_3': 0.19128546675900487, 'Bleu_4': 0.1484759134861437, 'ROUGE_L': 0.2184638476496905, 'CIDEr': 0.07954905358236805}. The ROUGE_L value is much lower than the reported one, while the BLEU values are similar to the reported ones. Evaluation takes about half an hour.

However, when I use tfkit==0.8.1 (the latest), I get this result: {'Bleu_1': 0.40226892712763984, 'Bleu_2': 0.2566475644205321, 'Bleu_3': 0.18535836171285228, 'Bleu_4': 0.14348238003117275, 'ROUGE_L': 0.3556143135035776, 'CIDEr': 0.6532226297900213}. These values are similar to the reported ones, but evaluation takes much longer (about 2.5 hours) on the same GPU, and tqdm doesn't show a progress bar.

I was wondering why different tfkit versions produce different results and different evaluation times. Which version should I use? Thank you very much!

voidful commented 2 years ago

In the newer version, I fixed an issue in prediction evaluation (the eval score was not always calculated against all targets) and improved efficiency by taking past key values into account. It seems to have had the reverse effect on speed; I will look into that.
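To illustrate why the multi-target fix raises ROUGE_L, here is a minimal sketch (not tfkit's actual code) showing the difference between scoring a prediction against only the first reference versus taking the best score over all references, using a simple LCS-based ROUGE-L F1:

```python
# Illustrative sketch only: demonstrates how scoring against all targets
# (max over references) can yield a higher ROUGE-L than scoring against
# a single target, as in the evaluation fix described above.

def lcs_len(a, b):
    # Dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(pred, ref):
    pred_toks, ref_toks = pred.split(), ref.split()
    lcs = lcs_len(pred_toks, ref_toks)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(pred_toks), lcs / len(ref_toks)
    return 2 * p * r / (p + r)

pred = "the capital of france is paris"
refs = ["france 's capital city is paris", "paris is the capital of france"]

single = rouge_l_f1(pred, refs[0])              # first reference only
best = max(rouge_l_f1(pred, r) for r in refs)   # best over all references
print(single, best)  # prints 0.5 0.6666666666666666
```

With multiple valid distractors per question, evaluating against all of them and keeping the best match is the standard convention for multi-reference metrics, which would explain the higher ROUGE_L and CIDEr under 0.8.1.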