jyw777 opened this issue 2 years ago (status: Open)
In the newer version, I fixed some issues with the prediction evaluation (the eval score was not always calculated against all targets) and tried to improve efficiency by taking past key values into account. That change seems to have had the opposite effect, so I will look into it.
Hi, I'm trying to reproduce your fantastic results with the BART model. I am using the trained model you provided: https://github.com/voidful/BDG/releases/download/v2.0/BDG_ANPM.pt
When I use tfkit==0.7.0 (the version suggested by the README), I get this result:
{'Bleu_1': 0.4116063603355367, 'Bleu_2': 0.2629480211200134, 'Bleu_3': 0.19128546675900487, 'Bleu_4': 0.1484759134861437, 'ROUGE_L': 0.2184638476496905, 'CIDEr': 0.07954905358236805}
The ROUGE_L value is much lower than the reported value, while the BLEU values are similar to the reported ones. Evaluation takes about half an hour.
However, when I use tfkit==0.8.1 (the latest version), I get this result:
'Bleu_1': 0.40226892712763984, 'Bleu_2': 0.2566475644205321, 'Bleu_3': 0.18535836171285228, 'Bleu_4': 0.14348238003117275, 'ROUGE_L': 0.3556143135035776, 'CIDEr': 0.6532226297900213
These values are similar to the reported ones, but evaluation takes much longer (about 2.5 hours) on the same GPU, and tqdm doesn't show a progress bar.
I was wondering why different tfkit versions produce different results and different evaluation times. Which version should I use? Thank you very much!
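
To make the gap easier to see, here is a minimal sketch that lines up the two runs metric by metric. The scores are copied from the outputs above; the names `run_070`, `run_081`, and `compare` are made up for illustration and are not part of tfkit, and the minute counts are only my rough timings (about half an hour versus about 2.5 hours).

```python
# Scores copied from the two evaluation runs reported in this issue.
run_070 = {
    'Bleu_1': 0.4116063603355367, 'Bleu_2': 0.2629480211200134,
    'Bleu_3': 0.19128546675900487, 'Bleu_4': 0.1484759134861437,
    'ROUGE_L': 0.2184638476496905, 'CIDEr': 0.07954905358236805,
}
run_081 = {
    'Bleu_1': 0.40226892712763984, 'Bleu_2': 0.2566475644205321,
    'Bleu_3': 0.18535836171285228, 'Bleu_4': 0.14348238003117275,
    'ROUGE_L': 0.3556143135035776, 'CIDEr': 0.6532226297900213,
}


def compare(old, new):
    """Hypothetical helper: map each shared metric to (old, new, new - old)."""
    return {name: (old[name], new[name], new[name] - old[name])
            for name in old if name in new}


diffs = compare(run_070, run_081)
for name, (old, new, delta) in diffs.items():
    print(f"{name:8s} {old:.4f} -> {new:.4f} ({delta:+.4f})")

# Approximate wall-clock evaluation times in minutes (rough figures from this post).
eval_minutes = {'0.7.0': 30, '0.8.1': 150}
slowdown = eval_minutes['0.8.1'] / eval_minutes['0.7.0']
print(f"tfkit 0.8.1 evaluation is roughly {slowdown:.1f}x slower")
```

The BLEU scores move by less than 0.01 between versions, while ROUGE_L and CIDEr change substantially, so the difference looks specific to how those two metrics are computed rather than to the generated text itself.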