Hi,
In the function EvalStrs(pred_strs, golds) in utils.py, I am not sure whether the use of bleu_score(candidate, references) is correct. I checked the torchtext documentation: the inputs to bleu_score should be an iterable of candidate translations and an iterable of iterables of reference translations. But in the current code, the inputs look like [['a','b'], ['c','d']] and [['e','f'], ['g','h']], i.e. the references are only nested one level deep. When I tested with two identical inputs, the BLEU score was 0, which suggests something is wrong. Are the inputs to bleu_score correct during evaluation, or is there a problem with my understanding?
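To illustrate what I mean about the nesting, here is a minimal sketch of corpus BLEU (my own simplified version, not torchtext's implementation; it omits the brevity penalty, which is 1 for identical inputs anyway). With the extra nesting level, identical inputs score 1.0; with the flat nesting currently used, each token gets treated as a whole reference sentence, so even identical data scores 0:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    # all contiguous n-grams of a token sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidates, references, max_n=4):
    """Corpus BLEU with uniform weights and no brevity penalty.
    `references[i]` must be a LIST of reference token lists for
    `candidates[i]` (the nesting torchtext's bleu_score expects)."""
    precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        for cand, refs in zip(candidates, references):
            cand_counts = Counter(ngrams(cand, n))
            # clip each n-gram count against the best-matching reference
            max_ref = Counter()
            for ref in refs:
                for g, c in Counter(ngrams(ref, n)).items():
                    max_ref[g] = max(max_ref[g], c)
            matched += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            total += sum(cand_counts.values())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0
    # geometric mean of the n-gram precisions
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

cands = [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]

# Correct nesting: one list of reference sentences per candidate.
print(simple_bleu(cands, [[c] for c in cands]))  # 1.0

# Flat nesting as in the current code: identical data still scores 0,
# matching what I observed.
print(simple_bleu(cands, cands))  # 0.0
```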
Thanks.