Potential error in BLEU score computation

Code in question is here: https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py#L70

When computing the reference length, the current version sums, over all segments, the length of the shortest reference. I'm not sure whether this was done on purpose, but the original BLEU paper (section 2.2.2) suggests using, for each candidate translation, the length of the reference translation that is closest to the candidate's length.
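A minimal sketch of the closest-length rule described above, in case it helps the discussion. The function names are hypothetical and not part of bleu.py, and breaking ties in favour of the shorter reference is my assumption; the paper does not spell out a tie-breaking rule.

```python
def closest_reference_length(references, candidate_length):
    """Return the length of the reference closest in length to the candidate.

    Ties are broken in favour of the shorter reference (an assumption).
    """
    return min(
        (len(reference) for reference in references),
        key=lambda ref_len: (abs(ref_len - candidate_length), ref_len),
    )


def effective_reference_length(reference_corpus, translation_corpus):
    """Sum the closest reference length over all segments of the corpus."""
    return sum(
        closest_reference_length(references, len(translation))
        for references, translation in zip(reference_corpus, translation_corpus)
    )
```

This would stand in for taking the minimum reference length per segment; the rest of the computation would not need to change.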

An example:

candidate_corpus = [['I', 'ate', 'an', 'apple']]
references_corpus = [[['Bye'], ['I', 'ate', 'an', 'apple', 'here']]]
compute_bleu(references_corpus, candidate_corpus, max_order=4, smooth=False)

The output of this is (1.0, [1.0, 1.0, 1.0, 1.0], 1.0, 4.0, 4, 1).

Specifically, the reference length (the last number in the output) is 1, because ['Bye'] is the shortest reference. It should be 5, since the other reference is the closest in length to the 4-token candidate.

With the current reference length of 1, the ratio is 4/1 = 4, so the brevity penalty is 1 and the BLEU score is 1.
With the corrected reference length of 5, the ratio is 4/5 = 0.8, so the brevity penalty is exp(1 - 1/0.8) ≈ 0.7788 and, since all n-gram precisions are 1, the BLEU score is also ≈ 0.7788.
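The arithmetic above can be checked with a small standalone sketch of the brevity penalty. The function name is hypothetical, and returning 0 for an empty translation is my assumption to avoid a division by zero.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """Brevity penalty as a function of the candidate/reference length ratio."""
    if translation_length == 0:
        return 0.0  # assumption: an empty translation is penalised fully
    ratio = translation_length / reference_length
    if ratio > 1.0:
        return 1.0
    return math.exp(1.0 - 1.0 / ratio)


# Reference length as currently computed (minimum) vs. closest-length rule.
print(brevity_penalty(4, 1))            # 1.0
print(round(brevity_penalty(4, 5), 4))  # 0.7788
```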
