Inconsistent scores between loop and separate check

Description: With SacreBLEU, I am getting inconsistent scores between a loop implementation and a separate check of individual translations. The call I am using is:

sacrebleu.sentence_bleu(sys, [refs])

Scenario: I am calculating BLEU scores for translations using both a loop and individual checks.

Expected Behavior: I expect consistent scores between the loop and the separate checks for the same translations.

Actual Behavior: The scores obtained from the loop implementation differ from the scores obtained from the separate check, even when using the same translation and reference pairs.

Example: Here is an example that demonstrates the discrepancy:

Translation: sys4 = "..." # Example translation
Reference: ref4 = ["..."] # Example reference
Expected Score (separate check): 100.0004
Actual Score (loop): 31.94

Steps to Reproduce:

1. Load the necessary data and libraries.
2. Implement the loop calculation using SacreBLEU, storing scores for each translation.
3. Perform a separate check for a specific translation and reference pair, using the same SacreBLEU calculation.
4. Compare the scores obtained from the loop and the separate check.

Additional Information:

I have tried modifying the code, removing any potential sources of error, but the discrepancy persists. I have verified that the data inputs are aligned correctly, and the sentence preprocessing is consistent. I suspect there might be an issue related to how SacreBLEU is utilized in the loop implementation. Any guidance or insight into this issue would be greatly appreciated. Thank you!
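One way to narrow this down is sketched below. It takes the scores already stored by the loop, recomputes each pair separately, and reports the indices where the two disagree. It also checks that every pair has the shape (string, list of strings), since wrapping a list of references in another list is an easy mistake to make. The names `check_pair`, `find_mismatches`, and `toy_score` are hypothetical helpers introduced for illustration, and `toy_score` is only a stand-in scorer so the sketch runs without SacreBLEU installed.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes one hypothesis string and its list of reference strings.
ScoreFn = Callable[[str, Sequence[str]], float]


def check_pair(hyp: str, refs: Sequence[str]) -> None:
    """Raise TypeError unless the pair is shaped as (str, list of str)."""
    if not isinstance(hyp, str):
        raise TypeError(f"hypothesis must be a str, got {type(hyp).__name__}")
    if isinstance(refs, str) or not all(isinstance(r, str) for r in refs):
        raise TypeError("references must be a list of str, not a str or nested list")


def find_mismatches(
    loop_scores: Sequence[float],
    hyps: Sequence[str],
    refs_per_hyp: Sequence[Sequence[str]],
    score_fn: ScoreFn,
    tol: float = 1e-9,
) -> List[Tuple[int, float, float]]:
    """Return (index, loop_score, separate_score) for every disagreement."""
    if not (len(loop_scores) == len(hyps) == len(refs_per_hyp)):
        raise ValueError("scores, hypotheses, and references differ in length")
    mismatches = []
    for i, (hyp, refs) in enumerate(zip(hyps, refs_per_hyp)):
        check_pair(hyp, refs)
        separate = score_fn(hyp, refs)  # the "separate check" for pair i
        if abs(separate - loop_scores[i]) > tol:
            mismatches.append((i, loop_scores[i], separate))
    return mismatches


def toy_score(hyp: str, refs: Sequence[str]) -> float:
    """Stand-in scorer (NOT BLEU): 100.0 on an exact match, else 0.0."""
    return 100.0 if hyp in refs else 0.0
```

To run it against SacreBLEU, `score_fn` could be something like `lambda h, r: sacrebleu.sentence_bleu(h, r).score`. If a mismatch shows up at index 4, for example, the loop is most likely passing a different hypothesis or reference at that position than the separate check does, for instance through an index offset or an extra level of list nesting.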

mjpost / sacrebleu #237