microsoft / CodeBERT


[BUG] bleu.py appears to calculate the bleu score based on individual characters and not tokens. #180

Closed · SThomasen7 closed this issue 1 year ago

SThomasen7 commented 1 year ago

Is this the intended behavior? I'm using CodeBERT code2nl for a summarization task, but I have found that the BLEU-4 scores are substantially higher than they should be. Comparing the same files with CodeBERT's bleu.py and with NLTK sentence BLEU, CodeBERT reports BLEU-4 as 15.93 and NLTK reports BLEU-4 as 2.34.

I've included the sample files example.gold and example.output, which are dev.x files that I copied from a validation epoch. I've made a dummy bleu2.py that matches the inputs and outputs of CodeBERT's bleu.py.

I get the following results:

```
# CodeBERT implementation
python3 bleu.py example.gold < example.output
Total: 1000
15.937455620314582

# NLTK implementation
python3 bleu2.py example.gold < example.output
Total: 1000
B1 6.009075282124603
B2 4.117860906864496
B3 3.0636532772757037
B4 2.3371375326298556
B5 1.8423517949171309
```

The BLEU score I chose to use is based on the findings of On the Evaluation of Neural Code Summarization.

Is this the intended behavior of CodeBERT's evaluation metric?

Attachments: example.gold.txt, example.output.txt

Modified code2nl/bleu.py: bleu2.py.txt. Please use NLTK 3.6.x or higher, as there is a bug in earlier versions.
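For reference, a simplified sketch of the kind of NLTK-based comparison script described above (not the exact bleu2.py.txt attachment; the one-sentence-per-line whitespace tokenization, the per-sentence averaging, and the choice of SmoothingFunction().method1 are assumptions):

```python
# Simplified sketch of an NLTK-based comparison script (not the exact
# bleu2.py.txt attachment). Assumes one whitespace-tokenized sentence per
# line, references and hypotheses in the same order, and that the printed
# scores are per-sentence smoothed BLEU averaged over the file.
import sys
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def load(path):
    with open(path) as f:
        return [line.strip().split() for line in f]

def main():
    refs = load(sys.argv[1])                              # e.g. example.gold
    hyps = [line.strip().split() for line in sys.stdin]   # e.g. < example.output
    assert len(refs) == len(hyps)
    print("Total:", len(refs))
    smooth = SmoothingFunction().method1                  # NLTK >= 3.6 recommended
    for n in range(1, 6):
        weights = tuple(1.0 / n for _ in range(n))
        avg = sum(sentence_bleu([r], h, weights=weights, smoothing_function=smooth)
                  for r, h in zip(refs, hyps)) / len(refs)
        print(f"B{n} {100 * avg}")

if __name__ == "__main__":
    main()
```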

Update: I just used bleu.py from GraphCodeBERT and it reported 4.11, which seems to correspond to NLTK's sentence BLEU-2.

guoday commented 1 year ago

Please refer to the CodeBERT paper. The BLEU-4 script we use computes a smoothed BLEU-4 score, which is different from NLTK's.

guoday commented 1 year ago
  1. For the task of code summarization: we use smoothed BLEU-4, and we give the reason for using smoothed BLEU-4 in the CodeBERT paper.

  2. For the other generation tasks, like code translation or refinement in GraphCodeBERT: we use corpus-based BLEU-4, which is commonly used for translation tasks.

  3. For NLTK, I guess you used sentence-based BLEU-4, which is different from ours; the sketch after this list illustrates how these variants differ.

  4. The first author of On the Evaluation of Neural Code Summarization discussed this with me before he wrote the paper. I explained the calculation and usage scenarios of the different BLEU-4 scores.
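A toy illustration (using NLTK, not the actual CodeBERT/GraphCodeBERT scripts) of how the three BLEU-4 variants above can diverge on the same data; the example sentences are made up:

```python
# Toy illustration (not the actual CodeBERT/GraphCodeBERT scripts) of how the
# three BLEU-4 variants above can diverge on the same data, using NLTK.
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

references = [[["return", "the", "sum", "of", "two", "numbers"]],
              [["open", "the", "file", "and", "read", "lines"]]]
hypotheses = [["returns", "the", "sum", "of", "two", "values"],
              ["read", "lines", "from", "a", "file"]]

# Corpus-based BLEU-4: n-gram counts are pooled over the whole test set before
# the precisions are computed (the setting item 2 describes for GraphCodeBERT).
print("corpus BLEU-4  :", corpus_bleu(references, hypotheses))

# Sentence-based BLEU-4 without smoothing: any sentence with no matching
# 4-gram contributes (close to) zero.
plain = [sentence_bleu(refs, hyp) for refs, hyp in zip(references, hypotheses)]
print("sentence BLEU-4:", sum(plain) / len(plain))

# Sentence-based BLEU-4 with smoothing: zero n-gram precisions are replaced by
# small positive values, so short or low-overlap hypotheses score noticeably higher.
smooth = SmoothingFunction().method1
smoothed = [sentence_bleu(refs, hyp, smoothing_function=smooth)
            for refs, hyp in zip(references, hypotheses)]
print("smoothed BLEU-4:", sum(smoothed) / len(smoothed))
```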

SThomasen7 commented 1 year ago

I see.

I was having an issue where some 30 predictions that shared no tokens with the reference were giving me a BLEU-4 of 33. It turns out that because the docstrings were tokenized with SentencePiece, bleu.py was treating the character SentencePiece adds to the start of each word as a separate token, inflating the score from nearly 0 to 33. My apologies, this is an issue with how SentencePiece interacts with CodeBERT, not with CodeBERT itself.
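For anyone who hits the same thing, a minimal sketch of the detokenization step that avoids the inflation (the example pieces are made up, and the string-level replace is only a stand-in for SentencePiece's own decoder):

```python
# Sketch of the detokenization step that avoids the inflated scores described
# above. The pieces below are made up; "▁" (U+2581) is SentencePiece's
# word-boundary marker.
pieces = ["▁returns", "▁the", "▁sum", "▁of", "▁two", "▁num", "bers"]

# Feeding the raw pieces to a whitespace-based BLEU script means the "▁"
# marker travels with (or gets split from) every word, so references and
# hypotheses share spurious "tokens" and the score is inflated.
naive_tokens = " ".join(pieces).split()
print(naive_tokens)    # ['▁returns', '▁the', '▁sum', '▁of', '▁two', '▁num', 'bers']

# Detokenize first: join the pieces, turn the boundary marker back into a
# space, then split on whitespace before computing BLEU.
detokenized = "".join(pieces).replace("▁", " ").strip()
bleu_tokens = detokenized.split()
print(bleu_tokens)     # ['returns', 'the', 'sum', 'of', 'two', 'numbers']
```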