Closed SThomasen7 closed 1 year ago
Please refer to the CodeBERT paper. The BLEU-4 script we use is the smoothed BLEU-4 score, which is different from NLTK.
For the task of code summarization: we use smoothed BLEU-4, and the CodeBERT paper gives the reason for using smoothed BLEU-4.
For the other generation tasks in GraphCodeBERT, such as code translation or refinement: we use corpus-based BLEU-4, which is the variant usually used for translation tasks.
For NLTK, I guess you are using sentence-based BLEU-4, which is different from ours.
The first author of On the Evaluation of Neural Code Summarization discussed this with me before he wrote the paper, and I explained the calculation and usage scenarios of the different BLEU-4 scores.
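To make the distinction concrete, here is a minimal pure-Python sketch of a sentence-level BLEU-4 with add-one smoothing on the higher-order precisions. This is an illustrative stand-in for the smoothed BLEU used by CodeBERT's script, not the actual bleu.py code, and the example sentences are invented. It shows why short hypotheses with no 4-gram overlap score exactly 0 without smoothing but nonzero with it:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hyp, ref, smooth=False):
    """Sentence-level BLEU-4. With smooth=True, add-one smoothing is
    applied to the n>1 precisions (illustrative, not CodeBERT's exact code)."""
    log_prec_sum = 0.0
    for n in range(1, 5):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())  # clipped matches
        total = max(sum(h.values()), 1)
        if smooth and n > 1:
            p = (match + 1) / (total + 1)  # add-one smoothing
        else:
            p = match / total
        if p == 0:
            return 0.0  # unsmoothed score collapses to 0 on any zero precision
        log_prec_sum += math.log(p)
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec_sum / 4)

hyp = "returns the sum".split()
ref = "return the sum of two numbers".split()
print(bleu4(hyp, ref))               # 0.0 -- no 4-gram overlap
print(bleu4(hyp, ref, smooth=True))  # > 0 thanks to smoothing
```

This zero-collapse behavior on short summaries is exactly why a smoothed variant is preferred for summarization, while pooling counts over the whole corpus (corpus BLEU) sidesteps it differently for translation.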
I see.
I was having an issue where some 30 predictions that shared no tokens with the reference gave me a BLEU-4 of 33. It turns out that because the docstrings were tokenized with SentencePiece, bleu.py was treating the character SentencePiece adds to the start of a word as a separate token, inflating a score of nearly 0 to 33. My apologies: this is an issue with how SentencePiece interacts with CodeBERT's evaluation, not with CodeBERT itself.
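The inflation described above can be reproduced with a toy example. SentencePiece prefixes each word with the "▁" (U+2581) marker; if a tokenizer splits that marker off as its own token, every word in both sentences contributes a guaranteed matching token. The sentences below are hypothetical, and unigram precision stands in for the full BLEU computation:

```python
from collections import Counter

def unigram_precision(hyp, ref):
    """Fraction of hypothesis tokens that also appear in the reference
    (clipped counts), i.e. BLEU's 1-gram precision."""
    h, r = Counter(hyp), Counter(ref)
    match = sum(min(c, r[t]) for t, c in h.items())
    return match / len(hyp)

gold = "▁parse ▁the ▁config ▁file"
pred = "▁open ▁a ▁network ▁socket"  # shares no real words with gold

# Correct: keep the marker attached to its word -> no overlap at all.
print(unigram_precision(pred.split(), gold.split()))  # 0.0

# Buggy: split the marker off as its own token -> the markers match each other.
def split_marker(s):
    return s.replace("▁", "▁ ").split()

print(unigram_precision(split_marker(pred), split_marker(gold)))  # 0.5
```

Detokenizing (or stripping the markers) before scoring avoids the spurious matches.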
Is this the intended behavior? I'm using CodeBERT's code2nl for a summarization task, but I have found that the BLEU-4 scores are substantially higher than they should be. Comparing the same file, CodeBERT's bleu.py reports a BLEU-4 of 15.93, while NLTK's sentence BLEU reports 2.34.
I've included the sample files example.gold and example.output, which are dev.x files that I've copied from a validation epoch. I've made a dummy bleu2.py that matches the inputs and outputs of bleu.py from CodeBERT. I get the following results:
# CodeBert implementation
python3 bleu.py example.gold < example.output
Total: 1000
15.937455620314582
# NLTK implementation
python3 bleu2.py example.gold < example.output
Total: 1000
B1 6.009075282124603
B2 4.117860906864496
B3 3.0636532772757037
B4 2.3371375326298556
B5 1.8423517949171309
The bleu score I chose to use is based on the findings of On the Evaluation of Neural Code Summarization.
Is this the intended behavior of CodeBERT's evaluation metric?
Attachments: example.gold.txt, example.output.txt
Modified code2nl/bleu.py: bleu2.py.txt. Please use NLTK 3.6.x or higher, as there is a bug in earlier versions.
Update: I just used bleu.py from GraphCodeBERT and it reported 4.11, which seems to correspond to NLTK's sentence BLEU-2.
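Since the numbers above hinge on corpus-level versus sentence-level scoring, here is a pure-Python sketch contrasting the two for BLEU-2. This is an illustration of the general difference, not the actual GraphCodeBERT or NLTK code, and the sentence pairs are invented. Corpus BLEU pools n-gram counts over all pairs before computing precisions, whereas averaged sentence BLEU scores each pair separately, so a single zero-overlap prediction affects the two very differently:

```python
import math
from collections import Counter

def ngrams(toks, n):
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def corpus_bleu2(pairs):
    """Corpus-level BLEU-2: n-gram counts pooled over all (hyp, ref) pairs."""
    ps = []
    for n in (1, 2):
        match = total = 0
        for hyp, ref in pairs:
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match += sum(min(c, r[g]) for g, c in h.items())
            total += sum(h.values())
        ps.append(match / total if total else 0.0)
    if min(ps) == 0:
        return 0.0
    hyp_len = sum(len(h) for h, _ in pairs)
    ref_len = sum(len(r) for _, r in pairs)
    bp = min(1.0, math.exp(1 - ref_len / hyp_len))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in ps) / len(ps))

def avg_sentence_bleu2(pairs):
    """Mean of per-sentence BLEU-2 scores."""
    return sum(corpus_bleu2([p]) for p in pairs) / len(pairs)

pairs = [("add two numbers".split(), "add two numbers".split()),
         ("x".split(), "delete the temp file".split())]
print(corpus_bleu2(pairs))        # ~0.41: pooled counts
print(avg_sentence_bleu2(pairs))  # 0.5: per-sentence mean
```

The two quantities generally disagree, which is why a script implementing one cannot be compared directly against a script implementing the other.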