Errors in computing Code Bleu with some small code instances

microsoft / CodeXGLUE

Errors in computing Code Bleu with some small code instances #46

Closed. masies closed this issue 3 years ago.

masies commented 3 years ago

When I run the script for computing CodeBLEU on a single code instance, I sometimes hit a ZeroDivisionError. For example:

target code: private Map<String, ArrayList<Order>> getBuyOrders() { return buyOrders; }

predicted code: private HashMap<String, ArrayList<Order>> getBuyOrders() { return buyOrders; }

Running:

python calc_code_bleu.py --refs target.txt --hyp prediction.txt --lang java --params 0.25,0.25,0.25,0.25

I get this:

Traceback (most recent call last):
  File "calc_code_bleu.py", line 64, in <module>
    dataflow_match_score = dataflow_match.corpus_dataflow_match(references, hypothesis, args.lang)
  File "/content/CodeXGLUE/Code-Code/code-to-code-trans/evaluator/CodeBLEU/dataflow_match.py", line 58, in corpus_dataflow_match
    score = match_count / total_count
ZeroDivisionError: division by zero

Environment:

sentencepiece==0.1.94
torch==1.4.0 
transformers==3.5.0

Imagist-Shuo commented 3 years ago

Hi, CodeBLEU is computed at the corpus level. For the data-flow match score, it sums the number of reference data-flows over the whole corpus. In your case the corpus contains only one instance, and if no data-flow is extracted from that sample, the total count is 0 and the division raises a ZeroDivisionError. We have modified the script to handle this case and return a data-flow match score of 0. However, since no data-flow is extracted from the sample, I think this score can be ignored: you can compute the CodeBLEU score from the first three features by setting the hyper-parameters to 1/3, 1/3, 1/3, 0.
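
For readers who want to see the two ideas in this reply side by side, here is a minimal sketch. It is not the code from the repository: the function names `safe_dataflow_score` and `combine_code_bleu` are hypothetical, the component scores are made-up numbers, and the only detail taken from the thread is the `match_count / total_count` division shown in the traceback. The component order (n-gram match, weighted n-gram match, syntax match, data-flow match) is assumed to follow the order of the four values passed to `--params`.

```python
def safe_dataflow_score(match_count, total_count):
    """Return match_count / total_count, or 0.0 when there are no reference data-flows.

    Hypothetical helper: guards the division that raised ZeroDivisionError.
    """
    if total_count == 0:
        return 0.0
    return match_count / total_count


def combine_code_bleu(component_scores, weights):
    """Weighted sum of the four component scores.

    Assumed order: n-gram, weighted n-gram, syntax match, data-flow match.
    """
    if len(component_scores) != len(weights):
        raise ValueError("need exactly one weight per component score")
    return sum(w * s for w, s in zip(weights, component_scores))


# Made-up component scores for illustration only.
dataflow = safe_dataflow_score(match_count=0, total_count=0)  # no data-flow extracted
scores = (0.8, 0.8, 0.9, dataflow)

# A weight of 0 on the last component ignores the data-flow score entirely.
weights = (1 / 3, 1 / 3, 1 / 3, 0.0)
print(combine_code_bleu(scores, weights))
```

With the last weight set to 0, the value returned for the empty data-flow case no longer influences the combined score, which is why the reweighting and the zero fallback can be used together.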

masies commented 3 years ago

Thanks a lot for that. To be a little more precise about my use case: I need a metric that measures how much a code prediction (generated by a NN) differs from a given target. Do you think it is appropriate to use CodeBLEU for each pair, given that it may not extract the data-flow? Is it still reliable enough when it does extract the data-flow from a single snippet of code?

Imagist-Shuo commented 3 years ago

Hi, I think CodeBLEU still has some reference value when no data-flow can be extracted, but in that case it degenerates into simple token-level matching. And even when data-flow is extracted from a single snippet of code, I don't think the score is reliable enough. Just as BLEU is more meaningful at the corpus level, CodeBLEU is designed for the corpus level too.
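
To illustrate the corpus-level point, here is a rough sketch contrasting a pooled corpus score with an average of per-snippet scores. The counts are invented and both function names are hypothetical; this only mirrors the idea of summing data-flow counts over the whole corpus described earlier in the thread, not the repository's implementation.

```python
def corpus_level_score(pairs):
    """Pool counts over all samples, then divide once.

    pairs: list of (match_count, total_count) tuples, one per sample.
    """
    total_match = sum(m for m, _ in pairs)
    total = sum(t for _, t in pairs)
    return total_match / total if total else 0.0


def mean_per_sample_score(pairs):
    """Score each sample on its own and average, skipping samples with no data-flow."""
    scores = [m / t for m, t in pairs if t > 0]
    return sum(scores) / len(scores) if scores else 0.0


# Invented counts: one sample with no data-flow, one tiny sample, one larger sample.
pairs = [(0, 0), (1, 1), (3, 6)]

print(corpus_level_score(pairs))     # 4 matches out of 7 reference data-flows
print(mean_per_sample_score(pairs))  # average of 1.0 and 0.5
```

In the pooled version a sample with no data-flow simply contributes nothing, and a tiny sample cannot dominate the result, whereas per-snippet scores swing widely when each snippet has only a handful of data-flows.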