evaluating TransCoder on AVATAR test set and bug in compile.py?

wasiahmad / AVATAR

Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.

https://arxiv.org/abs/2108.11590

Creative Commons Attribution Share Alike 4.0 International


evaluating TransCoder on AVATAR test set and bug in compile.py? #1

Closed kunalpagarey closed 3 years ago

kunalpagarey commented 3 years ago

Hi @wasiahmad, I am using https://github.com/wasiahmad/AVATAR/blob/main/transcoder/run.sh to evaluate TransCoder on the AVATAR test data (test.java-python.java, test.java-python.python and test.jsonl), 1699 samples. The scores do not exactly match those in the paper. Could you please tell me if I am missing anything?

Also, I think the success and error variables are swapped in the line below, in the format function. As a result, the current output for Python to Java is Success - 1699, Errors - 0 instead of Success - 0, Errors - 1699. In that case the CA is 0 for Python to Java. Please correct me if I am wrong. https://github.com/wasiahmad/AVATAR/blob/7bcabc9c7ff82190e878ed0fd3d93e9eb86fea2e/evaluation/compile.py#L90
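The suspected swap can be illustrated with a minimal sketch. The names `count_results`, `format_report_swapped` and `format_report_fixed` are hypothetical and do not reproduce the actual code in `evaluation/compile.py`; the sketch only shows how unpacking a `(success, error)` pair in the wrong order inverts the printed report.

```python
def count_results(compiled_ok):
    """Return (success, error) counts from per-sample compile outcomes."""
    success = sum(1 for ok in compiled_ok if ok)
    error = len(compiled_ok) - success
    return success, error


def format_report_swapped(compiled_ok):
    # Bug (hypothetical reconstruction): the names are bound in the wrong order.
    error, success = count_results(compiled_ok)
    return f"Success - {success}, Errors - {error}"


def format_report_fixed(compiled_ok):
    # Names are bound in the same order in which count_results returns them.
    success, error = count_results(compiled_ok)
    return f"Success - {success}, Errors - {error}"


# Assumed scenario: none of the 1699 translated programs compile.
outcomes = [False] * 1699
print(format_report_swapped(outcomes))  # Success - 1699, Errors - 0
print(format_report_fixed(outcomes))    # Success - 0, Errors - 1699
```

Under this assumption, the swapped version reports every failure as a success, which matches the symptom described above.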

Thanks and Regards, Kunal Pagarey

wasiahmad commented 3 years ago

Hi, we recently updated the codebase and did not re-run the experiments afterwards. This could be a reason for the performance difference. Also, the results may vary slightly if the experiments are run on a different machine. By the way, how large a performance difference are you observing?

Yes, you are absolutely right, and thank you for pointing that out. It is a silly mistake. It means the CA scores we reported in our paper for Python to Java translation are not accurate; they indicate the percentage of errors.

kunalpagarey commented 3 years ago

Thanks for the quick reply. These are the scores that I got:

Java to Python:
BLEU: 37.31
Exact Match: 0.00
Ngram match: 38.32
Weighted ngram: 30.40
Syntax match: 28.58
Dataflow match: 30.51
CodeBLEU score: 31.95

Success - 1296, Errors - 403 [Syntax - 400, Indent - 3]

Python to Java:
BLEU: 50.55
Exact Match: 0.00
Ngram match: 44.80
Weighted ngram: 26.74
Syntax match: 36.56
Dataflow match: 13.60
CodeBLEU score: 30.42

Success - 1699, Errors - 0 [Total - 22676]
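As a sanity check on these numbers, the sketch below recomputes two quantities. It assumes that CodeBLEU is the equally weighted mean (0.25 each) of the four reported components, which the scores above are consistent with, and that compilation accuracy (CA) is the percentage of samples that compile successfully. The function names are illustrative and are not taken from the AVATAR codebase.

```python
def code_bleu(ngram, weighted_ngram, syntax, dataflow):
    """Equally weighted mean of the four CodeBLEU components (assumed weights)."""
    return 0.25 * (ngram + weighted_ngram + syntax + dataflow)


def compilation_accuracy(success, errors):
    """Percentage of samples that compiled, given success and error counts."""
    total = success + errors
    return 100.0 * success / total if total else 0.0


# Component scores copied from the results in this comment.
java_to_python = code_bleu(38.32, 30.40, 28.58, 30.51)
python_to_java = code_bleu(44.80, 26.74, 36.56, 13.60)
print(f"Java to Python CodeBLEU: {java_to_python:.4f}")  # 31.9525
print(f"Python to Java CodeBLEU: {python_to_java:.4f}")  # 30.4250

# Java to Python counts as reported: Success - 1296, Errors - 403.
print(f"Java to Python CA: {compilation_accuracy(1296, 403):.2f}")  # 76.28

# Python to Java counts after undoing the suspected swap: Success - 0, Errors - 1699.
print(f"Python to Java CA: {compilation_accuracy(0, 1699):.2f}")  # 0.00
```

The recomputed CodeBLEU values agree with the reported 31.95 and 30.42 up to rounding of the displayed components.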

wasiahmad commented 3 years ago

I think the numbers you got are reasonable.