Closed kunalpagarey closed 3 years ago
Hi, we have recently updated the codebase and have not re-run the experiments since, which could explain the performance difference. The results may also vary slightly when the experiments are run on a different machine. By the way, how much of a performance difference are you observing?
Yes, you are absolutely right, and thank you for pointing that out. It is a silly mistake: the CA scores we reported in our paper for Python to Java translation are not accurate; they actually indicate the percentage of errors.
Thanks for the quick reply. These are the scores that I got:

Java to Python: BLEU: 37.31, Exact Match: 0.00, Ngram match: 38.32, Weighted ngram: 30.40, Syntax match: 28.58, Dataflow match: 30.51, CodeBLEU score: 31.95
Success - 1296, Errors - 403 [Syntax - 400, Indent - 3]
Python to Java: BLEU: 50.55, Exact Match: 0.00, Ngram match: 44.80, Weighted ngram: 26.74, Syntax match: 36.56, Dataflow match: 13.60, CodeBLEU score: 30.42
Success - 1699, Errors - 0 [Total - 22676]
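To sanity-check the reports above, here is a minimal sketch of how compilation accuracy (CA) could be derived from the success/error counts, assuming CA is simply the fraction of samples that compile (the paper's exact definition may differ):

```python
# Compilation Accuracy (CA) from a compile report's counts.
# Assumption: CA = successes / total attempted samples, as a percentage.
def compilation_accuracy(success: int, errors: int) -> float:
    total = success + errors
    return 100.0 * success / total

# Java to Python report: Success - 1296, Errors - 403 (1699 samples total)
ca_java_to_python = compilation_accuracy(1296, 403)  # ≈ 76.28
```

If the Python-to-Java counts were actually Success - 0, Errors - 1699 (see the suspected swap below in the thread), the same formula would give a CA of 0.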
I think the numbers you got are reasonable.
Hi @wasiahmad, I am using https://github.com/wasiahmad/AVATAR/blob/main/transcoder/run.sh to evaluate TransCoder on the AVATAR test data (test.java-python.java, test.java-python.java, and test.jsonl), 1699 samples. The scores do not exactly match the paper. Could you please tell me if I am missing anything?
Also, I think the `success` and `error` variables are swapped in the `format` call on the line below, due to which the current output for Python to Java is Success - 1699, Errors - 0 instead of Success - 0, Errors - 1699. In that case the CA is 0 for Python to Java. Please correct me if I am wrong. https://github.com/wasiahmad/AVATAR/blob/7bcabc9c7ff82190e878ed0fd3d93e9eb86fea2e/evaluation/compile.py#L90

Thanks and Regards,
Kunal Pagarey
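For illustration, a minimal sketch (hypothetical function names, not AVATAR's actual code) of how passing the counts to a format string in the wrong order inverts the reported numbers:

```python
# Hypothetical reproduction of the suspected bug: if the success and
# error counts are passed to the report string in the wrong order,
# 0 successes and 1699 errors print as "Success - 1699, Errors - 0".
def report(success: int, errors: int) -> str:
    return "Success - {}, Errors - {}".format(success, errors)

success, errors = 0, 1699           # actual Python-to-Java outcome
buggy = report(errors, success)     # arguments swapped, as suspected
fixed = report(success, errors)     # arguments in the intended order
# buggy == "Success - 1699, Errors - 0"
# fixed == "Success - 0, Errors - 1699"
```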