wasiahmad / AVATAR

Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
https://arxiv.org/abs/2108.11590
Creative Commons Attribution Share Alike 4.0 International

Two questions about the AVATAR paper #7

Closed. weidotwisc closed this issue 2 years ago.

weidotwisc commented 2 years ago

Hello,

Thank you very much for the great work!

In order to evaluate some of the Code-ML models, I have two questions regarding the AVATAR paper (https://arxiv.org/pdf/2108.11590.pdf):

(1) In Section 2: "To train models, we chose a maximum of k (we set k to 3 based on validation performances) solutions in each language to form a maximum of k^2 training examples and consider all the accepted solutions as reference translations for validation and testing." I am wondering how exactly these k^2 pairs are selected during training. Suppose there are 5 candidate solutions in each language (Java/Python) for a problem; that gives 25 pairs to choose from. With k set to 3, we ought to choose 9 out of those 25 pairs, but there are many ways to pick 9 pairs out of 25, and I don't quite understand how exactly it is done. Could you point to where in the code this happens? Also, for validation and testing, does it mean that as long as the translated code matches any one of the Python reference targets (there could be as many as 5), it is counted as correct?

(2) The caption of Table 2 reads "CA stands for Computational Accuracy." Does CA actually stand for Compilation Accuracy?

Thanks!

Wei

wasiahmad commented 2 years ago

(1) We first select at least 20 solution examples per problem. Then we select a maximum of k of them to form k^2 pairs. How do we select k examples? The implementation is available here.
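For intuition, the pairing step can be sketched like this (illustrative only; the function name make_pairs and the choice of taking the first k accepted solutions are assumptions here, not the repo's exact logic, which lives in data/split.py):

```python
from itertools import product

def make_pairs(java_solutions, python_solutions, k):
    """Illustrative sketch: pair up to k solutions per language.

    Assumes the first k accepted solutions are kept; the actual
    selection is implemented in data/split.py in this repo.
    """
    java_k = java_solutions[:k]      # at most k Java solutions
    python_k = python_solutions[:k]  # at most k Python solutions
    # Cartesian product: up to k * k (source, target) training pairs.
    return list(product(java_k, python_k))

# Example: 5 candidates per language, k = 3 -> 9 pairs (not 25).
java = [f"java_sol_{i}" for i in range(5)]
python = [f"py_sol_{i}" for i in range(5)]
print(len(make_pairs(java, python, k=3)))  # 9
```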

(2) Yes, it should be Compilation Accuracy.

weidotwisc commented 2 years ago

Hi,

Thank you very much for the quick response! I am sorry I didn't ask the first question clearly enough. I believe your response was about the second paragraph of Section 2, "Preprocessing & Filtering"; my question was really about the third paragraph of that section, "Data Statistics." Assuming the preprocessing and filtering has been done (i.e., train.jsonl, valid.jsonl, and test.jsonl have been generated), the third paragraph says: "To train models, we chose a maximum of k (we set k to 3 based on validation performances) solutions in each language to form a maximum of k^2 training examples and consider all the accepted solutions as reference translations for validation and testing."

That is, what is the right source program and target program to send to the training model? Assume 5 candidates in each language have been chosen in the preprocessing step; in the training step one would need to pick one solution from Python and one from Java as one training example. I am curious how this selection is done. Also, for validation and testing, does it mean that as long as the translated code matches any one of the Python reference targets (there could be as many as 5), it is counted as correct?

Thanks!

Wei

wasiahmad commented 2 years ago

> That is, what is the right source program and target program to send to the training model? Assume 5 candidates in each language have been chosen in the preprocessing step; in the training step one would need to pick one solution from Python and one from Java as one training example. I am curious how this selection is done.

https://github.com/wasiahmad/AVATAR/blob/main/data/split.py#L59

> Also, for validation and testing, does it mean that as long as the translated code matches any one of the Python reference targets (there could be as many as 5), it is counted as correct?

Yes.
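Concretely, correctness here means matching any one of the available references. A toy sketch (the whitespace normalization below is a hypothetical stand-in, not AVATAR's evaluation code):

```python
def is_correct(prediction: str, references: list[str]) -> bool:
    """Toy multi-reference check: correct if the prediction matches
    ANY reference. The normalization is a stand-in for whatever
    tokenization the real evaluation applies."""
    normalize = lambda s: " ".join(s.split())
    return any(normalize(prediction) == normalize(r) for r in references)

refs = ["print(a + b)", "print(a+b)", "result = a + b\nprint(result)"]
print(is_correct("print(a  +  b)", refs))  # True: matches the first ref
```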

weidotwisc commented 2 years ago

Thank you very much for the response!

Just to double-check, per

https://github.com/wasiahmad/AVATAR/blob/main/data/split.py#L99-102

```python
single_prepare('train', args.k)
single_prepare('valid', 1)
single_prepare('test', 1)
```

It appears that args.k is 3, per the Data Statistics paragraph. However, I am a bit confused about why k=1 is used for the validation and test datasets. Per our discussion, k should be more like 5 (or as many candidate programs as there are for the valid/test datasets). Doesn't k=1 in this case mean we only choose the first candidate in Java and the first candidate in Python for each problem, ignoring all the other candidates?

Thanks!

Wei

wasiahmad commented 2 years ago

For training, we generate k^2 instances. For evaluation, we use the k instances as they are. For metric computation, we pass the file that has all 5 ground truths; see, for example, the CodeBERT evaluation. I suggest exploring the code a bit more carefully.
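For example, multi-reference scoring with sacrebleu looks roughly like this (a sketch only, not the repo's evaluation script; the actual scripts may tokenize and pad references differently):

```python
import sacrebleu

# Two translated programs (hypotheses), each with multiple references.
hypotheses = ["print(a + b)", "for i in range(n): total += i"]
references_per_example = [
    ["print(a + b)", "print(a+b)"],
    ["for i in range(n): total += i", "total = sum(range(n))"],
]

# sacrebleu wants reference *streams*: ref_streams[j][i] is the j-th
# reference for the i-th hypothesis.
max_refs = max(len(r) for r in references_per_example)
ref_streams = [
    [refs[j] if j < len(refs) else refs[0]  # duplicate-pad short lists;
     for refs in references_per_example]    # duplicates don't change BLEU
    for j in range(max_refs)
]

bleu = sacrebleu.corpus_bleu(hypotheses, ref_streams)
print(bleu.score)
```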

weidotwisc commented 2 years ago

Thanks a lot for the response! I will look into the code more.

Thanks!

Wei