hmdgit closed this issue 4 years ago.
Hi, I guess there is a problem in the data preprocessing stage.
Thanks for your response and clarification.
I have created a dataset by extracting code methods along with their accompanying comments. The comments serve as each method's natural language description (NLD). I then randomly distribute all the code/NLD pairs into training, validation, and testing sets. Finally, I obtain train.txt, valid.txt, and test.txt files in the format below, whose fields were described previously:
1<CODESPLIT>URL<CODESPLIT>returnType.methodName<CODESPLIT>[docString]<CODESPLIT>[code]
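For reference, one line in this format can be assembled with a simple join; all field values below are illustrative placeholders, not real data:

```python
# Assemble one line of a train/valid/test file in the <CODESPLIT> format.
# Every field value here is a placeholder.
fields = [
    "1",                                               # label (1 = positive pair)
    "https://example.com/repo",                        # URL placeholder
    "String.getName",                                  # returnType.methodName placeholder
    "Returns the user name.",                          # docstring (NLD)
    "public String getName() { return this.name; }",  # code
]
line = "<CODESPLIT>".join(fields)
print(line)
```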
I don't have a clear idea about positive and negative sampling. How should I split my dataset into positive and negative samples, and what is the purpose of this kind of sampling? I was also unable to understand this step in the paper. Do I need to randomly assign labels 1 and 0 to each instance and make sure that the training, validation, and testing sets each have balanced positive and negative samples?
Secondly, I have created test data in the same format as the training and validation sets. Will it work, or do I need to perform some other steps?
Please let me know your advice and guidance.
In this fine-tuning step, we learn representations of code and natural language (NL) through a binary classification task, so the dataset should contain both positive examples (code and NL from the same instance) and negative examples (code and NL from different instances).
The dataset you created contains only positive examples (each instance is denoted as (c, w)). We can randomly replace the code or the NL to construct negative examples. In our setting, the negative samples consist of a balanced number of instances with randomly replaced NL (i.e. (c, wˆ)) and randomly replaced code (i.e. (cˆ, w)).
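A minimal sketch of this balanced negative sampling, assuming each instance is a (code, nl) tuple (the function and variable names are mine, not from the repo's scripts):

```python
import random

def build_examples(pairs, seed=0):
    """Given positive (code, nl) pairs, return labeled positives plus an
    equal number of negatives: half with replaced NL, half with replaced code."""
    rng = random.Random(seed)
    examples = [(code, nl, 1) for code, nl in pairs]  # positives, label 1
    half = len(pairs) // 2
    for i, (code, nl) in enumerate(pairs):
        # pick a *different* instance to take the replacement NL or code from
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        if i < half:
            examples.append((code, pairs[j][1], 0))  # (c, w^): replaced NL
        else:
            examples.append((pairs[j][0], nl, 0))    # (c^, w): replaced code
    rng.shuffle(examples)
    return examples
```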
If you create test data in the same format as the training and validation sets, you can only get classification accuracy. Maybe that's enough for you. We need to calculate MRR to be consistent with the baselines, so we created a test set in accordance with them. You can decide for yourself whether to keep the data format consistent.
Thanks for the clarification.
I now understand the purpose of sampling. I take all the real pairs of code snippets and natural language descriptions (NLDs) and label them as positive [1]. Then I shuffle the NLDs, attach them one by one to the code snippets, and label those instances as negative [0]. Does this look fine for building the training and validation datasets?
I have gone through the process_data.py file, and it seems to me that the test data does not contain any negative samples. Am I right?
When I create a test file in the specified format ("test_0.jsonl"), it contains only around 1,000 data instances. But when I run the following script:
python process_data.py
it creates a roughly 2 GB text file named batch_0.txt. I am confused: why do so many instances in batch_0.txt come from test_0.jsonl?
I can categorize my data into multiple classes and label them as [0, 1, 2, 3]. I have changed ["0", "1"] in the utils.py file to ["0", "1", "2", "3"], but issues are coming up. Can the CodeBERT model be fine-tuned for a multi-class classification problem? If yes, what changes do I need to make in the code?
The CodeBERT paper mentions that MRR is calculated over 999 distractor codes, but I could not find this logic in the code.
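For context, MRR over 999 distractors means: for each query, score the correct code together with 999 other candidates, find the rank of the correct one, and average the reciprocal ranks. A minimal sketch (names are mine; the repo's mrr.py computes this from the classifier's scores):

```python
def mean_reciprocal_rank(score_lists):
    """score_lists: one list of scores per query, where index 0 is the
    correct candidate and the rest are distractors (e.g. 999 of them).
    Higher score = more relevant."""
    total = 0.0
    for scores in score_lists:
        correct = scores[0]
        # rank = 1 + number of distractors scored strictly higher than the answer
        rank = 1 + sum(1 for s in scores[1:] if s > correct)
        total += 1.0 / rank
    return total / len(score_lists)

# Two queries: the first is ranked 1st, the second 2nd -> (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([[0.9, 0.1], [0.2, 0.8]]))
```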
Can you please clarify all the above points and let me know about your kind feedback?
Thanks for the clarification and guidance. When I fine-tune for multi-class classification, I get the following error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
However, when I change it to the following, CodeBERT successfully fine-tunes on the multi-class classification problem:
f1 = f1_score(y_true=labels, y_pred=preds,average='weighted')
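As a small standalone check: sklearn's f1_score indeed raises that ValueError for multi-class targets with the default average='binary', and works once average='weighted' is passed (labels and predictions below are made up):

```python
from sklearn.metrics import f1_score

labels = [0, 1, 2, 3, 2, 1]  # four classes
preds  = [0, 1, 2, 2, 2, 0]

# average='weighted' computes per-class F1 and averages it weighted by
# class support, which is well-defined for multi-class targets.
f1 = f1_score(y_true=labels, y_pred=preds, average='weighted')
print(f1)
```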
For the documentation generation task, I am confused about preparing the dataset. I use the real pairs of natural language descriptions and code snippets and place them in jsonl.gz format. It seems there is no place for negative instances in the jsonl.gz format, so I don't use them. Is that right?
In order to fine-tune CODEBERT (MLM, INIT=ROBERTA), I need to create training and validation datasets (train.txt and valid.txt) in which I replace 15% of the tokens in each pair of natural language description and code snippet with the mask token.
How can I prepare a dataset for CODEBERT (RTD, INIT=ROBERTA)? It seems that for RTD I need to prepare a unimodal dataset (train.txt, test.txt, valid.txt) with code snippets and an empty doc_string? Should I use the 'pretrained_codebert' model for fine-tuning? And do I need to use the same fine-tuning script, or is there another script for this purpose?
For 'CODEBERT (MLM+RTD, INIT=ROBERTA)', should I use the pretrained model named 'pretrained_codebert (MLM)' or 'pretrained_codebert'? How can I prepare a dataset for this purpose?
In Section 4.2 on NL-PL probing, it is mentioned that to evaluate on the NL side, the input includes the complete code and a masked NL documentation; similarly, to evaluate on the PL side, the input includes the complete NL documentation and masked code. However, in the following script, there is no place to provide both inputs. How can I provide both?
from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(CODE)
print(outputs)
Please let me know your advice and guidance.
Thanks for the clarification.
Can you please let me know about the following concerns:
CODE = "conditional statement <CODESPLIT> if (x is not None) <mask> (x>1)"
from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

NL = "Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored."
PL = "@Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.<mask>(timeGradient, nextTimeGradient); } } return timeGradient; }"

CODE = NL + " " + PL
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(CODE)
print(outputs)
Outputs
```python
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.max(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.9246102571487427, 'token': 29459}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math. max(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.035343579947948456, 'token': 19220}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.Max(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.013716962188482285, 'token': 19854}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.min(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.009721478447318077, 'token': 4691}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.MAX(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.005634027533233166, 'token': 30187}
```
Thanks for the clarification and kind cooperation.
Can you please let me know about the following concerns?
In my dataset, there are around 10K positive and 10K negative instances, all from a single programming language; the test set contains 1K positive and 1K negative instances. When I fine-tune the 'CODEBERT (MLM+RTD, INIT=ROBERTA)' model on my own dataset (a binary classification problem), the best fine-tuned model reaches an accuracy (acc) of 0.99 on the validation set, but when I compute MRR on the test set (originally 1K instances), I get a value of 15%, which is not very good. Which hyper-parameters should I tune to get a better MRR, or is more data necessary to reach a good MRR value?
I am looking for a code snippet in which, when I pass an NL query, it searches and returns the top-k related code snippets; similarly, when I pass a code snippet, it searches and returns the top-k related NL descriptions.
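One way to sketch such a retrieval step, assuming the queries and candidates have already been encoded into vectors (e.g. with a fine-tuned CodeBERT encoder; the encoding step is omitted here and the vectors below are placeholders, not real embeddings):

```python
import numpy as np

def top_k(query_vec, candidate_vecs, k=3):
    """Return indices of the k candidates most similar to the query,
    ranked by cosine similarity. Works in both directions: the candidates
    can be code embeddings for an NL query, or NL embeddings for a code query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each candidate to the query
    return np.argsort(-sims)[:k]      # indices of the k highest similarities

# Placeholder 2-d embeddings; in practice these come from the encoder.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
print(top_k(query, candidates, k=2))
```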
Hi,
I have constructed a new dataset [train.txt, test.txt, valid.txt] with the following format:
1<CODESPLIT>URL<CODESPLIT>returnType.methodName<CODESPLIT>[docString]<CODESPLIT>[code]
I have placed constant values such as "1", "URL", and "returnType.methodName" for the whole dataset. When I run the following script, I get results such as [acc = 1.0, acc_and_f1 = 1.0, and f1 = 1.0]:
Following are the learning rate and loss graphs:
However, when I run the following two scripts, I get an MRR of 0.0031. I am not sure why. Why is the MRR value so low?
python CodeBERT/codesearch/mrr.py
Secondly, does Table 2 in the paper represent MRR values generated from the above scripts?
Finally, what is the difference between the jsonl and text file formats? I guess the jsonl files are used in the documentation generation experiments? For this purpose, I construct jsonl files with the same data, in the format below. Only code_tokens and docstring_tokens contain the token lists of the code snippet and natural language description. Is this the right approach?
`{"repo": "", "path": "", "func_name": "", "original_string": "", "language": "lang", "code": "", "code_tokens": [], "docstring": "", "docstring_tokens": [], "sha": "", "url": "", "partition": ""}`

Kindly let me know about my concerns.
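For illustration, a record in that schema can be built and serialized like this (field values are placeholders; only code_tokens and docstring_tokens are populated, as described above — a sketch, not the repo's actual preprocessing code):

```python
import json

def make_record(code_tokens, docstring_tokens, language="java"):
    """Build one jsonl record in the CodeSearchNet-style schema,
    populating only the token fields and leaving the rest empty."""
    return {
        "repo": "", "path": "", "func_name": "", "original_string": "",
        "language": language, "code": "",
        "code_tokens": code_tokens,
        "docstring": "", "docstring_tokens": docstring_tokens,
        "sha": "", "url": "", "partition": "",
    }

record = make_record(["return", "x", ";"], ["returns", "x"])
line = json.dumps(record)  # one JSON object per line in the .jsonl file
print(line)
```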