hmdgit closed this issue 4 years ago.
Hi, I guess there is a problem in the data preprocessing stage.
Thanks for your response and clarification.
I have created a dataset by extracting code methods along with their accompanying comments. The comments serve as each method's natural language description (NLD). I then randomly distribute all the code/NLD pairs into training, validation, and testing sets. Finally, I obtain train.txt, valid.txt, and test.txt files in the format below, whose fields were described previously:
1<CODESPLIT>URL<CODESPLIT>returnType.methodName<CODESPLIT>[docString]<CODESPLIT>[code]
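For reference, one line in this format can be assembled with a simple join; all field values below are illustrative placeholders, not real data:

```python
# Assemble one line of a train/valid/test file in the <CODESPLIT> format.
# Every field value here is a placeholder.
fields = [
    "1",                                               # label (1 = positive pair)
    "https://example.com/repo",                        # URL placeholder
    "String.getName",                                  # returnType.methodName placeholder
    "Returns the user name.",                          # docstring (NLD)
    "public String getName() { return this.name; }",  # code
]
line = "<CODESPLIT>".join(fields)
print(line)
```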
I don't have a clear idea about positive and negative sampling. How should I split my dataset into positive and negative samples, and what is the purpose of this kind of sampling? I was also unable to understand this step in the paper. Do I need to randomly assign labels 1 and 0 to each instance and make sure that the training, validation, and testing sets each have balanced positive and negative samples?
Secondly, I have created test data in the same format as the training and validation sets. Will it work, or do I need to perform some other steps?
Please let me know your advice and guidance.
In this fine-tuning step, we learn representations of code and natural language (NL) through a binary classification task, so the dataset should contain both positive examples (code and NL from the same instance) and negative examples (code and NL from different instances).
The dataset you created contains only positive examples (each instance is denoted as (c, w)). We can randomly replace the code or the NL to construct negative examples. In our setting, the negative samples consist of a balanced number of instances with randomly replaced NL (i.e. (c, wˆ)) and randomly replaced code (i.e. (cˆ, w)).
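A minimal sketch of this balanced negative sampling, assuming each instance is a (code, nl) tuple (the function and variable names are mine, not from the repo's scripts):

```python
import random

def build_examples(pairs, seed=0):
    """Given positive (code, nl) pairs, return labeled positives plus an
    equal number of negatives: half with replaced NL, half with replaced code."""
    rng = random.Random(seed)
    examples = [(code, nl, 1) for code, nl in pairs]  # positives, label 1
    half = len(pairs) // 2
    for i, (code, nl) in enumerate(pairs):
        # pick a *different* instance to take the replacement NL or code from
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        if i < half:
            examples.append((code, pairs[j][1], 0))  # (c, w^): replaced NL
        else:
            examples.append((pairs[j][0], nl, 0))    # (c^, w): replaced code
    rng.shuffle(examples)
    return examples
```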
If you create test data in the same format as the training and validation sets, you can only get classification accuracy. Maybe that's enough for you. We need to calculate MRR to be consistent with the baselines, so we created a test set in accordance with them. You can decide for yourself whether to keep the data format consistent.
Thanks for the clarification.
I now understand the purpose of sampling. I take all the real pairs of code snippets and natural language descriptions (NLDs) and label them as positive [1]. Then I shuffle the NLDs, attach them one by one to the code snippets, and label those instances as negative [0]. Does this look fine for building the training and validation datasets?
I have gone through the process_data.py file, and it seems to me that the test data does not contain any negative samples. Am I right?
When I create a test file in the specified format ("test_0.jsonl"), it contains only around 1,000 data instances. But when I run the following script:
python process_data.py
it creates a roughly 2 GB text file named batch_0.txt. I am confused: why do so many instances in batch_0.txt come from test_0.jsonl?
I can categorize my data into multiple classes and label them as [0, 1, 2, 3]. I have changed ["0", "1"] in the utils.py file to ["0", "1", "2", "3"], but issues are coming up. Can the CodeBERT model be fine-tuned for a multi-class classification problem? If yes, what changes do I need to make in the code?
The CodeBERT paper mentions that MRR is calculated over 999 distractor codes, but I could not find this logic in the code.
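For context, MRR over 999 distractors means: for each query, score the correct code together with 999 other candidates, find the rank of the correct one, and average the reciprocal ranks. A minimal sketch (names are mine; the repo's mrr.py computes this from the classifier's scores):

```python
def mean_reciprocal_rank(score_lists):
    """score_lists: one list of scores per query, where index 0 is the
    correct candidate and the rest are distractors (e.g. 999 of them).
    Higher score = more relevant."""
    total = 0.0
    for scores in score_lists:
        correct = scores[0]
        # rank = 1 + number of distractors scored strictly higher than the answer
        rank = 1 + sum(1 for s in scores[1:] if s > correct)
        total += 1.0 / rank
    return total / len(score_lists)

# Two queries: the first is ranked 1st, the second 2nd -> (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([[0.9, 0.1], [0.2, 0.8]]))
```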
Can you please clarify all the above points and let me know about your kind feedback?
Thanks for the clarification and guidance. When I fine-tune for multi-class classification, I get the following error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
However, when I change it to the following, CodeBERT successfully fine-tunes on the multi-class classification problem:
f1 = f1_score(y_true=labels, y_pred=preds,average='weighted')
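As a small standalone check: sklearn's f1_score indeed raises that ValueError for multi-class targets with the default average='binary', and works once average='weighted' is passed (labels and predictions below are made up):

```python
from sklearn.metrics import f1_score

labels = [0, 1, 2, 3, 2, 1]  # four classes
preds  = [0, 1, 2, 2, 2, 0]

# average='weighted' computes per-class F1 and averages it weighted by
# class support, which is well-defined for multi-class targets.
f1 = f1_score(y_true=labels, y_pred=preds, average='weighted')
print(f1)
```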
For the documentation generation task, I am confused about preparing the dataset. I use the real pairs of natural language descriptions and code snippets and place them in jsonl.gz format. It seems there is no place for negative instances in the jsonl.gz format, so I don't use them. Is that right?
In order to fine-tune CODEBERT (MLM, INIT=ROBERTA), I need to create training and validation datasets (train.txt and valid.txt) in which I replace 15% of the tokens in each pair of natural language description and code snippet with the mask token.
How can I prepare a dataset for CODEBERT (RTD, INIT=ROBERTA)? It seems that for RTD I need to prepare a unimodal dataset (train.txt, test.txt, valid.txt) with code snippets and an empty doc_string? Should I use the 'pretrained_codebert' model for fine-tuning? And do I need to use the same fine-tuning script, or is there another script for this purpose?
For 'CODEBERT (MLM+RTD, INIT=ROBERTA)', should I use the pretrained model named 'pretrained_codebert (MLM)' or 'pretrained_codebert'? How can I prepare a dataset for this purpose?
In Section 4.2 on NL-PL probing, it is mentioned that to evaluate on the NL side, the input includes the complete code and a masked NL documentation; similarly, to evaluate on the PL side, the input includes the complete NL documentation and masked code. However, in the following script, there is no place to provide both inputs. How can I provide both?
from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(CODE)
print(outputs)
Please let me know your advice and guidance.
Thanks for the clarification.
Can you please let me know about the following concerns:
CODE = "conditional statement <CODESPLIT> if (x is not None) <mask> (x>1)"
from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

NL = "Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored."
PL = "@Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.<mask>(timeGradient, nextTimeGradient); } } return timeGradient; }"

CODE = NL + " " + PL
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(CODE)
print(outputs)
Outputs
```python
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.max(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.9246102571487427, 'token': 29459}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math. max(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.035343579947948456, 'token': 19220}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.Max(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.013716962188482285, 'token': 19854}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.min(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.009721478447318077, 'token': 4691}
{'sequence': '<s> Calculates the maximum timeGradient of all Terminations. Not supported timeGradients (-1.0) are ignored. @Override public double calculatePhaseTimeGradient(AbstractPhaseScope phaseScope) { double timeGradient = 0.0; for (Termination termination : terminationList) { double nextTimeGradient = termination.calculatePhaseTimeGradient(phaseScope); if (nextTimeGradient >= 0.0) { timeGradient = Math.MAX(timeGradient, nextTimeGradient); } } return timeGradient; }</s>', 'score': 0.005634027533233166, 'token': 30187}
```
Thanks for the clarification and kind cooperation.
Can you please let me know about the following concerns?
In my dataset, there are around 10K positive and 10K negative instances, all from a single programming language; the test set contains 1K positive and 1K negative instances. When I fine-tune the 'CODEBERT (MLM+RTD, INIT=ROBERTA)' model on my own dataset (a binary classification problem), the best fine-tuned model reaches an accuracy (acc) of 0.99 on the validation set, but when I compute MRR on the test set (originally 1K instances), I get a value of 15%, which is not very good. Which hyper-parameters should I tune to get a better MRR, or is more data necessary to reach a good MRR value?
I am looking for a code snippet in which, when I pass an NL query, it searches and returns the top-k related code snippets; similarly, when I pass a code snippet, it searches and returns the top-k related NL descriptions.
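One way to sketch such a retrieval step, assuming the queries and candidates have already been encoded into vectors (e.g. with a fine-tuned CodeBERT encoder; the encoding step is omitted here and the vectors below are placeholders, not real embeddings):

```python
import numpy as np

def top_k(query_vec, candidate_vecs, k=3):
    """Return indices of the k candidates most similar to the query,
    ranked by cosine similarity. Works in both directions: the candidates
    can be code embeddings for an NL query, or NL embeddings for a code query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each candidate to the query
    return np.argsort(-sims)[:k]      # indices of the k highest similarities

# Placeholder 2-d embeddings; in practice these come from the encoder.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
print(top_k(query, candidates, k=2))
```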
Hi,
I have constructed a new dataset [train.txt, test.txt, valid.txt] with the following format:
1<CODESPLIT>URL<CODESPLIT>returnType.methodName<CODESPLIT>[docString]<CODESPLIT>[code]
I have placed constant values such as "1", "URL", and "returnType.methodName" for the whole dataset. When I run the following script, I get results such as [acc = 1.0, acc_and_f1 = 1.0, and f1 = 1.0]:
Following are the learning rate and loss graphs:
However, when I run the following two scripts, I get an MRR of 0.0031. I am not sure why. Why is the MRR value so low?
python CodeBERT/codesearch/mrr.py
Secondly, does Table 2 in the paper represent MRR values generated from the above scripts?
Finally, what is the difference between the jsonl and text file formats? I guess the jsonl files are used in the documentation generation experiments? For this purpose, I construct jsonl files with the same data, in the format below. Only code_tokens and docstring_tokens contain the token lists of the code snippet and natural language description. Is this the right approach?
`{"repo": "", "path": "", "func_name": "", "original_string": "", "language": "lang", "code": "", "code_tokens": [], "docstring": "", "docstring_tokens": [], "sha": "", "url": "", "partition": ""}`

Kindly let me know about my concerns.
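For illustration, a record in that schema can be built and serialized like this (field values are placeholders; only code_tokens and docstring_tokens are populated, as described above — a sketch, not the repo's actual preprocessing code):

```python
import json

def make_record(code_tokens, docstring_tokens, language="java"):
    """Build one jsonl record in the CodeSearchNet-style schema,
    populating only the token fields and leaving the rest empty."""
    return {
        "repo": "", "path": "", "func_name": "", "original_string": "",
        "language": language, "code": "",
        "code_tokens": code_tokens,
        "docstring": "", "docstring_tokens": docstring_tokens,
        "sha": "", "url": "", "partition": "",
    }

record = make_record(["return", "x", ";"], ["returns", "x"])
line = json.dumps(record)  # one JSON object per line in the .jsonl file
print(line)
```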