microsoft / CodeBERT


Inability to reproduce CodeBert codesearch results #215

Closed frede791 closed 1 year ago

frede791 commented 1 year ago

Hello,

I am currently trying to reproduce the results you report for CodeBERT code search. However, following the instructions in the respective readme, I am unable to replicate any of the scores. Is there any additional setup required?

guoday commented 1 year ago

Do you mean that you encounter errors during fine-tuning, or that the reproduced numbers are incorrect?

frede791 commented 1 year ago

I can run fine-tuning just fine but the reproduced numbers I get do not match. This happens using the exact same hyperparameters specified in the readme.

guoday commented 1 year ago

For code search, we have two settings. One is reported in the CodeBERT paper. The other is used in later papers such as GraphCodeBERT and UniXcoder. If you are referring to the setting reported in the CodeBERT paper, the reproduced numbers should be almost the same as in the paper. Can you give some of the reproduced numbers you get?

frede791 commented 1 year ago

I am referring to the first one. I have run it for JavaScript and Go. The scores I've received are:

go mrr: 0.0074802490516740604
javascript mrr: 0.0031123885931150517

These do not line up with the scores reported for the individual languages in the paper.

guoday commented 1 year ago

It seems that you didn't fine-tune CodeBERT. You can check the training loss and whether it looks normal. Besides, you need to check whether you reloaded the model after fine-tuning.

frede791 commented 1 year ago

I don't quite understand: are you saying that you have to run fine-tuning and inference in one single call of run_classifier.py? I assume that by reloading the model you mean passing the --pred_model_dir parameter? I have done this as described in the readme.

guoday commented 1 year ago

You need to fine-tune the model following this command. Then you need to check whether the training loss is normal. From your reproduced numbers, it seems that the model was not fine-tuned.

Finally, you can run inference using this command. You need to reload the fine-tuned model instead of microsoft/codebert-base.
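Conceptually, reloading the fine-tuned model amounts to something like the following minimal sketch (the checkpoint path is an assumed example; inside run_classifier.py the directory is supplied through --pred_model_dir):

    from transformers import RobertaForSequenceClassification, RobertaTokenizer

    # assumed example path: the directory written during fine-tuning
    checkpoint_dir = "./models/java/checkpoint-best/"

    tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
    # load the fine-tuned weights, not the original pre-trained checkpoint
    model = RobertaForSequenceClassification.from_pretrained(checkpoint_dir)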

frede791 commented 1 year ago

I've checked the training scores and they seem to be in order. So passing ./models/$lang via the --model_name_or_path flag should work for inference?

guoday commented 1 year ago

Use --pred_model_dir. Your reproduced results suggest that the predictions are random guesses.

frede791 commented 1 year ago

Oh, but this is what's specified for inference in the readme. Below are the exact parameters I've used:

for fine-tuning:

--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../codebert_data/codesearch/train_valid/$lang \
--output_dir ./models/$lang \
--model_name_or_path $pretrained_model

for inference:

--model_type roberta \
--model_name_or_path $pretrained_model \
--task_name codesearch \
--do_predict \
--output_dir ./models/$lang \
--data_dir ../codebert_data/codesearch/test/$lang \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--test_file test.txt \
--pred_model_dir ./models/$lang/checkpoint-best/ \
--test_result_dir ./results/$lang/${idx}_batch_result.txt

where $pretrained_model is defined beforehand as microsoft/codebert-base and $lang as one of the available languages. Also, the data directories have been adjusted to my local paths, but I've double-checked that the files are in fact stored there.

guoday commented 1 year ago

Yes, the parameters are correct. I also don't know what the problem is. I suggest you check whether the training loss and prediction scores are normal, because these results look worse than random guessing.

frede791 commented 1 year ago

Yes these are normal. This suggests that the problem lies with the inference.

frede791 commented 1 year ago

Hello, I tried another test where I ran the inference step together with training. To do this I moved the test.txt file into the same folder as the training and validation sets and then ran run_classifier.py with both steps rolled into one.

here is the command I used:

lang=java #fine-tuning a language-specific model for each programming language
pretrained_model=microsoft/codebert-base #Roberta: roberta-base
idx=0

python run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../codebert_data/codesearch/train_valid/$lang \
--output_dir ./models/$lang \
--model_name_or_path $pretrained_model \
--do_predict \
--test_result_dir ./results/$lang/${idx}_batch_result.txt \
--test_file test.txt \
--pred_model_dir ./models/$lang/checkpoint-best/

However, the scores are still not improving, which suggests there is an issue with how the model is loaded prior to the inference taking place. The scores are:

javascript mrr: 0.0031123885931150517
go mrr: 0.004439858735040062
java mrr: 0.0026782066123630685

Mr-Loevan commented 1 year ago

> Hello, I tried another test where I ran the inference step together with training. […] However, the scores are still not improving […]

I've also encountered this problem. In the train and valid stages, the accuracy and F1 score look good. However, in the test stage, the F1 score is 0.0!

Testing: 1000000it [00:13, 71604.13it/s]
acc = 0.974115
acc_and_f1 = 0.4870575
f1 = 0.0

Do you have any solution now? WAITING ONLINE NERVOUSLY.

fengzhangyin commented 1 year ago

Hi @Mr-Loevan @frede791,

  1. I speculate that the problems occur only in the testing phase, because the loss shows that your training phase is normal.
  2. What I want to emphasize is that the data formats of the training phase and the test phase are different, and the evaluation methods are also different. In the training phase, we perform binary classification on NL-PL pairs and use the F1 score as the evaluation metric. In the test phase, we follow the official evaluation metric and calculate the Mean Reciprocal Rank (MRR) for each pair of test data (c, w) over a fixed set of 999 distractor codes (see the sketch after this list).
  3. In order to get the correct result, taking the Java language as an example, three steps are required:
    1. Fine-tune the model on the Java training data.
    2. Perform inference on all test batches of Java.
    3. Call mrr.py to calculate the MRR score.
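As a rough illustration of steps 2 and 3, this is what the MRR computation over one inference result file amounts to. It is a minimal sketch based on the description above and on the rank logic quoted later in this thread from mrr.py; the result file name and the group size of 1000 candidates (the correct pair plus 999 distractors) are assumptions for illustration:

    import numpy as np

    # assumed example: one inference result file produced by run_classifier.py,
    # with the class-1 ("match") score as the last <CODESPLIT>-separated field
    result_file = "./results/java/0_batch_result.txt"
    group_size = 1000  # 1 correct pair + 999 distractors per query

    with open(result_file, encoding="utf-8") as f:
        lines = f.readlines()

    ranks = []
    for start in range(0, len(lines), group_size):
        group_idx = start // group_size
        batch_data = lines[start:start + group_size]
        # the correct answer of the group_idx-th query sits at position group_idx
        correct_score = float(batch_data[group_idx].strip().split('<CODESPLIT>')[-1])
        scores = np.array([float(d.strip().split('<CODESPLIT>')[-1]) for d in batch_data])
        rank = np.sum(scores >= correct_score)  # 1-based rank of the correct answer
        ranks.append(rank)

    print("mrr:", float(np.mean(1.0 / np.array(ranks))))
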
Mr-Loevan commented 1 year ago

Hello @fengzhangyin, supposing there is NO problem in the training phase, I performed inference on batches 2 & 3 of java. Results as follows:

02/12/2023 14:55:40 - INFO - __main__ -   ***** Output test results *****
Testing: 1000000it [00:09, 108864.08it/s]
acc = 0.958688
acc_and_f1 = 0.479344
f1 = 0.0
02/12/2023 16:07:45 - INFO - __main__ -   ***** Output test results *****
Testing: 1000000it [00:11, 86637.52it/s]
acc = 0.953164
acc_and_f1 = 0.476582
f1 = 0.0

Then call mrr.py to calculate the MRR score:

./results/java/2_batch_result.txt
./results/java/3_batch_result.txt
java mrr: 0.005088787541109195
java mrr: 0.005088787541109195

The results are not as expected. Could you please try to replicate this confusing problem if possible?

fengzhangyin commented 1 year ago

I have done experiments on ruby and got normal results. I will repeat the experiment on java.

fengzhangyin commented 1 year ago

I repeated the experiment on java and got the following results:

./results/java/2_batch_result.txt
./results/java/3_batch_result.txt
java mrr: 0.7265479698286462
java mrr: 0.7265479698286462

I execute the following training script on two GPUs:

lang=java #fine-tuning a language-specific model for each programming language 
pretrained_model=microsoft/codebert-base  #Roberta: roberta-base

python3 run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../data/codesearch/train_valid/$lang \
--output_dir ./models/$lang  \
--model_name_or_path $pretrained_model

These are the evaluation results during the training phase:

evaluate 0
acc = 0.8157233730223454
acc_and_f1 = 0.8195270500764581
f1 = 0.8233307271305708
evaluate 1
acc = 0.8222475941934432
acc_and_f1 = 0.829422294712699
f1 = 0.836596995231955
evaluate 2
acc = 0.8231936062632523
acc_and_f1 = 0.8270308666652966
f1 = 0.8308681270673408
evaluate 3
acc = 0.821236339911923
acc_and_f1 = 0.8268004887078735
f1 = 0.8323646375038238
evaluate 4
acc = 0.818300440384929
acc_and_f1 = 0.8250112473827969
f1 = 0.8317220543806647
evaluate 5
acc = 0.8194748001957266
acc_and_f1 = 0.824230601581176
f1 = 0.8289864029666255
evaluate 6
acc = 0.8161474473984668
acc_and_f1 = 0.8210608168876481
f1 = 0.8259741863768295
evaluate 7
acc = 0.8142554232588485
acc_and_f1 = 0.8180308849410597
f1 = 0.8218063466232709
evaluate ./models/java/checkpoint-best
acc = 0.8231936062632523
acc_and_f1 = 0.8270308666652966
f1 = 0.8308681270673408
evaluate ./models/java/checkpoint-last
acc = 0.8142554232588485
acc_and_f1 = 0.8180308849410597
f1 = 0.8218063466232709
evaluate ./models/java
acc = 0.8142554232588485
acc_and_f1 = 0.8180308849410597
f1 = 0.8218063466232709

I execute the following inference script on a single GPU:

lang=java #programming language
idx=$1 #test batch idx

python3 run_classifier.py \
--model_type roberta \
--model_name_or_path microsoft/codebert-base \
--task_name codesearch \
--do_predict \
--output_dir ./models/$lang \
--data_dir ../data/codesearch/test/$lang \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--test_file batch_${idx}.txt \
--pred_model_dir ./models/$lang/checkpoint-best/ \
--test_result_dir ./results/$lang/${idx}_batch_result.txt

Then I call mrr.py and get a result of 0.7265 for test batches 2 & 3 of java.

Mr-Loevan commented 1 year ago

Thank you very much!!! I've now got normal results. The results were bad, possibly because I shuffled all the ${idx}_batch_result.txt files.

But I don't know why this dramatically impacted the MRR. And what does the score in mrr.py mean? I know the notion of MRR, but how is the rank calculated from the score?

    correct_score = float(batch_data[batch_idx].strip().split('<CODESPLIT>')[-1])
    scores = np.array([float(data.strip().split('<CODESPLIT>')[-1]) for data in batch_data])
    rank = np.sum(scores >= correct_score)
    ranks.append(rank)

And why are there two scores in the results?

1<CODESPLIT>...<CODESPLIT>....<CODESPLIT>....<CODESPLIT>3.311647891998291<CODESPLIT>-3.0937719345092773

Excuse my ignorance, I am a beginner. Thank you for your kindness and patience.

fengzhangyin commented 1 year ago

In the test data, the correct answer of the i-th batch is at the i-th position, and all the rest are wrong. correct_score = float(batch_data[batch_idx].strip().split('<CODESPLIT>')[-1]) is the score of the correct answer. rank = np.sum(scores >= correct_score) counts the number of scores that are at least as high as the correct answer's score (including the correct answer itself), i.e. the 1-based rank of the correct answer.
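As a concrete toy example (made-up scores, purely for illustration):

    import numpy as np

    # five candidate scores; suppose the correct answer is at position 2 (score 0.7)
    scores = np.array([0.1, 0.9, 0.7, 0.3, 0.8])
    correct_score = scores[2]
    rank = np.sum(scores >= correct_score)  # three candidates score >= 0.7, so rank = 3
    reciprocal_rank = 1.0 / rank            # this query contributes 1/3 to the MRR average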

NL-PL matching is formalized as a binary classification problem. The first score corresponds to category 0 (no match) and the second score corresponds to category 1 (match).
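For instance, taking the example line above (a sketch; the '...' parts stand for the elided text columns of the result line):

    # the last two <CODESPLIT>-separated fields are the scores for class 0 (no match)
    # and class 1 (match); mrr.py reads the last field, i.e. the match score
    line = "1<CODESPLIT>...<CODESPLIT>...<CODESPLIT>...<CODESPLIT>3.311647891998291<CODESPLIT>-3.0937719345092773"
    score_no_match, score_match = map(float, line.strip().split('<CODESPLIT>')[-2:])
    # here score_no_match = 3.3116... and score_match = -3.0938...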

wangsiqidahaoren commented 1 year ago

The comments of both of you made me feel enlightened. I want to know: when fine-tuning code search with GraphCodeBERT, is it similar to CodeBERT?

guoday commented 1 year ago

> I want to know: when fine-tuning code search with GraphCodeBERT, is it similar to CodeBERT?

Yes. Just change microsoft/codebert-base to microsoft/graphcodebert-base

frede791 commented 1 year ago

Hello again, I've tried to replicate the results using the scripts exactly as shown by fengzhangyin; however, I am now getting a different error relating to non-existent paths:

Traceback (most recent call last):
  File "run_classifier.py", line 287, in load_and_cache_examples
    features = torch.load(cached_features_file)
  File "/itet-stor/frede791/net_scratch/codesearchenv6/lib/python3.6/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/itet-stor/frede791/net_scratch/codesearchenv6/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/itet-stor/frede791/net_scratch/codesearchenv6/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '../data/codesearch/test/go/cached_test_batch__pytorch_model.bin_200_codesearch'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_classifier.py", line 580, in <module>
    main()
  File "run_classifier.py", line 575, in main
    evaluate(args, model, tokenizer, checkpoint=None, prefix='', mode='test')
  File "run_classifier.py", line 192, in evaluate
    eval_dataset, instances = load_and_cache_examples(args, eval_task, tokenizer, ttype='test')
  File "run_classifier.py", line 298, in load_and_cache_examples
    examples, instances = processor.get_test_examples(args.data_dir, args.test_file)
  File "/usr/itetnas04/data-scratch-01/frede791/data/codebert_master/utils.py", line 93, in get_test_examples
    self._read_tsv(os.path.join(data_dir, test_file)), "test")
  File "/usr/itetnas04/data-scratch-01/frede791/data/codebert_master/utils.py", line 64, in _read_tsv
    with open(inputfile, "r", encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/codesearch/test/go/batch.txt'