Closed frede791 closed 1 year ago
Do you mean you encounter some errors during fine-tuning or the reproduced numbers are incorrect?
I can run fine-tuning just fine but the reproduced numbers I get do not match. This happens using the exact same hyperparameters specified in the readme.
For code search, we have two settings. One is reported in the CodeBERT paper; the other is reported in later papers such as GraphCodeBERT and UniXcoder. If you mean the setting from the CodeBERT paper, the reproduced numbers should be almost the same as in the paper. Can you share the numbers you get?
I am referring to the first one. I have run it for JavaScript and Go. The scores I get are:
go mrr: 0.0074802490516740604
javascript mrr: 0.0031123885931150517
These do not line up with the scores reported for the individual languages in the paper.
It seems that CodeBERT was not actually fine-tuned. Please check whether the training loss looks normal. Also check that you reload the fine-tuned model for inference.
I don't quite understand: are you saying that fine-tuning and inference have to run in one single call of run_classifier.py? I assume that by reloading the model you mean passing the --pred_model_dir parameter? I have done this, as mentioned in the readme.
You need to fine-tune the model following this command, and then check whether the training loss is normal. From your reproduced numbers, it seems that the model was not fine-tuned.
Finally, you can run inference using this command. You need to reload the fine-tuned model instead of microsoft/codebert-base.
I've checked the training scores and they seem to be in order. So using the --model_name_or_path flag and passing ./models/$lang should work for inference?
Use --pred_model_dir. Your reproduced results look as if the predictions were random guesses.
Oh, but that flag is already specified in the inference command in the readme. Below are the exact parameters I've used:
for fine-tuning:
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../codebert_data/codesearch/train_valid/$lang \
--output_dir ./models/$lang \
--model_name_or_path $pretrained_model
for inference:
--model_type roberta \
--model_name_or_path $pretrained_model \
--task_name codesearch \
--do_predict \
--output_dir ./models/$lang \
--data_dir ../codebert_data/codesearch/test/$lang \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--test_file test.txt \
--pred_model_dir ./models/$lang/checkpoint-best/ \
--test_result_dir ./results/$lang/${idx}_batch_result.txt
where $pretrained_model is defined before as microsoft/codebert-base and $lang as one of the available languages. Also data directories have been adjusted to my local path but I've double-checked that they are in fact stored correctly.
Yes, the parameters are correct. I also don't know what the problem is. I suggest checking whether the training loss and the prediction scores are normal, because these results look worse than random guessing.
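For reference, the MRR that a uniformly random ranking would produce can be worked out directly. The sketch below assumes each query is ranked against 1000 candidates (an assumption suggested by the `1000000it` test logs later in this thread, read as 1000 x 1000 pairs) and that there are no ties; `expected_random_mrr` is a hypothetical helper, not part of the repository.

```python
def expected_random_mrr(num_candidates: int) -> float:
    """Expected reciprocal rank when the correct item is equally likely
    to land at any rank from 1 to num_candidates (no ties)."""
    harmonic = sum(1.0 / rank for rank in range(1, num_candidates + 1))
    return harmonic / num_candidates


if __name__ == "__main__":
    # Assumed batch size of 1000 candidates per query.
    print(f"{expected_random_mrr(1000):.6f}")
```

Under these assumptions the baseline comes out at roughly 0.0075, which is the level of the Go score reported above; the JavaScript score is below it.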
Yes these are normal. This suggests that the problem lies with the inference.
Hello, I tried another test where I ran the inference step together with training. To do this I moved the test.txt file into the same folder as the training and validation sets and then ran run_classifier.py with both steps rolled into one.
here is the command I used:
lang=java #fine-tuning a language-specific model for each programming language
pretrained_model=microsoft/codebert-base #Roberta: roberta-base
idx=0
python run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../codebert_data/codesearch/train_valid/$lang \
--output_dir ./models/$lang \
--model_name_or_path $pretrained_model \
--do_predict \
--test_result_dir ./results/$lang/${idx}_batch_result.txt \
--test_file test.txt \
--pred_model_dir ./models/$lang/checkpoint-best/ \
--test_file test.txt \
However, the scores are still not improving, suggesting there is an issue with how the model is loaded prior to inference. The scores are:
javascript mrr: 0.0031123885931150517
go mrr: 0.004439858735040062
java mrr: 0.0026782066123630685
I've also encountered this problem. In the training and validation stages, the accuracy and F1 score look good. However, in the test stage, the F1 score is 0.00!
Testing: 1000000it [00:13, 71604.13it/s]
acc = 0.974115
acc_and_f1 = 0.4870575
f1 = 0.0
Do you have any solution now? WAITING ONLINE NERVOUSLY.
Hi @Mr-Loevan @frede791,
Hello @fengzhangyin, assuming there is NO problem in the training phase, I ran inference on batches 2 and 3 of java. The results are as follows:
02/12/2023 14:55:40 - INFO - __main__ - ***** Output test results *****
Testing: 1000000it [00:09, 108864.08it/s]
acc = 0.958688
acc_and_f1 = 0.479344
f1 = 0.0
02/12/2023 16:07:45 - INFO - __main__ - ***** Output test results *****
Testing: 1000000it [00:11, 86637.52it/s]
acc = 0.953164
acc_and_f1 = 0.476582
f1 = 0.0
Then call mrr.py to calculate the MRR score:
./results/java/2_batch_result.txt
./results/java/3_batch_result.txt
java mrr: 0.005088787541109195
java mrr: 0.005088787541109195
The results are not as expected. Could you please try to replicate this confusing problem, if possible?
I have done experiments on ruby and got normal results. I will repeat the experiment on java.
I repeated the experiment on java and got the following results:
./results/java/2_batch_result.txt
./results/java/3_batch_result.txt
java mrr: 0.7265479698286462
java mrr: 0.7265479698286462
I execute the following training script on two GPUs:
lang=java #fine-tuning a language-specific model for each programming language
pretrained_model=microsoft/codebert-base #Roberta: roberta-base
python3 run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../data/codesearch/train_valid/$lang \
--output_dir ./models/$lang \
--model_name_or_path $pretrained_model
These are the evaluation results during the training phase:
evaluate 0
acc = 0.8157233730223454
acc_and_f1 = 0.8195270500764581
f1 = 0.8233307271305708
evaluate 1
acc = 0.8222475941934432
acc_and_f1 = 0.829422294712699
f1 = 0.836596995231955
evaluate 2
acc = 0.8231936062632523
acc_and_f1 = 0.8270308666652966
f1 = 0.8308681270673408
evaluate 3
acc = 0.821236339911923
acc_and_f1 = 0.8268004887078735
f1 = 0.8323646375038238
evaluate 4
acc = 0.818300440384929
acc_and_f1 = 0.8250112473827969
f1 = 0.8317220543806647
evaluate 5
acc = 0.8194748001957266
acc_and_f1 = 0.824230601581176
f1 = 0.8289864029666255
evaluate 6
acc = 0.8161474473984668
acc_and_f1 = 0.8210608168876481
f1 = 0.8259741863768295
evaluate 7
acc = 0.8142554232588485
acc_and_f1 = 0.8180308849410597
f1 = 0.8218063466232709
evaluate ./models/java/checkpoint-best
acc = 0.8231936062632523
acc_and_f1 = 0.8270308666652966
f1 = 0.8308681270673408
evaluate ./models/java/checkpoint-last
acc = 0.8142554232588485
acc_and_f1 = 0.8180308849410597
f1 = 0.8218063466232709
evaluate ./models/java
acc = 0.8142554232588485
acc_and_f1 = 0.8180308849410597
f1 = 0.8218063466232709
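The log above suggests that checkpoint-best corresponds to the evaluation with the highest accuracy rather than the highest F1, and that checkpoint-last and ./models/java both correspond to the final evaluation. A small sketch to check this reading, with the values copied from the log and rounded to six decimals; the selection rule is inferred from these numbers and not confirmed against run_classifier.py.

```python
# (acc, f1) per evaluation index, copied from the log and rounded.
evals = {
    0: (0.815723, 0.823331),
    1: (0.822248, 0.836597),
    2: (0.823194, 0.830868),
    3: (0.821236, 0.832365),
    4: (0.818300, 0.831722),
    5: (0.819475, 0.828986),
    6: (0.816147, 0.825974),
    7: (0.814255, 0.821806),
}

# Evaluation index that maximizes each metric.
best_by_acc = max(evals, key=lambda k: evals[k][0])
best_by_f1 = max(evals, key=lambda k: evals[k][1])

print("highest acc at evaluation", best_by_acc)
print("highest f1 at evaluation", best_by_f1)
```

The two metrics pick different evaluations here, and the checkpoint-best numbers in the log match the one with the highest accuracy.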
I execute the following inference script on a single GPU:
lang=java #programming language
idx=$1 #test batch idx
python3 run_classifier.py \
--model_type roberta \
--model_name_or_path microsoft/codebert-base \
--task_name codesearch \
--do_predict \
--output_dir ./models/$lang \
--data_dir ../data/codesearch/test/$lang \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--test_file batch_${idx}.txt \
--pred_model_dir ./models/$lang/checkpoint-best/ \
--test_result_dir ./results/$lang/${idx}_batch_result.txt
Then I call mrr.py and get the result of 0.7265 for the test batch 2&3 of java.
Thank you very much! I've got normal results now. The results were bad possibly because I had shuffled all the ${idx}_batch_result.txt files.
But I don't know why this impacted MRR so dramatically. And what does the score in mrr.py mean? I know the notion of MRR, but how is the rank calculated from the score?
correct_score = float(batch_data[batch_idx].strip().split('<CODESPLIT>')[-1])
scores = np.array([float(data.strip().split('<CODESPLIT>')[-1]) for data in batch_data])
rank = np.sum(scores >= correct_score)
ranks.append(rank)
And why are there two scores in the results?
1<CODESPLIT>...<CODESPLIT>....<CODESPLIT>....<CODESPLIT>3.311647891998291<CODESPLIT>-3.0937719345092773
Excuse my ignorance, I am a beginner. Thank you for your kindness and patience.
In the test data, the correct answer of the i-th batch is at the i-th position within that batch, and all the other entries are wrong. This is why the order of the lines in the result files matters.
correct_score = float(batch_data[batch_idx].strip().split('<CODESPLIT>')[-1])
is the score of the correct answer.
rank = np.sum(scores >= correct_score)
counts the scores that are greater than or equal to the score of the correct answer (including the correct answer itself), that is, the rank of the correct answer.
NL-PL matching is formalized as a binary classification problem. The first score corresponds to category 0 (no match) and the second score corresponds to category 1 (match).
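The explanation above can be sketched on synthetic data. This is a minimal illustration, not the repository's mrr.py: `match_score`, `batch_mrr` and `make_lines` are hypothetical helpers, the placeholder fields (`query0`, `code0`, `text`) are made up, and the batch size is 4 instead of the real one. It also shows why reordering the result lines hurts MRR: the rank is always computed against the line at position i of the i-th chunk.

```python
import numpy as np

SEP = "<CODESPLIT>"


def match_score(line):
    """Last field of a result line: the score for category 1 (match)."""
    return float(line.strip().split(SEP)[-1])


def batch_mrr(lines, batch_size):
    """MRR over result lines grouped into chunks of batch_size.

    Assumes that in the i-th chunk the correct answer is the i-th line.
    """
    chunks = [lines[k:k + batch_size] for k in range(0, len(lines), batch_size)]
    ranks = []
    for i, chunk in enumerate(chunks):
        correct_score = match_score(chunk[i])
        scores = np.array([match_score(line) for line in chunk])
        # Candidates scoring at least as high as the correct one.
        ranks.append(int(np.sum(scores >= correct_score)))
    return float(np.mean(1.0 / np.array(ranks)))


def make_lines(n):
    """Synthetic results: n queries x n candidates, match on the diagonal."""
    lines = []
    for i in range(n):
        for j in range(n):
            no_match, match = (-5.0, 5.0) if i == j else (5.0, -5.0)
            lines.append(SEP.join(["1", f"query{i}", f"code{j}", "text",
                                   str(no_match), str(match)]))
    return lines


lines = make_lines(4)
rotated = lines[1:] + lines[:1]  # a simple reordering standing in for a shuffle
print("ordered:", batch_mrr(lines, 4))
print("reordered:", batch_mrr(rotated, 4))
```

With the original order every correct answer ranks first; after the reordering most chunks no longer have their matching line at the expected position, so the computed ranks get worse even though the model scores are unchanged.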
The comments from both of you were enlightening. I would like to know: when fine-tuning code search with GraphCodeBERT, is it similar to CodeBERT?
Yes. Just change microsoft/codebert-base to microsoft/graphcodebert-base
Hello again, I've tried to replicate the results using the scripts shown by fengzhangyin; however, I am now getting a different error relating to non-existent paths:
Traceback (most recent call last):
  File "run_classifier.py", line 287, in load_and_cache_examples
    features = torch.load(cached_features_file)
  File "/itet-stor/frede791/net_scratch/codesearchenv6/lib/python3.6/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/itet-stor/frede791/net_scratch/codesearchenv6/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/itet-stor/frede791/net_scratch/codesearchenv6/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '../data/codesearch/test/go/cached_test_batch__pytorch_model.bin_200_codesearch'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "run_classifier.py", line 580, in
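One possible reading of the path in the traceback above is that the batch index was empty: `cached_test_batch__pytorch_model.bin_200_codesearch` has nothing between `batch_` and the next underscore, which is what `batch_${idx}.txt` would produce if no index were passed to the script. This is a guess, not something confirmed in the thread. A minimal pre-flight check under that assumption; `missing_test_inputs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_test_inputs(data_dir, lang, idx):
    """Return a list of problems found before launching inference.

    Assumes test files are named batch_<idx>.txt under <data_dir>/<lang>.
    """
    problems = []
    if str(idx).strip() == "":
        problems.append("batch index is empty; the test file would be 'batch_.txt'")
    test_file = Path(data_dir) / lang / f"batch_{idx}.txt"
    if not test_file.is_file():
        problems.append(f"test file not found: {test_file}")
    return problems


if __name__ == "__main__":
    # Paths as used in the inference script earlier in the thread.
    for problem in missing_test_inputs("../data/codesearch/test", "go", ""):
        print(problem)
```

Running such a check before run_classifier.py would separate a missing or misnamed test file from a problem in the model loading itself.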
Hello,
I am currently trying to reproduce the results you stated for CodeBert codesearch. However, running the instructions found in the respective readme I am unable to replicate any of the scores you found. Is there any additional setup required?