microsoft / CodeXGLUE

[Code-Code] Defect Prediction --> pretraining with custom datasets #78

Closed mhyeonsoo closed 2 years ago

mhyeonsoo commented 3 years ago

Hi, thanks for sharing this great resource.

I am trying to train this model on my own dataset so that it can perform a slightly different task. Specifically, I want to pretrain the RoBERTa model on my own dataset using the command given in the README.

Due to limited GPU resources, I halved the batch size for both training and evaluation. However, after training, when I try to run inference, I get the warnings below.

Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaForSequenceClassification: ['pooler.dense.bias', 'pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

The last warning, about the classifier output layers' weights being re-initialized, looks problematic, because the confidence scores for good and bad inputs end up almost identical.

Could you tell me whether there is a problem in my process, or whether these warnings are expected? Thank you.

guoday commented 3 years ago

You need to fine-tune the model on the defect prediction dataset. You can follow this readme to perform defect prediction. When evaluating, the fine-tuned model will be reloaded here.

mhyeonsoo commented 3 years ago

Thanks, I have looked through and followed the readme you mentioned, but I am still facing errors.

My question is: at which point can I use my own dataset? For now, I converted my data into JSON files and trained with the command below.

python3 code/run.py \
    --output_dir=${output_dir} \
    --model_type=roberta \
    --tokenizer_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --do_train=True \
    --do_eval=True \
    --train_data_file=${train_data_file} \
    --eval_data_file=${eval_data_file} \
    --test_data_file=${test_data_file} \
    --epoch 10 \
    --block_size 400 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training=True \
    --seed 123456  2>&1 | tee ${logfile}

and it prints the warnings I mentioned above.

Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaForSequenceClassification: ['pooler.dense.bias', 'pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']

     

And when I specify the model checkpoint that I trained on the custom dataset, like below,

python code/run.py \
    --output_dir=${output_dir} \
    --model_type=roberta \
    --tokenizer_name=microsoft/codebert-base \
    --model_name_or_path=${output_dir}/checkpoint-best-acc/model.bin \
    --do_eval=False \
    --do_test=True \
    --train_data_file=${train_data_file} \
    --eval_data_file=${eval_data_file} \
    --test_data_file=${specific_commit_file} \
    --epoch 5 \
    --block_size 400 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training=True \
    --seed 123456 2>&1 | tee ${log_file}

it raises a codec error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

At which step should I use the defect prediction dataset for fine-tuning?

Thanks again,

mhyeonsoo commented 3 years ago

I am adding a comment after experimenting with the model trained on my custom data.

When I ran the test script with the following command,

python3 code/run.py \
    --output_dir=${output_dir} \
    --model_type=roberta \
    --tokenizer_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --do_test=True \
    --train_data_file=${train_data_file} \
    --eval_data_file=${eval_data_file} \
    --test_data_file=${specific_commit_file} \
    --epoch 5 \
    --block_size 400 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training=True \
    --seed 123456 2>&1 | tee ${log_file}

the 'logits' values are all very small, like below:

array([[0.06392875],
       [0.06392875]], dtype=float32)

It seems like the model was not trained well.

Is there any solution for this?

Thanks,

guoday commented 3 years ago

> (quoting @mhyeonsoo's previous comment in full)

Sorry for the late reply; I did not receive any e-mail notification about this. Regarding this problem, you may need to check the encoding of your dataset. You can also pass encoding="utf-8" when loading the dataset.
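For instance, a minimal sketch (the file name train.jsonl is just a placeholder) that reads the jsonl dataset line by line and reports exactly where the non-UTF-8 bytes are, instead of crashing inside run.py:

# Minimal sketch: locate non-UTF-8 bytes in a jsonl dataset before passing it
# to run.py. "train.jsonl" is a placeholder for your own file.
with open("train.jsonl", "rb") as f:
    for lineno, raw in enumerate(f, 1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print(f"line {lineno}: {exc}")  # re-encode or drop this line

If the file does contain non-UTF-8 bytes, re-encoding it (for example with iconv, or by decoding with errors="replace") before training or testing should clear the UnicodeDecodeError.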

guoday commented 3 years ago

> (quoting @mhyeonsoo's previous comment in full)

Can you share your training log? You may need to check whether the training loss is normal.

mhyeonsoo commented 3 years ago

Thanks for the response.

  1. For the encoding issue, I am converting C++ source code lines gathered with the 'git blame' command into jsonl directly. This may or may not be causing the encoding error; it looked fine to me, but I will check again. Here is one example record from my training data (a sketch of the conversion is at the end of this comment).
{"project": "gitrisky", "commit_id": ["4213dd12a", "0412da23a", "51d301a22"], "target": 0, "func": "-0400 11) static const Temperature_t maxProbeTemperature = 200; -0400 12) static const Temperature_t defaultProbeTemperature = 150; -0400 13) static const Temperature_t minProbeTemperature = 100; -0400 14)  -0400 34)    instance->interface.maxTemperature = maxProbeTemperature; -0400 35)    instance->interface.defaultTemperature = defaultProbeTemperature; -0400 36)    instance->interface.minTemperature = minProbeTemperature; ", "idx": 18596}
  2. For the model itself, it seems the model overfitted.

    But I don't think overfitting alone would completely ruin inference to the point where the logits are meaningless, as above. I am attaching part of the training log here.

Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaForSequenceClassification: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
10/25/2021 14:25:47 - INFO - __main__ -   *** Example ***
10/25/2021 14:25:47 - INFO - __main__ -   idx: 0
10/25/2021 14:25:47 - INFO - __main__ -   label: 0
10/25/2021 14:25:47 - INFO - __main__ -   input_tokens: ['<s>', '-', '04', '00', '_20', ')', '_-', '04', '00', '_153', ')', '_||', '_(', 'data', 'To', 'Return', 'On', 'Read', '[', 'index', ']', '_>', '_inputs', '[', 'index', '].', 'max', 'Value', '))', '_break', ';', '</s>']
10/25/2021 14:25:47 - INFO - __main__ -   input_ids: 0 12 3387 612 291 43 111 3387 612 25758 43 45056 36 23687 3972 42555 4148 25439 10975 18480 742 8061 16584 10975 18480 8174 29459 33977 35122 1108 131 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10/25/2021 14:25:47 - INFO - __main__ -   *** Example ***
10/25/2021 14:25:47 - INFO - __main__ -   idx: 1
10/25/2021 14:25:47 - INFO - __main__ -   label: 0
10/25/2021 14:25:47 - INFO - __main__ -   input_tokens: ['<s>', '-', '05', '00', '_20', ')', '_Ring', 'Buffer', '_', 'Add', '(&', 'self', '->', '_', 'private', '.', 'ring', 'Buffer', ',', '_(', 'void', '_*)', '&', 'on', 'Re', 'ceive', 'Args', '->', 'byte', ');', '</s>']
10/25/2021 14:25:47 - INFO - __main__ -   input_ids: 0 12 2546 612 291 43 11533 49334 1215 20763 49763 13367 46613 1215 22891 4 4506 49334 6 36 47908 49521 947 261 9064 6550 49919 46613 47692 4397 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10/25/2021 14:25:47 - INFO - __main__ -   *** Example ***
10/25/2021 14:25:47 - INFO - __main__ -   idx: 2
10/25/2021 14:25:47 - INFO - __main__ -   label: 0
10/25/2021 14:25:47 - INFO - __main__ -   input_tokens: ['<s>', '-', '04', '00', '_53', ')', '_uint', '8', '_', 't', '_data', 'To', 'Read', '[', '200', ']=', '{', '0', '};', '_-', '04', '00', '_59', ')', '_uint', '8', '_', 't', '_data', 'To', 'Read', '[', '200', ']=', '{', '0', '};', '</s>']
10/25/2021 14:25:47 - INFO - __main__ -   input_ids: 0 12 3387 612 4268 43 49315 398 1215 90 414 3972 25439 10975 2619 49659 45152 288 49423 111 3387 612 5169 43 49315 398 1215 90 414 3972 25439 10975 2619 49659 45152 288 49423 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10/25/2021 14:25:50 - INFO - __main__ -   ***** Running training *****
10/25/2021 14:25:50 - INFO - __main__ -     Num examples = 14169
10/25/2021 14:25:50 - INFO - __main__ -     Num Epochs = 5
10/25/2021 14:25:50 - INFO - __main__ -     Instantaneous batch size per GPU = 8
10/25/2021 14:25:50 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 8
10/25/2021 14:25:50 - INFO - __main__ -     Gradient Accumulation steps = 1
10/25/2021 14:25:50 - INFO - __main__ -     Total optimization steps = 8860

epoch 0 loss 0.43646: 100%|█████████▉| 1771/1772 [04:39<00:00,  6.43it/s]10/25/2021 14:30:55 - INFO - __main__ -     eval_loss = 0.433
10/25/2021 14:30:55 - INFO - __main__ -     eval_acc = 0.8654
10/25/2021 14:30:55 - INFO - __main__ -     ********************
10/25/2021 14:30:55 - INFO - __main__ -     Best acc:0.8654
10/25/2021 14:30:55 - INFO - __main__ -     ********************
10/25/2021 14:31:01 - INFO - __main__ -   Saving model checkpoint to model.bin

epoch 1 loss 0.42285: 100%|█████████▉| 1771/1772 [04:41<00:00,  5.98it/s]10/25/2021 14:36:07 - INFO - __main__ -     eval_loss = 0.393
10/25/2021 14:36:07 - INFO - __main__ -     eval_acc = 0.8654

epoch 2 loss 0.41306: 100%|█████████▉| 1771/1772 [04:42<00:00,  6.44it/s]10/25/2021 14:41:15 - INFO - __main__ -     eval_loss = 0.4125
10/25/2021 14:41:15 - INFO - __main__ -     eval_acc = 0.8654

epoch 3 loss 0.39196: 100%|█████████▉| 1771/1772 [04:45<00:00,  6.23it/s]10/25/2021 14:46:25 - INFO - __main__ -     eval_loss = 0.4612
10/25/2021 14:46:25 - INFO - __main__ -     eval_acc = 0.8584

epoch 4 loss 0.35402: 100%|█████████▉| 1771/1772 [04:43<00:00,  6.28it/s]10/25/2021 14:51:34 - INFO - __main__ -     eval_loss = 0.4892
10/25/2021 14:51:34 - INFO - __main__ -     eval_acc = 0.8419

From this log I can see the model overfits to the training set. Please take a look, and I would appreciate any advice.
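For reference, here is an illustrative sketch (helper name and fields are made up, not my exact script) of how a record like the jsonl example above could be produced from git blame output. My actual records keep the timezone/line-number prefix from plain git blame, whereas this sketch uses the porcelain format and keeps only the code:

import json
import subprocess

def blame_to_record(path, project, target, idx):
    # Run git blame in porcelain mode so commit ids and code lines are easy to split.
    blame = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True).stdout
    commits, lines = set(), []
    for line in blame.splitlines():
        if line.startswith("\t"):
            lines.append(line[1:])                 # content line (tab-prefixed)
        elif line and len(line.split()[0]) == 40:  # header line: "<sha> <orig> <final> ..."
            commits.add(line.split()[0][:9])
    return {"project": project,
            "commit_id": sorted(commits),
            "target": target,                      # 1 = buggy, 0 = clean
            "func": " ".join(lines),
            "idx": idx}

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(blame_to_record("src/Probe.cpp", "gitrisky", 0, 18596)) + "\n")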

Thanks,

guoday commented 3 years ago

The training log looks normal. You may need to check whether you correctly reload the checkpoint. Another suggestion is to run the test step on your eval dataset and see whether you can reproduce the 0.8654 score.
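For example, a rough sanity check (the paths are assumptions based on your command; this is not part of run.py) that the saved checkpoint exists and contains the fine-tuned classifier head:

import os
import torch

# "saved_models" stands in for whatever ${output_dir} was during training.
ckpt = os.path.join("saved_models", "checkpoint-best-acc", "model.bin")
state = torch.load(ckpt, map_location="cpu")

# The fine-tuned weights should include the classifier head that run.py trained;
# if these keys are missing or the load fails, the reload likely went wrong,
# which could explain the near-constant logits.
classifier_keys = [k for k in state if "classifier" in k]
print(len(state), "tensors;", classifier_keys)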

mhyeonsoo commented 3 years ago

Okay, that seems like a great idea.

Let me check one thing before I test.

For the test arguments, are these the correct inputs?

    --output_dir=${output_dir} \
    --model_type=roberta \
    --tokenizer_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \

and the model reloads the checkpoint from ${output_dir}, which contains the model.bin file. Is that correct?

Thank you.

guoday commented 3 years ago

https://github.com/microsoft/CodeXGLUE/blob/ae1d06f5505b3f71b6e1be36ee26028f17c09994/Code-Code/Defect-detection/code/run.py#L542-L544

The model reloads from "${output_dir}/checkpoint-best-acc/model.bin"
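Paraphrasing the referenced lines (not a verbatim copy), the test branch roughly does the following, so --model_name_or_path can stay microsoft/codebert-base as long as --output_dir points at the directory containing checkpoint-best-acc/model.bin:

# Rough paraphrase of the reload step in run.py's test branch (names approximate).
import os
import torch

output_dir = "saved_models"                      # your --output_dir
checkpoint_prefix = "checkpoint-best-acc/model.bin"
checkpoint_path = os.path.join(output_dir, checkpoint_prefix)

state_dict = torch.load(checkpoint_path, map_location="cpu")
# run.py then applies this to its model wrapper: model.load_state_dict(state_dict)
print("loaded", len(state_dict), "tensors from", checkpoint_path)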