soarsmu / attack-pretrain-models-of-code

Replication Package for "Natural Attack for Pre-trained Models of Code", ICSE 2022

Inconsistent Reproduction Results #79

Closed Kiki2049 closed 1 year ago

Kiki2049 commented 1 year ago

I used the Defect-detection task in CodeXGLUE to attack a fine-tuned CodeBERT model, following the README; the parameters were similar to the following:

# fine-tuning
python run.py \
    --output_dir=./adv_saved_models \
    --model_type=roberta \
    --tokenizer_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --do_train \
    --train_data_file=../preprocess/dataset/train.jsonl \
    --eval_data_file=../preprocess/dataset/valid.jsonl \
    --test_data_file=../preprocess/dataset/test.jsonl \
    --epoch 5 \
    --block_size 512 \
    --train_batch_size 24 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456  2>&1 | tee train.log

# attack
python gi_attack.py \
    --output_dir=./saved_models \
    --model_type=roberta \
    --tokenizer_name=microsoft/codebert-base-mlm \
    --model_name_or_path=microsoft/codebert-base-mlm \
    --csv_store_path ./attack_genetic_400_800.csv \
    --base_model=microsoft/codebert-base-mlm \
    --use_ga \
    --train_data_file=../preprocess/dataset/train_subs.jsonl \
    --eval_data_file=../preprocess/dataset/test_subs_400_800.jsonl \
    --test_data_file=../preprocess/dataset/test_subs.jsonl \
    --block_size 512 \
    --eval_batch_size 64 \
    --seed 123456  > attack_gi_400_800.log 2>&1 &

I completed the attack on the 2,732 test examples, but found that the ASR was higher than reported: 53.62% in the paper vs. about 65.19% in my run. I computed ASR as Successful items count / Total count from the logs.

I'm wondering whether this is a bug in my setup or just experimental variance. Thanks!
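
Concretely, this is roughly how I compute that ratio (a minimal sketch, not a script from the repo; it assumes the gi_attack.py log contains lines of the form "Successful items count:  N" and "Total count:  N", with the last occurrence being the final tally):

# asr_from_log.py (illustrative sketch only)
import re

def asr_from_log(path):
    """Compute ASR = successful / total from a gi_attack.py log.
    Keeps the last occurrence of each counter, i.e. the final tally."""
    succ = total = None
    with open(path) as f:
        for line in f:
            m = re.match(r"Successful items count:\s*(\d+)", line)
            if m:
                succ = int(m.group(1))
            m = re.match(r"Total count:\s*(\d+)", line)
            if m:
                total = int(m.group(1))
    return succ / total

print(asr_from_log("attack_gi_400_800.log"))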

yangzhou6666 commented 1 year ago

Hi,

Thanks for your interest.

In your script, the result covers only part of the dataset, i.e., the 400_800 chunk (to speed things up, we previously split the dataset into multiple chunks).

You may want to obtain the results on the full dataset.
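
For example, a loop like the following would cover every chunk (a hypothetical sketch, not our actual tooling; it assumes the chunk files follow the test_subs_<start>_<end>.jsonl naming in 400-example steps and reuses the flags from your attack command):

import subprocess

# Hypothetical driver: run gi_attack.py once per 400-example chunk.
for start in range(0, 2800, 400):
    end = start + 400
    with open(f"attack_gi_{start}_{end}.log", "w") as log:
        subprocess.run(
            [
                "python", "gi_attack.py",
                "--output_dir=./saved_models",
                "--model_type=roberta",
                "--tokenizer_name=microsoft/codebert-base-mlm",
                "--model_name_or_path=microsoft/codebert-base-mlm",
                f"--csv_store_path=./attack_genetic_{start}_{end}.csv",
                "--base_model=microsoft/codebert-base-mlm",
                "--use_ga",
                "--train_data_file=../preprocess/dataset/train_subs.jsonl",
                f"--eval_data_file=../preprocess/dataset/test_subs_{start}_{end}.jsonl",
                "--test_data_file=../preprocess/dataset/test_subs.jsonl",
                "--block_size", "512",
                "--eval_batch_size", "64",
                "--seed", "123456",
            ],
            stdout=log,
            stderr=subprocess.STDOUT,
            check=True,
        )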

Kiki2049 commented 1 year ago

Thanks for your answer.

I understand that you chunked the dataset to run faster. I also used the chunked data provided in the repository to run the attack on all chunks; the results are as follows:

# 0_400
Example time cost:  0.0 min
ALL examples time cost:  42.73 min
Query times in this attack:  1
All Query times:  193268
Success rate:  0.5905172413793104
Successful items count:  137
Total count:  232
Index:  399

# 400_800
Example time cost:  0.0 min
ALL examples time cost:  71.4 min
Query times in this attack:  1
All Query times:  206987
Success rate:  0.6758893280632411
Successful items count:  171
Total count:  253
Index:  399

# 800_1200
Example time cost:  0.0 min
ALL examples time cost:  48.63 min
Query times in this attack:  1
All Query times:  142842
Success rate:  0.7283464566929134
Successful items count:  185
Total count:  254
Index:  399

# 1200_1600
>> ACC! i => ori (0.53700 => 0.52459)
>> SUC! B => U (0.52459 => 0.49986)
Example time cost:  0.04 min
ALL examples time cost:  56.49 min
Query times in this attack:  138
All Query times:  188646
Success rate:  0.6175298804780877
Successful items count:  155
Total count:  251
Index:  399

# 1600_2000
Example time cost:  0.56 min
ALL examples time cost:  45.47 min
Query times in this attack:  2586
All Query times:  166986
Success rate:  0.6370967741935484
Successful items count:  158
Total count:  248
Index:  399

# 2000_2400
Example time cost:  0.07 min
ALL examples time cost:  57.91 min
Query times in this attack:  200
All Query times:  167304
Success rate:  0.6436781609195402
Successful items count:  168
Total count:  261
Index:  399

# 2400_2800
Example time cost:  0.05 min
ALL examples time cost:  49.66 min
Query times in this attack:  194
All Query times:  145886
Success rate:  0.6666666666666666
Successful items count:  146
Total count:  219
Index:  331

These results seem even better than the logs in the dataset_and_results.zip you provided, so I'm worried that I'm doing something wrong.
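
For completeness, pooling the per-chunk counts above (summing successes and totals, rather than averaging the per-chunk rates) reproduces the overall ASR from my first comment:

# Per-chunk (successful, total) counts copied from the logs above.
chunks = [
    (137, 232),  # 0_400
    (171, 253),  # 400_800
    (185, 254),  # 800_1200
    (155, 251),  # 1200_1600
    (158, 248),  # 1600_2000
    (168, 261),  # 2000_2400
    (146, 219),  # 2400_2800
]

succ = sum(s for s, _ in chunks)   # 1120
total = sum(t for _, t in chunks)  # 1718
print(f"overall ASR = {succ}/{total} = {succ / total:.4f}")  # ~0.6519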

One noteworthy point: in the README, the fine-tuning step uses train_data_file=../preprocess/dataset/adv_train.jsonl. I'd venture to guess that this is a typo, so I used train_data_file=../preprocess/dataset/train.jsonl instead.

yangzhou6666 commented 1 year ago

One possibility is that the genetic algorithm, which is random by nature, may lead to different results in each run, and even on different machines.
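
Note that setting --seed alone may not pin every source of randomness. A minimal sketch of seeding all the usual sources is below (whether our scripts already cover all of these is an assumption here, and even this does not guarantee identical results across machines or CUDA versions):

import os
import random

import numpy as np
import torch

def set_seed(seed: int = 123456) -> None:
    """Pin the common RNG sources; cross-machine determinism is still not guaranteed."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # cuDNN autotuning can pick different kernels per machine; disabling it
    # trades speed for (more) determinism.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False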

Kiki2049 commented 1 year ago

Thanks for your answer.