soarsmu / attack-pretrain-models-of-code

Replication Package for "Natural Attack for Pre-trained Models of Code", ICSE 2022
MIT License

cached_valid.pkl is not found #78

Closed eghannoum closed 1 year ago

eghannoum commented 1 year ago

Hello,

When I run this command in the /attack-pretrain-models-of-code/CodeXGLUE/Authorship-Attribution/dataset directory:

python get_substitutes.py \
    --store_path ./data_folder/processed_gcjpy/valid_subs.jsonl \
    --base_model=microsoft/codebert-base-mlm \
    --eval_data_file=./data_folder/processed_gcjpy/valid.txt \
    --block_size 512

I get this error:

FileNotFoundError: [Errno 2] No such file or directory: './data_folder/processed_gcjpy/cached_valid.pkl'

yangzhou6666 commented 1 year ago

Hi, many thanks for your interest in the repo.

I think this is the problem.

The file ./data_folder/processed_gcjpy/cached_valid.pkl is only created after fine-tuning the model using run.py.

However, the documentation I wrote is unclear on this: if someone directly reuses our model without fine-tuning, this file won't be created, which causes an error when running python get_substitutes.py.

I think you may want to fine-tune the model on the Python dataset first, as documented in https://github.com/soarsmu/attack-pretrain-models-of-code/tree/main/CodeXGLUE/Authorship-Attribution:

cd code
CUDA_VISIBLE_DEVICES=4,6 python run.py \
    --output_dir=./saved_models/gcjpy \
    --model_type=roberta \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=roberta-base \
    --number_labels 66 \
    --do_train \
    --train_data_file=../dataset/data_folder/processed_gcjpy/train.txt \
    --eval_data_file=../dataset/data_folder/processed_gcjpy/valid.txt \
    --test_data_file=../dataset/data_folder/processed_gcjpy/valid.txt \
    --epoch 30 \
    --block_size 512 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 5e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456 2>&1| tee train_gcjpy.log
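
Before running get_substitutes.py, it may help to confirm that the cache file already exists. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and the naming rule (cached_<name>.pkl next to the data file) is an assumption inferred only from the error message above.

```python
from pathlib import Path


def expected_cache_path(eval_data_file: str) -> Path:
    """Return the cache path assumed for a given data file.

    Assumption (inferred from the FileNotFoundError above, not from the
    repository code): valid.txt maps to cached_valid.pkl in the same folder.
    """
    data_path = Path(eval_data_file)
    return data_path.with_name(f"cached_{data_path.stem}.pkl")


def cache_is_ready(eval_data_file: str) -> bool:
    """True if the assumed cache file exists on disk."""
    return expected_cache_path(eval_data_file).is_file()


if __name__ == "__main__":
    # Same path as passed to --eval_data_file in the failing command.
    eval_file = "./data_folder/processed_gcjpy/valid.txt"
    if cache_is_ready(eval_file):
        print("Cache file found.")
    else:
        print(f"Missing {expected_cache_path(eval_file)}; "
              "fine-tune with run.py (--do_train) first.")
```

Run it from the dataset directory so the relative path resolves the same way as in the failing command.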

yangzhou6666 commented 1 year ago

Feel free to let me know if you have any further questions :)

yangzhou6666 commented 1 year ago

Since there have been no further questions after a month, I will close this issue. :)