Closed eghannoum closed 1 year ago
Hi, many thanks for your interest in the repo.
I think this is the problem:
the file ./data_folder/processed_gcjpy/cached_valid.pkl
is only created after fine-tuning the model with run.py.
However, the (unclear) documentation I wrote does not say so: if someone directly reuses our model without fine-tuning, this file won't be created, which causes an error when running python get_substitutes.py.
I think you may want to fine-tune the model on the Python dataset first, as documented in https://github.com/soarsmu/attack-pretrain-models-of-code/tree/main/CodeXGLUE/Authorship-Attribution
cd code
CUDA_VISIBLE_DEVICES=4,6 python run.py \
--output_dir=./saved_models/gcjpy \
--model_type=roberta \
--config_name=microsoft/codebert-base \
--model_name_or_path=microsoft/codebert-base \
--tokenizer_name=roberta-base \
--number_labels 66 \
--do_train \
--train_data_file=../dataset/data_folder/processed_gcjpy/train.txt \
--eval_data_file=../dataset/data_folder/processed_gcjpy/valid.txt \
--test_data_file=../dataset/data_folder/processed_gcjpy/valid.txt \
--epoch 30 \
--block_size 512 \
--train_batch_size 16 \
--eval_batch_size 32 \
--learning_rate 5e-5 \
--max_grad_norm 1.0 \
--evaluate_during_training \
--seed 123456 2>&1| tee train_gcjpy.log
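Once training has run, a quick check like the one below can confirm that the cache file is in place before calling get_substitutes.py. This is only a minimal sketch, not part of the repo: the helper name missing_cache_files is hypothetical, and the cached_<split>.pkl naming is an assumption generalised from the single cached_valid.pkl path in the error message. It is meant to be run from the dataset directory.

```python
from pathlib import Path


def missing_cache_files(data_dir, splits=("valid",)):
    """Return the expected cache files that are absent from data_dir.

    Assumption: caches are named cached_<split>.pkl, based only on the
    cached_valid.pkl path reported in the FileNotFoundError.
    """
    data_dir = Path(data_dir)
    expected = [data_dir / f"cached_{split}.pkl" for split in splits]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Same relative path that get_substitutes.py complained about.
    missing = missing_cache_files("./data_folder/processed_gcjpy")
    if missing:
        for path in missing:
            print(f"missing: {path} (fine-tune with run.py --do_train first)")
    else:
        print("cache file found; get_substitutes.py should be able to load it")
```

If the script reports a missing file, the fine-tuning command above has probably not been run yet, or it was run against a different data folder.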
Feel free to let me know if you have any further questions :)
Since there have been no further questions after a month, I will close this issue. :)
Hello,
When I run this command
python get_substitutes.py \
--store_path ./data_folder/processed_gcjpy/valid_subs.jsonl \
--base_model=microsoft/codebert-base-mlm \
--eval_data_file=./data_folder/processed_gcjpy/valid.txt \
--block_size 512
in the /attack-pretrain-models-of-code/CodeXGLUE/Authorship-Attribution/dataset directory, I get this error:
FileNotFoundError: [Errno 2] No such file or directory: './data_folder/processed_gcjpy/cached_valid.pkl'