microsoft / CodeBERT


Unable to run finetune script for CodeReviewer Quality Estimation task #240

Closed: sergiogrz closed this issue 1 year ago

sergiogrz commented 1 year ago

Hello,

I have found your CodeReviewer project really interesting, so I would like to learn how to use it and try it. Following the README instructions, I have downloaded the dataset from Zenodo and the pre-trained checkpoint from Huggingface, and adjusted some path arguments in the finetune-cls.sh script.

My problem is that when I try to run the script, I receive the following error message:

MASTER_HOST: localhost
MASTER_PORT: 23333
RANK: 0
PER_NODE_GPU: 1
WORLD_SIZE: 1
NODES: 1
[nltk_data] Downloading package punkt to
[nltk_data]     /home/sgrodriguez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
03/29/2023 08:48:27 - INFO - __main__ -   Namespace(task=None, model_type='codet5', add_lang_ids=False, from_scratch=False, debug=False, start_epoch=0, train_epochs=30, tokenizer_path=None, output_dir='../../save/cls', load_model_path=None, model_name_or_path='../../save/cls/checkpoints-last/', train_path=None, eval_chunkname=None, train_filename='../../dataset/Diff_Quality_Estimation', dev_filename='../../dataset/Diff_Quality_Estimation/cls-valid.jsonl', test_filename=None, gold_filename=None, config_name='Salesforce/codet5-base', max_source_length=512, max_target_length=128, do_train=False, do_eval=False, do_test=False, raw_input=False, do_lower_case=False, no_cuda=False, train_batch_size=12, eval_batch_size=8, gradient_accumulation_steps=3, learning_rate=0.0003, mask_rate=0.15, beam_size=6, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, save_steps=3600, log_steps=100, eval_steps=-1, eval_file='', out_file='', break_cnt=-1, train_steps=120000, warmup_steps=100, gpu_per_node=1, node_index=0, local_rank=-1, seed=2233, cpu_count=8)
03/29/2023 08:48:27 - INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 0
03/29/2023 08:48:27 - INFO - torch.distributed.distributed_c10d -   Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
03/29/2023 08:48:27 - WARNING - __main__ -   Process rank: 0, global rank: 0, world size: 1, bs: 12
Some weights of ReviewerModel were not initialized from the model checkpoint at ../../save/cls/checkpoints-last/ and are newly initialized: ['cls_head.weight', 'cls_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/29/2023 08:48:28 - INFO - models -   Finish loading model [223M] from ../../save/cls/checkpoints-last/
Traceback (most recent call last):
  File "/home/sgrodriguez/CodeReviewer/code/sh/../run_finetune_cls.py", line 302, in <module>
    main(args)
  File "/home/sgrodriguez/CodeReviewer/code/sh/../run_finetune_cls.py", line 124, in main
    model.load_state_dict(
  File "/home/sgrodriguez/miniconda3/envs/code_reviewer/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ReviewerModel:
        Missing key(s) in state_dict: "cls_head.weight", "cls_head.bias". 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10743) of binary: /home/sgrodriguez/miniconda3/envs/code_reviewer/bin/python
Traceback (most recent call last):
  File "/home/sgrodriguez/miniconda3/envs/code_reviewer/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
  File "/home/sgrodriguez/miniconda3/envs/code_reviewer/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/sgrodriguez/miniconda3/envs/code_reviewer/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/sgrodriguez/miniconda3/envs/code_reviewer/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/sgrodriguez/miniconda3/envs/code_reviewer/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sgrodriguez/miniconda3/envs/code_reviewer/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../run_finetune_cls.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------

Could you please lend me a hand with this issue? Am I missing something?

Thanks in advance.

celbree commented 1 year ago

Please check whether you have modified the model loading part or the model path. When you first run the fine-tuning script, it handles this error in https://github.com/microsoft/CodeBERT/blob/master/CodeReviewer/code/models.py#L199. In your log, however, it seems you are loading checkpoints-last, which is not expected when you fine-tune from the pre-trained model.
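
To be concrete, the tolerated case is roughly the following (a simplified sketch of the idea, not the actual models.py code; the function name and checkpoint path are placeholders):

import torch

def load_pretrained_for_finetuning(model, ckpt_path):
    # Hypothetical sketch: load pre-trained weights while tolerating a missing
    # classification head, which is then trained from scratch during fine-tuning.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    try:
        # Strict loading only succeeds if the checkpoint already contains cls_head.*
        model.load_state_dict(state_dict)
    except RuntimeError:
        # A pre-trained (not yet fine-tuned) checkpoint has no cls_head.weight /
        # cls_head.bias, so load the remaining weights and keep the freshly
        # initialized classification head.
        model.load_state_dict(state_dict, strict=False)
    return model

Judging from your traceback, the call that fails is a different one in run_finetune_cls.py: it resumes from checkpoints-last with strict loading, so a checkpoint without cls_head.* cannot be loaded there.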

sergiogrz commented 1 year ago

Thanks for your response. I'm not sure I got it right. I ran it again after changing the --model_name_or_path argument back to microsoft/codereviewer, but I get the same error when running finetune-cls.sh with the following arguments:

torchrun --nproc_per_node ${PER_NODE_GPU} --node_rank=${RANK} --nnodes=${NODES} --master_addr=${MASTER_HOST} --master_port=${MASTER_PORT} ../run_finetune_cls.py  \
  --train_epochs 30 \
  --model_name_or_path microsoft/codereviewer \
  --output_dir ../../save/cls \
  --train_filename ../../dataset/Diff_Quality_Estimation \
  --dev_filename ../../dataset/Diff_Quality_Estimation/cls-valid.jsonl \
  --max_source_length 512 \
  --max_target_length 128 \
  --train_batch_size 12 \
  --learning_rate 3e-4 \
  --gradient_accumulation_steps 3 \
  --mask_rate 0.15 \
  --save_steps 3600 \
  --log_steps 100 \
  --train_steps 120000 \
  --gpu_per_node=${PER_NODE_GPU} \
  --node_index=${RANK} \
  --seed 2233 

celbree commented 1 year ago

Could you check whether the path ../../save/cls/checkpoints-last/ exists? It is not supposed to exist at this point. If it does, please delete it (or back it up somewhere else if you need it) and run the fine-tuning again.
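
If it helps, something like this (a minimal sketch, using the path from your log) moves the directory out of the way before re-running:

import shutil
from pathlib import Path

ckpt_dir = Path("../../save/cls/checkpoints-last")
if ckpt_dir.exists():
    # Move the stale checkpoint aside so fine-tuning starts fresh from
    # microsoft/codereviewer instead of trying to resume from this directory.
    shutil.move(str(ckpt_dir), str(ckpt_dir.with_name("checkpoints-last.bak")))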

sergiogrz commented 1 year ago

Ahhh, that's it! I had mistakenly placed the model checkpoint downloaded from Huggingface in ../../save/cls/checkpoints-last/. Thank you so much @celbree!