microsoft / CodeBERT


finetune-msg.sh runs no steps and generates no checkpoints? #298

Open deneschen opened 9 months ago

deneschen commented 9 months ago

Why does this finetune-msg.sh script never run any steps that generate checkpoints, and instead goes straight to finished? Is this caused by something I modified?

cat finetune-msg.sh
# batch size 6 for 16 GB GPU

mnt_dir="/home/codereview"

# You may change the following block for multiple gpu training
MASTER_HOST=localhost && echo MASTER_HOST: ${MASTER_HOST}
MASTER_PORT=23333 && echo MASTER_PORT: ${MASTER_PORT}
RANK=0 && echo RANK: ${RANK}
PER_NODE_GPU=1 && echo PER_NODE_GPU: ${PER_NODE_GPU}
WORLD_SIZE=1 && echo WORLD_SIZE: ${WORLD_SIZE}
NODES=1 && echo NODES: ${NODES}
NCCL_DEBUG=INFO

# bash test_nltk.sh

# Change the arguments as required:
#   model_name_or_path, load_model_path: the path of the model to be finetuned
#   eval_file: the path of the evaluation data
#   output_dir: the directory to save the finetuned model (not used at infer/test time)
#   out_file: the path of the output file
#   train_filename: can be a directory containing files named "train*.jsonl"
#   raw_input: selects the preprocessing method; set to True for this task

python -m torch.distributed.launch --nproc_per_node ${PER_NODE_GPU} --node_rank=${RANK} --nnodes=${NODES} --master_addr=${MASTER_HOST} --master_port=${MASTER_PORT} ../run_finetune_msg.py  \
  --train_epochs 30 \
  --model_name_or_path ../../../../codereviewer \
  --output_dir ../../../../save/gen \
  --train_filename ../../../../dataset/Comment_Generation/ \
  --dev_filename ../../../../dataset/Comment_Generation/msg-valid.jsonl \
  --max_source_length 512 \
  --max_target_length 128 \
  --train_batch_size 6 \
  --learning_rate 3e-4 \
  --gradient_accumulation_steps 3 \
  --mask_rate 0.15 \
  --save_steps 1800 \
  --log_steps 100 \
  --train_steps 60000 \
  --gpu_per_node=${PER_NODE_GPU} \
  --node_index=${RANK} \
  --seed 2233 \
  --raw_input
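
As far as I can tell, --save_steps 1800 only fires inside the training loop, so if no training data gets loaded the save branch is never reached at all. A minimal sketch of that pattern (hypothetical names, not the exact code of run_finetune_msg.py):

save_steps = 1800
train_batches = []            # what you end up with if no train file matches
global_step = 0
for batch in train_batches:   # loop body never executes on an empty list
    global_step += 1
    if global_step % save_steps == 0:
        print(f"would save a checkpoint at step {global_step}")
print("Training finished.")   # reached after 0 steps, exactly as in the log below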

Here is the log from the run:

bash ./finetune-msg.sh 2>&1 | tee finetune-msg.log &
[1] 854119
(py38) root@2677f89bf865:/workspace/OpenAI/CodeBERT/CodeReviewer/code/sh# MASTER_HOST: localhost
MASTER_PORT: 23333
RANK: 0
PER_NODE_GPU: 1
WORLD_SIZE: 1
NODES: 1
10/19/2023 12:27:38 - INFO - __main__ -   Namespace(adam_epsilon=1e-08, add_lang_ids=False, beam_size=6, break_cnt=-1, config_name='Salesforce/codet5-base', cpu_count=64, debug=False, dev_filename='../../../../dataset/Comment_Generation/msg-valid.jsonl', do_eval=False, do_lower_case=False, do_test=False, do_train=False, eval_batch_size=8, eval_chunkname=None, eval_file='', eval_steps=-1, from_scratch=False, gold_filename=None, gpu_per_node=1, gradient_accumulation_steps=3, learning_rate=0.0003, load_model_path=None, local_rank=0, log_steps=100, mask_rate=0.15, max_grad_norm=1.0, max_source_length=512, max_target_length=128, model_name_or_path='../../../../codereviewer', model_type='codet5', no_cuda=False, node_index=0, out_file='', output_dir='../../../../save/gen', raw_input=True, save_steps=1800, seed=2233, start_epoch=0, task=None, test_filename=None, tokenizer_path=None, train_batch_size=6, train_epochs=30, train_filename='../../../../dataset/Comment_Generation/', train_path=None, train_steps=60000, warmup_steps=100, weight_decay=0.0)
10/19/2023 12:27:39 - INFO - torch.distributed.distributed_c10d -   Added key: store_based_barrier_key:1 to store for rank: 0
10/19/2023 12:27:39 - INFO - torch.distributed.distributed_c10d -   Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
10/19/2023 12:27:39 - WARNING - __main__ -   Process rank: 0, global rank: 0, world size: 1, bs: 6
Some weights of ReviewerModel were not initialized from the model checkpoint at ../../../../codereviewer and are newly initialized: ['cls_head.bias', 'cls_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
10/19/2023 12:27:42 - INFO - models -   Finish loading model [223M] from ../../../../codereviewer
/workspace/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
10/19/2023 12:27:47 - INFO - __main__ -   Training finished.
/workspace/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
^C
[1]+  Done                    bash ./finetune-msg.sh 2>&1 | tee finetune-msg.log

My directory structure looks like this:

(py38) root@2677f89bf865:/workspace/OpenAI/CodeBERT/CodeReviewer/code/sh# ls -l ../../../../save/gen
total 0
(py38) root@2677f89bf865:/workspace/OpenAI/CodeBERT/CodeReviewer/code/sh# ls -l ../../../../codereviewer
total 878388
-rw-rw-r-- 1 1001 1001       956 Aug 23 13:04 README.md
-rw-rw-r-- 1 1001 1001      1869 Aug 23 13:04 added_tokens.json
-rw-rw-r-- 1 1001 1001      2130 Aug 23 13:04 config.json
-rw-rw-r-- 1 1001 1001       168 Aug 23 13:04 generation_config.json
-rw-r--r-- 1 root root   3401786 Oct 15 06:55 golds.txt
-rw-rw-r-- 1 1001 1001    294364 Aug 23 13:04 merges.txt
-rw-r--r-- 1 root root   3160086 Oct 15 06:55 preds.txt
-rw-rw-r-- 1 1001 1001 892005683 Aug 23 13:11 pytorch_model.bin
-rw-rw-r-- 1 1001 1001       913 Aug 23 13:04 special_tokens_map.json
-rw-rw-r-- 1 1001 1001      1287 Aug 23 13:04 tokenizer_config.json
-rw-rw-r-- 1 1001 1001    575045 Aug 23 13:04 vocab.json
(py38) root@2677f89bf865:/workspace/OpenAI/CodeBERT/CodeReviewer/code/sh# ls -l ../../../../dataset/Comment_Generation/
total 4072956
-rw-r--r-- 1 1001 1001  275294476 Jul 25  2022 msg-test.jsonl
-rw-r--r-- 1 1001 1001 3621670365 Jul 25  2022 msg-train.jsonl
-rw-r--r-- 1 1001 1001  273738946 Jul 25  2022 msg-valid.jsonl
(py38) root@2677f89bf865:/workspace/OpenAI/CodeBERT/CodeReviewer/code/sh# ls -l ../../../../dataset/Comment_Generation/msg-valid.jsonl
-rw-r--r-- 1 1001 1001 273738946 Jul 25  2022 ../../../../dataset/Comment_Generation/msg-valid.jsonl
deneschen commented 9 months ago

It turns out there is a bug in your code... the filename prefix it filters on does not match the data file names.

diff --git a/CodeReviewer/code/run_finetune_msg.py b/CodeReviewer/code/run_finetune_msg.py
index 865530b..d6cf954 100644
--- a/CodeReviewer/code/run_finetune_msg.py
+++ b/CodeReviewer/code/run_finetune_msg.py
@@ -169,7 +169,7 @@ def main(args):
     train_file = args.train_filename
     valid_file = args.dev_filename
     if os.path.isdir(train_file):
-        train_files = [file for file in os.listdir(train_file) if file.startswith("train") and file.endswith(".jsonl")]
+        train_files = [file for file in os.listdir(train_file) if file.startswith("msg-train") and file.endswith(".jsonl")]
     else:
         train_files = [train_file]
     random.seed(args.seed)
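
A quick throwaway check (not repo code) confirms the mismatch:

import os

data_dir = "../../../../dataset/Comment_Generation/"  # same path as in the script
old = [f for f in os.listdir(data_dir) if f.startswith("train") and f.endswith(".jsonl")]
new = [f for f in os.listdir(data_dir) if f.startswith("msg-train") and f.endswith(".jsonl")]
print(old)  # []: "msg-train.jsonl" does not start with "train"
print(new)  # ['msg-train.jsonl']

With train_files empty there is nothing to train on, so the script logs "Training finished." right after loading the model and never reaches a save step.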
celbree commented 9 months ago

Yes. If you find a bug in the code, please submit a pull request to help us fix it. Thanks!
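
One possible shape for such a PR, sketched from nothing more than the filenames shown in this thread, would be to accept both naming conventions:

# Hypothetical sketch for a PR, not merged code: accept both
# "train*.jsonl" and "msg-train*.jsonl" dataset layouts.
import os

def list_train_files(train_dir):
    return [f for f in os.listdir(train_dir)
            if f.endswith(".jsonl")
            and (f.startswith("train") or f.startswith("msg-train"))]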