Closed: l1905 closed this issue 10 months ago
The exact environment issue is unclear. Possible fix: 1. downgrade the transformers version.
Same situation here. Has this been resolved?
value cannot be converted to type int without overflow
Has this been resolved? I ran into the same problem.
For Baichuan models, use transformers==4.33.2.
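For reference, the version pin suggested above can be captured in a requirements fragment (only transformers is pinned here; the accelerate line reflects the resolution reported later in this thread of upgrading to the latest release — treat both as assumptions to verify against your own setup):

```
# requirements fragment (assumption: other project pins unchanged)
transformers==4.33.2
accelerate    # upgrade to the latest release, per the resolution in this thread
```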
I hit this problem while fine-tuning chatglm3-6b, and I already have transformers==4.33.2. Could there be another cause?
What error are you seeing? I got chatglm3-6b multi-GPU training running today.
What error are you seeing? I got chatglm3-6b multi-GPU training running today.

When I train chatglm3-6b on two GPUs with torchrun, I get the following error:
File "supervised_finetuning.py", line 1325, in <module>
    main()
File "supervised_finetuning.py", line 1286, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/transformers/trainer.py", line 1682, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
File "/root/paddlejob/workspace/env_run/llm/yes/envs/medicalGPT/lib/python3.8/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow
Here transformers==4.33.2, and I have already uninstalled xformers and bitsandbytes. Is there anything else that could be wrong?
What error are you seeing? I got chatglm3-6b multi-GPU training running today.

Could you share the script from your successful run? Mine is:
cd /root/paddlejob/workspace/env_run/llm/MedicalGPT
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 supervised_finetuning.py \
--model_type chatglm \
--model_name_or_path /root/paddlejob/workspace/env_run/llm/chatglm3-6b \
--train_file_dir ./data/finetune \
--validation_file_dir ./data/finetune \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--do_train \
--do_eval \
--use_peft True \
--fp16 \
--max_train_samples 1000 \
--max_eval_samples 10 \
--num_train_epochs 3 \
--learning_rate 2e-5 \
--warmup_ratio 0.05 \
--weight_decay 0.05 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 4 \
--output_dir outputs-sft-chatglm3-genrec \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
--cache_dir ./cache
The same error also occurs when I switch the base model to vicuna-v1.5.
Solved: updating accelerate to the latest version fixed it.
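To confirm which versions are actually installed after the upgrade, here is a minimal check using only the Python standard library (the package names are the usual PyPI distribution names; adjust if your environment differs):

```python
# Print installed versions of the packages discussed in this thread.
# Minimal sketch using only the standard library (Python 3.8+).
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    for pkg in ("transformers", "accelerate", "torch"):
        print(f"{pkg}: {installed_version(pkg)}")
```

Running this inside the conda environment used for training shows at a glance whether the accelerate upgrade took effect and whether transformers is still pinned to the expected version.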
Describe the bug
Launching SFT with
CUDA_VISIBLE_DEVICES=1,2,3,4,5 python supervised_finetuning.py \
raises no error, but in torchrun mode it reports: value cannot be converted to type int without overflow.
Two weeks ago this ran without errors. This time I set up a fresh conda environment and pulled the latest code, and a similar error appeared. I could not find reports of a similar error online, so it seems to be caused by a dependency package having been updated to its latest release.
Launch command:
Full error output:
Dependency package info
Python version
Environment info
Single GPU with 40 GB of VRAM
Reproduced with SFT of both baichuan2-7b and baichuan2-13b.