Closed gloryyoung closed 1 year ago
Reproduced it; I'll look into a fix tomorrow.
I tried full-parameter continued pretraining and full-parameter SFT on both chatglm1-6b and chatglm2-6b, and both succeeded. Solutions:
- Setting `torch_dtype=float16` gives loss=0. The explanation is that float16 lacks the precision; you need float32 or bfloat16 (if the GPU supports it). For LLaMA models, setting float32 is enough to run successfully.
- Alternatively, since I have plenty of GPU memory, I manually set `torch_dtype=float32` and then hit "expected scalar type Half but found Float". I resolved it following https://github.com/mymusise/ChatGLM-Tuning/issues/179; the setting in that issue is equivalent to passing `torch_dtype=float16`, so the chatglm code path needs an explicit cast to float32. Adding the cast fixed the problem.
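The two failure modes above can be reproduced in miniature with plain PyTorch (a sketch, not the repo's code): float16 underflows values that float32/bfloat16 preserve, and a half-precision module rejects float32 inputs until it is force-cast.

```python
import torch
import torch.nn as nn

# 1) Why loss can collapse to 0 in float16: its smallest subnormal is
#    ~5.96e-8, so tinier values round straight to zero. bfloat16 keeps
#    float32's exponent range and does not underflow here.
tiny = torch.tensor(1e-8)
print(tiny.to(torch.float16).item())   # 0.0 (underflow)
print(tiny.to(torch.bfloat16).item())  # ~1e-8 (preserved)

# 2) "expected scalar type Half but found Float": half-precision weights
#    meeting a float32 tensor. Force-casting the module to float32 fixes it.
layer = nn.Linear(4, 4).half()         # weights in float16
x = torch.randn(2, 4)                  # float32 input
try:
    layer(x)
except RuntimeError as e:
    print("dtype mismatch:", e)
out = layer.float()(x)                 # the explicit cast mentioned above
print(out.dtype)                       # torch.float32
```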
Thanks a lot!! I'll give it a try.
A question: after switching to float32, even 4x A100 80GB can't train chatglm2 — is that normal?
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 pretraining.py \
--model_type chatglm \
--model_name_or_path THUDM/chatglm2-6b \
--train_file_dir ./data/pretrain \
--validation_file_dir ./data/pretrain \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--use_peft False \
--seed 42 \
--max_train_samples 10000 \
--max_eval_samples 10 \
--num_train_epochs 0.5 \
--learning_rate 2e-4 \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--block_size 1024 \
--output_dir outputs-pt-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True
...setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
main()
File "/workspace/MedicalGPT/pretraining.py", line 635, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
self.optimizer.step()
File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
self.optimizer.step(closure)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
Traceback (most recent call last):
File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 1; 79.19 GiB total capacity; 75.50 GiB already allocated; 245.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
main()
File "/workspace/MedicalGPT/pretraining.py", line 635, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
self.optimizer.step()
File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
self.optimizer.step(closure)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 2; 79.19 GiB total capacity; 75.50 GiB already allocated; 245.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
main()
File "/workspace/MedicalGPT/pretraining.py", line 635, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
self.optimizer.step()
File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
self.optimizer.step(closure)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 0; 79.19 GiB total capacity; 75.50 GiB already allocated; 253.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 0/8 [00:15<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2079) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretraining.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-07-28_04:38:59
host : 80b04983b729
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2080)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-07-28_04:38:59
host : 80b04983b729
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2081)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-07-28_04:38:59
host : 80b04983b729
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2082)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-28_04:38:59
host : 80b04983b729
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2079)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@80b04983b729:/workspace/MedicalGPT#
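The OOM above is consistent with back-of-envelope arithmetic: under torchrun each rank holds a full fp32 weight replica plus fp32 gradients plus two AdamW moment buffers, which alone exceeds one 80 GB card for a ~6B-parameter model (illustrative numbers, not exact):

```python
# Approximate per-GPU memory for full-parameter fp32 AdamW training under
# plain data parallelism (each rank replicates everything; activations,
# buffers, and fragmentation come on top of this).
params = 6.2e9  # rough chatglm2-6b parameter count (assumption)
per_param_bytes = 4 + 4 + 4 + 4  # weights + grads + exp_avg + exp_avg_sq, all fp32
total_gib = params * per_param_bytes / 1024**3
print(f"~{total_gib:.0f} GiB per rank")  # ~92 GiB > 80 GiB, before activations
```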
You can train it with --use_peft True.
For a small model like chatglm-6b, there is no need to insist on full-parameter training. In practice LoRA training is not really worse than full-parameter training; with the hyperparameters tuned appropriately it can even do better, and it also reduces overfitting when the sample size is small.
Here is the official chatglm-6b fine-tuning comparison: https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning
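To see why LoRA changes the memory picture so drastically, here is a rough trainable-parameter estimate for a rank-8 LoRA on a chatglm-6b-sized model (the layer count and number of adapted matrices are illustrative assumptions, not the repo's exact numbers):

```python
# Back-of-envelope LoRA adapter size for a chatglm-6b-scale model.
hidden = 4096        # chatglm-6b hidden size (approx.)
rank = 8             # matches --lora_rank 8 used in this thread
layers = 28          # chatglm-6b transformer layers (approx.)
mats_per_layer = 4   # assumption: four adapted projection matrices per layer
# Each LoRA pair adds A (hidden x rank) + B (rank x hidden) parameters.
lora_params = layers * mats_per_layer * 2 * hidden * rank
frac = lora_params / 6.2e9
print(f"{lora_params / 1e6:.1f}M trainable ({frac:.2%} of the full model)")
```

With well under 1% of the parameters receiving gradients and optimizer state, the optimizer memory that triggered the OOM above all but disappears.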
I can run it on four 3090s with dtype bfloat16, using the first-generation ChatGLM-6B. You could try dropping torchrun: with torchrun you get data parallelism, so every card has to load the complete model.
How would I change multi-node multi-GPU code so it doesn't use data parallelism? Data parallel won't run on two machines with two A100s.
node_rank=$1
echo ${node_rank}
master_addr="10.111.112.223"
torchrun --nproc_per_node 8 --nnodes 2 --master_addr ${master_addr} --master_port 14545 --node_rank ${node_rank} run_supervised_finetuning.py ...
A question: can four A100 40GB GPUs do full-parameter bf16 fine-tuning of llama-7b? I found that fp16 hits the problem in the title, but after switching the precision to bf16 I get OOM, even with batch_size set to 1.
@gloryyoung Same for me: with torchrun removed it works on any number of cards; with it, even two cards fail.
@tszslovewanpu @gloryyoung Could you share your config — the one without torchrun?
@xingenju
CUDA_VISIBLE_DEVICES=0,1 python supervised_finetuning.py \
--model_type your_model \
--model_name_or_path PATH \
--train_file_dir DIR \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--use_peft True \
--fp16 \
--num_train_epochs 1 \
--learning_rate 2e-5 \
--warmup_ratio 0.05 \
--weight_decay 0.05 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 6 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 4 \
--output_dir DIR \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
--cache_dir ./cache
Doing SFT on llama2 I hit this problem too: after a few dozen training steps the loss drops to 0, and torch_dtype=float32 doesn't help. The GPUs are 8x A100.
Bug description
Using the dataset bundled with the repo (天龙八部), full-parameter pretraining of ChatGLM-6B quickly drives the loss to 0, with eval_loss = NaN.
CUDA_VISIBLE_DEVICES=0,1,2,3 python pretraining.py \
--model_type chatglm \
--model_name_or_path ./chatglm-6b \
--train_file_dir ./data/pretrain \
--validation_file_dir ./data/pretrain \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--use_peft False \
--seed 42 \
--num_train_epochs 1 \
--learning_rate 2e-4 \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--block_size 1024 \
--output_dir outputs-pt-v2 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype bfloat16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True
And testing the resulting continued-pretrained model with gradio raises the same error:
The problem most likely occurred during the continued pretraining.