shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing continued pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

During ChatGLM full-parameter continued pretraining, the loss immediately drops to 0 and val_loss = nan #125

Closed · gloryyoung closed this 1 year ago

gloryyoung commented 1 year ago

Bug description

Using the dataset bundled with the repo (天龙八部), full-parameter continued pretraining of ChatGLM-6B makes the loss drop to 0 very quickly, and eval_loss becomes NaN. (screenshot)

CUDA_VISIBLE_DEVICES=0,1,2,3 python pretraining.py \
    --model_type chatglm \
    --model_name_or_path ./chatglm-6b \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --num_train_epochs 1 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir outputs-pt-v2 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

Testing the model produced by this continued pretraining with gradio also throws an error: (screenshot)

It looks like something went wrong during the continued pretraining step.

shibing624 commented 1 year ago

Reproduced. I'll look into a fix tomorrow.

shibing624 commented 1 year ago

I tried full-parameter continued pretraining and full-parameter SFT on chatglm1-6b and chatglm2-6b, and both now succeed. Fixes:

  1. With torch_dtype=float16 the loss goes to 0. The explanation is that float16 does not have enough precision; use float32 or bfloat16 (if the GPU supports it). For LLaMA models, setting float32 is enough to run successfully.
  2. Alternatively, since I have plenty of VRAM, I manually set torch_dtype=float32 and then hit `expected scalar type Half but found Float`. I fixed it following https://github.com/mymusise/ChatGLM-Tuning/issues/179: the model still ends up behaving as if torch_dtype=float16 had been passed, so the chatglm code path needs an explicit cast to float32. Adding that cast solved the problem (see the sketch below).
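
A minimal sketch of that cast, assuming the standard Hugging Face transformers loading path; the names here are illustrative and not the exact pretraining.py code:

import torch
from transformers import AutoModel

# Load the ChatGLM checkpoint; the remote code may keep the weights in half precision.
model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

# Force float32 (or bfloat16 on GPUs that support it) before full-parameter
# training; otherwise fp16 underflow can drive the loss to 0 / NaN.
model = model.float()          # alternatively: model = model.to(torch.bfloat16)
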
gloryyoung commented 1 year ago

I tried full-parameter continued pretraining and full-parameter SFT on chatglm1-6b and chatglm2-6b, and both now succeed. Fixes:

  1. With torch_dtype=float16 the loss goes to 0. The explanation is that float16 does not have enough precision; use float32 or bfloat16 (if the GPU supports it). For LLaMA models, setting float32 is enough to run successfully.
  2. Alternatively, since I have plenty of VRAM, I manually set torch_dtype=float32 and then hit `expected scalar type Half but found Float`. I fixed it following mymusise/ChatGLM-Tuning#179: the model still ends up behaving as if torch_dtype=float16 had been passed, so the chatglm code path needs an explicit cast to float32. Adding that cast solved the problem.

Thanks a lot! I'll give it a try.

NaCloudAI commented 1 year ago

Quick question: after switching to float32, even 4x A100 80GB cannot train chatglm2. Is that expected?

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 pretraining.py \
    --model_type chatglm \
    --model_name_or_path THUDM/chatglm2-6b \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

tting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
    main()
  File "/workspace/MedicalGPT/pretraining.py", line 635, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
Traceback (most recent call last):
  File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
    state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 1; 79.19 GiB total capacity; 75.50 GiB already allocated; 245.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    main()
  File "/workspace/MedicalGPT/pretraining.py", line 635, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
    state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 2; 79.19 GiB total capacity; 75.50 GiB already allocated; 245.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
    main()
  File "/workspace/MedicalGPT/pretraining.py", line 635, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
    state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 0; 79.19 GiB total capacity; 75.50 GiB already allocated; 253.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                                                                                                                                           | 0/8 [00:15<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2079) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretraining.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2080)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2081)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2082)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@80b04983b729:/workspace/MedicalGPT# 
shibing624 commented 1 year ago

You can train with --use_peft True.

shibing624 commented 1 year ago

For a model as small as chatglm-6b, there is no need to insist on full-parameter training. In practice, LoRA training is not noticeably worse than full-parameter training; with well-chosen hyperparameters it can even do better, and it also reduces overfitting when the sample size is small.

(screenshot: Xnip2023-07-28_13-20-12)

For reference, here is the official chatglm-6b fine-tuning comparison: https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning
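
For anyone trying the LoRA route outside the training scripts, a minimal sketch using the peft library; the target module name is an assumption (the repo's --target_modules all flag resolves the modules automatically):

import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b", trust_remote_code=True, torch_dtype=torch.bfloat16
)

# LoRA hyperparameters mirroring the script defaults
# (--lora_rank 8 --lora_alpha 16 --lora_dropout 0.05).
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # assumption: ChatGLM's fused attention projection
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trained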

gloryyoung commented 1 year ago

Quick question: after switching to float32, even 4x A100 80GB cannot train chatglm2. Is that expected?

I can run it on four 3090s with dtype bfloat16 (the model is the first-generation ChatGLM-6B). You could try dropping torchrun: with torchrun you get data parallelism, and every card has to load the full model.
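
For context, dropping torchrun while keeping --device_map auto lets transformers/accelerate place different layers on different visible GPUs, so one copy of the model is split across the cards rather than replicated per process. A minimal sketch of the idea, assuming accelerate is installed; this is not the exact pretraining.py code:

import torch
from transformers import AutoModel

# device_map="auto" shards the layers across all visible GPUs
# (naive model parallelism) instead of loading the full model on each one.
model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(model.hf_device_map)  # shows which layer landed on which GPU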

zhr0313 commented 1 year ago

For a multi-node multi-GPU setup, how would I change the launch so that it does not use data parallelism? Two machines with two A100s each cannot run it data-parallel. My launch script is:

node_rank=$1
echo ${node_rank}
master_addr="10.111.112.223"

torchrun --nproc_per_node 8 --nnodes 2 --master_addr ${master_addr} --master_port 14545 --node_rank ${node_rank} run_supervised_finetuning.py ...

TomasAndersonFang commented 11 months ago

For a model as small as chatglm-6b, there is no need to insist on full-parameter training. In practice, LoRA training is not noticeably worse than full-parameter training; with well-chosen hyperparameters it can even do better, and it also reduces overfitting when the sample size is small. (screenshot: Xnip2023-07-28_13-20-12)

For reference, here is the official chatglm-6b fine-tuning comparison: https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning

Could four A100 40GB GPUs do full-parameter fine-tuning of llama-7b in bf16? I found that fp16 hits the problem in the title, but after switching to bf16 I get OOM, even with batch_size set to 1.
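
As a rough back-of-envelope estimate (assumptions: AdamW with fp32 optimizer states plus an fp32 master copy, activations not counted), plain data-parallel full-parameter training of a 7B model needs far more than 40 GB per replica, which is why per-GPU OOM is expected without sharding the states or using a PEFT method:

# Hypothetical back-of-envelope memory estimate for full-parameter bf16 training
# of a 7B-parameter model under plain data parallelism (every GPU holds a full replica).
params = 7e9
bytes_per_param = (
    2        # bf16 weights
    + 2      # bf16 gradients
    + 4      # fp32 master weights kept by mixed-precision training
    + 4 + 4  # AdamW exp_avg and exp_avg_sq in fp32
)
total_gb = params * bytes_per_param / 1024**3
print(f"~{total_gb:.0f} GB per data-parallel replica, before activations")
# ~104 GB per replica, so a 40 GB (or even 80 GB) card OOMs unless the
# states are sharded (e.g. ZeRO/FSDP) or a PEFT method like LoRA is used.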

tszslovewanpu commented 11 months ago

@gloryyoung Same here: after removing torchrun it works with any number of cards; as soon as I add it back, even two cards fail.

xingenju commented 10 months ago

@tszslovewanpu @gloryyoung Could you share your config, the one without torchrun?

tszslovewanpu commented 10 months ago

@xingenju

CUDA_VISIBLE_DEVICES=0,1 python supervised_finetuning.py \
    --model_type your_model \
    --model_name_or_path PATH \
    --train_file_dir DIR \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 6 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 4 \
    --output_dir DIR \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

wangrx33 commented 8 months ago

I tried full-parameter continued pretraining and full-parameter SFT on chatglm1-6b and chatglm2-6b, and both now succeed. Fixes:

  1. With torch_dtype=float16 the loss goes to 0. The explanation is that float16 does not have enough precision; use float32 or bfloat16 (if the GPU supports it). For LLaMA models, setting float32 is enough to run successfully.
  2. Alternatively, since I have plenty of VRAM, I manually set torch_dtype=float32 and then hit `expected scalar type Half but found Float`. I fixed it following mymusise/ChatGLM-Tuning#179: the model still ends up behaving as if torch_dtype=float16 had been passed, so the chatglm code path needs an explicit cast to float32. Adding that cast solved the problem.

I hit this with llama2 SFT as well: after a few dozen steps the loss drops to 0, and torch_dtype=float32 does not help. The hardware is 8x A100.
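
For debugging cases like this, a small Hugging Face Trainer callback (a sketch, not part of this repo) can stop the run as soon as the loss collapses, so the offending step and dtype settings can be inspected instead of burning GPU hours:

import math
from transformers import TrainerCallback

class LossCollapseCallback(TrainerCallback):
    """Stop training when the logged loss hits 0 or becomes NaN/inf."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or not math.isfinite(loss)):
            print(f"Loss collapsed to {loss} at step {state.global_step}; stopping.")
            control.should_training_stop = True
        return control

# Usage (hypothetical): trainer.add_callback(LossCollapseCallback())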