Closed: binsson closed this issue 11 months ago
Does it run on a single GPU?
Single GPU works. With multiple GPUs I also tried a small dataset and got the same error.
It feels like ChatGLM3's multi-GPU support is somewhat broken. Could you try running with DeepSpeed?
OK, I'll give it a try.
Has this been resolved?
I've already tried deepspeed==0.10.1 and deepspeed==0.13.4; neither works, but ChatGLM2 fine-tunes fine.
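For anyone else following the DeepSpeed suggestion above, a minimal ZeRO stage-2 configuration with CPU optimizer offload is a common starting point for exactly this kind of backward-pass OOM, since it shards optimizer state and gradients across GPUs. The file name `ds_config.json` and the specific values below are illustrative assumptions, not settings confirmed in this thread:

```shell
# Write a minimal DeepSpeed ZeRO-2 config (illustrative values; "auto" lets
# the HF Trainer fill in the matching command-line arguments).
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
EOF
# Then launch with deepspeed instead of accelerate, keeping the same training
# arguments as in the reproduction command and pointing --deepspeed at the config:
# deepspeed src/train_bash.py --deepspeed ds_config.json --stage sft ...
```

ZeRO-2 with optimizer offload trades some throughput for a substantially lower per-GPU memory footprint; if it still OOMs, stage 3 shards the parameters as well.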
Reminder
Reproduction
accelerate launch src/train_bash.py \
    --stage sft \
    --model_name_or_path ../chatglm3-6b \
    --do_train \
    --dataset babycare \
    --template default \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 10 \
    --plot_loss \
    --fp16
The multi-GPU fine-tuning arguments are above; every run produces the error below. What is the cause?

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 23.65 GiB total capacity; 22.93 GiB already allocated; 180.06 MiB free; 22.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
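The error message itself points at one mitigation: setting `max_split_size_mb` through the `PYTORCH_CUDA_ALLOC_CONF` environment variable so the caching allocator fragments less when reserved memory greatly exceeds allocated memory. A minimal sketch, assuming the 128 MiB value as an arbitrary example rather than a tuned setting:

```shell
# Limit the block size the PyTorch caching allocator will split, to reduce
# fragmentation; 128 MiB here is an example value, not a recommendation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then launch training as in the reproduction command above, e.g.:
# accelerate launch src/train_bash.py --stage sft ...
```

This only helps with fragmentation; if the 6B model plus LoRA optimizer state genuinely exceeds ~24 GiB per GPU, sharding (e.g. DeepSpeed ZeRO) or a smaller footprint is still needed.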
Expected behavior
No response
System Info
No response
Others
No response