deepspeed多卡训练Mixtral，八张H800爆显存，求大神帮忙看看

hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)

Apache License 2.0

30.09k stars 3.71k forks source link

Reminder

[x] I have read the README and searched the existing issues.

Reproduction

deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port=9901 src/train_bash.py \ --deepspeed ds_config.json \ --stage sft \ --model_name_or_path /home/workspace/LLaMA-Factory-main/Mixtral-8x7B-v0.1 \ --do_train \ --dataset train \ --template mistral \ --finetuning_type lora \ --lora_target query_key_value \ --output_dir mixtral_sft \ --overwrite_cache \ --overwrite_output_dir \ --per_device_train_batch_size 4 \ --gradient_accumulation_steps 4 \ --lr_scheduler_type cosine \ --logging_steps 10 \ --save_steps 10000 \ --learning_rate 5e-5 \ --num_train_epochs 3 \ --num_layer_trainable 1 \ --plot_loss \ --fp16

Expected behavior

正常八张H800应该不会出现超出显存的情况吧，搜索了好多方法都没用，是因为我的配置出现什么问题了吗

System Info

报错信息如下：

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 6 has a total capa city of 79.11 GiB of which 100.69 MiB is free. Including non-PyTorch memory, this process has 79.00 G iB memory in use. Of the allocated memory 76.04 GiB is allocated by PyTorch, and 113.81 MiB is reserv ed by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_A LLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) Loading checkpoint shards: 79%|█████████████████████████████▏ | 15/19 [00:42<00:10, 2.69s/it] [2024-03-20 17:56:30,593] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2756559 [2024-03-20 17:56:31,072] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2756560 [2024-03-20 17:56:31,587] [ERROR] [launch.py:322:sigkill_handler] ['/data/anaconda3/envs/mixtral_env/ bin/python', '-u', 'src/train_bash.py', '--local_rank=7', '--deepspeed', 'ds_config.json', '--stage', 'sft', '--model_name_or_path', '/home/workspace/LLaMA-Factory-main/Mixtral-8x7B-v0.1', '--do_train', '--dataset', 'train', '--template', 'mistral', '--finetuning_type', 'lora', '--loratarget', 'query key_value', '--output_dir', 'mixtral_sft', '--overwrite_cache', '--overwrite_output_dir', '--per_devi ce_train_batch_size', '2', '--gradient_accumulation_steps', '2', '--per_device_eval_batch_size', '2', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--save_steps', '10000', '--learning_rate' , '5e-5', '--num_train_epochs', '3', '--num_layer_trainable', '1', '--plot_loss', '--fp16'] exits wit h return code = 1

Others

ds_config.json文件如下： { "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "zero_allow_untested_optimizer": true, "fp16": { "enabled": "auto", "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "zero_optimization": { "stage": 2, "allgather_partitions": true, "allgather_bucket_size": 5e8, "reduce_scatter": true, "reduce_bucket_size": 5e8, "overlap_comm": false, "contiguous_gradients": true } }

感谢，这个问题已经解决了，不过还有一个无法解决的bug，能麻烦帮忙看一下是哪的问题吗？我在训练数据后用README中使用“合并 LoRA 权重并导出模型”的命令，但是会出现如下bug： Traceback (most recent call last): File "/home/LLaMA-Factory/src/export_model.py", line 9, in main() File "/home/LLaMA-Factory/src/export_model.py", line 5, in main export_model() File "/home/LLaMA-Factory/src/llmtuner/train/tuner.py", line 52, in export_model model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/LLaMA-Factory/src/llmtuner/model/loader.py", line 146, in load_model_and_tokenizer model = load_model(tokenizer, model_args, finetuning_args, is_trainable, add_valuehead) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/LLaMA-Factory/src/llmtuner/model/loader.py", line 94, in load_model model = init_adapter(model, model_args, finetuning_args, is_trainable) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/LLaMA-Factory/src/llmtuner/model/adapter.py", line 110, in init_adapter model: "LoraModel" = PeftModel.from_pretrained( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/anaconda3/envs/mixtral_env/lib/python3.11/site-packages/peft/peft_model.py", line 353, in from_pretrained model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, kwargs) File "/data/anaconda3/envs/mixtral_env/lib/python3.11/site-packages/peft/peft_model.py", line 694, in load_adapter adapters_weights = load_peft_weights(model_id, device=torch_device, hf_hub_download_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/anaconda3/envs/mixtral_env/lib/python3.11/site-packages/peft/utils/save_and_load.py", line 326, in load_peft_weights adapters_weights = safe_load_file(filename, device=device) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/anaconda3/envs/mixtral_env/lib/python3.11/site-packages/safetensors/torch.py", line 308, in load_file with safe_open(filename, framework="pt", device=device) as f: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization

我在网上查询了很多关于这个报错的信息，但是没有好的解决方案，并且我也用过很多不同的方法去合并模型，但是都会报这一个错误。下面是我的合并权重模型命令配置： CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python src/export_model.py \ --model_name_or_path ./Mixtral-8x7B-v0.1 \ --adapter_name_or_path ./mixtral_sft \ --template default \ --finetuning_type lora \ --export_dir ../Mixtral_stf \ --export_size 2 \ --export_legacy_format False

hiyouga / LLaMA-Factory