hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

L40S 4*48G, LoRA CodeLlama-7b-hf-safetensors OOM: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #3169

Closed hgcdanniel closed 5 months ago

hgcdanniel commented 6 months ago

### Reminder

### Reproduction

```
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

```
Converting format of dataset (num_proc=4): 100%|███████████████████████████████████| 3000/3000 [00:00<00:00, 23927.57 examples/s]
/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)
04/07/2024 20:36:05 - INFO - llmtuner.data.loader - Loading dataset glaive_toolcall_10k.json...
04/07/2024 20:36:05 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at ../../data/glaive_toolcall_10k.json.
Converting format of dataset (num_proc=4): 100%|███████████████████████████████████| 3000/3000 [00:00<00:00, 20282.00 examples/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3148977 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3148978 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3148979 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 3148980) of binary: /home/dandan.song/anaconda3/envs/llama_factory_stable/bin/python
Traceback (most recent call last):
  File "/home/dandan.song/anaconda3/envs/llama_factory_stable/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1014, in launch_command
    multi_gpu_launcher(args)
  File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
../../src/train_bash.py FAILED
```

```
Failures:

Root Cause (first observed failure):
[0]:
  time      : 2024-04-07_20:36:07
  host      : sttc-gpu-02
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 3148980)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

My script is `lora_multi_gpu/single_node.sh`:

```bash
#!/bin/bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file ../accelerate/single_config.yaml \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path CodeLlama-7b-hf-safetensors \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 128 \
    --preprocessing_num_workers 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
```

My `single_config.yaml` is:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
main_process_ip: 192.168.0.1
main_process_port: 29013
num_machines: 1 # the number of nodes
num_processes: 4 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

### Expected behavior

Whether I use DeepSpeed or Accelerate, LoRA fine-tuning of CodeLlama-7b-hf-safetensors always hits OOM, followed by torch.distributed.elastic.multiprocessing.errors.ChildFailedError. In principle, a model this small should not OOM with LoRA, should it? Or does LoRA work by first loading the full set of parameters and then fine-tuning only a subset of them?

### System Info

_No response_

### Others

_No response_
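(Editor's note on the question above: with PEFT-style LoRA, the full frozen base model is still loaded on every DDP rank, and only the small adapter matrices receive gradients. The sketch below is a minimal illustration, not LLaMA-Factory internals; it assumes the `transformers` and `peft` packages are installed and that `CodeLlama-7b-hf-safetensors` is the local checkpoint path used in the script above.)

```python
# Minimal sketch: load a base model the way a LoRA run would, attach adapters,
# and inspect how many parameters are actually trainable.
# Assumes `transformers` and `peft` are installed; the path below is the
# local checkpoint path taken from the launch script, not a Hub model id.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "CodeLlama-7b-hf-safetensors",
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# Only the adapter weights are trainable, but the full ~7B frozen base weights
# (roughly 13 GB in fp16) stay resident, and each DDP rank keeps its own copy.
model.print_trainable_parameters()
```

Since ~13 GB of fp16 weights per rank is well under 48 GB per L40S, an OOM right at startup more likely points at how the four processes are mapped onto the GPUs than at LoRA itself.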
hiyouga commented 6 months ago

Can it run on a single GPU?

hgcdanniel commented 6 months ago

A single GPU works; multi-GPU fails. Two cards run at full load, the other two never start, and then it reports OOM.

hgcdanniel commented 6 months ago

A single GPU works; multi-GPU (4 cards in total) fails. Two cards run at full load, the other two never start, and then it reports OOM.
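(Editor's note: two cards fully loaded while two never start, followed by OOM, is the pattern one would see if several ranks land on the same GPU. Below is a small diagnostic sketch, assuming the same 4-process `accelerate launch` setup as in the issue; the file name `diagnose_ranks.py` is made up. It prints which device each rank binds and how much memory is free on it.)

```python
# diagnose_ranks.py -- hypothetical helper, launched the same way as train_bash.py:
#   accelerate launch --config_file ../accelerate/single_config.yaml diagnose_ranks.py
# Prints the device and free/total memory seen by every rank, so you can tell
# whether all four L40S cards are actually used or whether ranks double up on a GPU.
import os
import torch
import torch.distributed as dist


def main():
    # LOCAL_RANK is set by the torchrun-based launcher that accelerate uses.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(
        f"rank={dist.get_rank()} local_rank={local_rank} "
        f"device={torch.cuda.current_device()} ({torch.cuda.get_device_name()}) "
        f"free={free / 1e9:.1f} GB / total={total / 1e9:.1f} GB"
    )

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If two ranks report the same device index, the thing to fix is the launch configuration (`gpu_ids`, `CUDA_VISIBLE_DEVICES`, `num_processes`) rather than the LoRA settings.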

ChenBinfighting1 commented 5 months ago

> A single GPU works; multi-GPU (4 cards in total) fails. Two cards run at full load, the other two never start, and then it reports OOM.

Same question here. Was this problem ever resolved? Thanks!

xxrjun commented 3 months ago

> A single GPU works; multi-GPU (4 cards in total) fails. Two cards run at full load, the other two never start, and then it reports OOM.

Same question 🙏