Reminder
[X] I have read the README and searched the existing issues.
Reproduction
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Converting format of dataset (num_proc=4): 100%|███████████████████████████████████| 3000/3000 [00:00<00:00, 23927.57 examples/s]
/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
table = cls._concat_blocks(blocks, axis=0)
04/07/2024 20:36:05 - INFO - llmtuner.data.loader - Loading dataset glaive_toolcall_10k.json...
04/07/2024 20:36:05 - WARNING - llmtuner.data.utils - Checksum failed: mismatched SHA-1 hash value at ../../data/glaive_toolcall_10k.json.
Converting format of dataset (num_proc=4): 100%|███████████████████████████████████| 3000/3000 [00:00<00:00, 20282.00 examples/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3148977 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3148978 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3148979 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 3148980) of binary: /home/dandan.song/anaconda3/envs/llama_factory_stable/bin/python
Traceback (most recent call last):
File "/home/dandan.song/anaconda3/envs/llama_factory_stable/bin/accelerate", line 8, in
sys.exit(main())
File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1014, in launch_command
multi_gpu_launcher(args)
File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
distrib_run.run(args)
File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dandan.song/anaconda3/envs/llama_factory_stable/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
../../src/train_bash.py FAILED
Failures:
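For reference, a small diagnostic sketch I use on the same node before `accelerate launch` (not part of LLaMA-Factory; the helper name and device loop are my own) to confirm how much memory each GPU actually has free and to make CUDA errors synchronous via CUDA_LAUNCH_BLOCKING=1, as the message above suggests:

```python
# check_gpus.py -- hypothetical helper, not part of LLaMA-Factory.
# Prints free/total memory for every visible GPU so it is clear which
# device is already (partially) occupied before training starts.
import os

# Must be set before the first CUDA call in this process; makes kernel
# launches blocking so an OOM is raised at the call that triggers it.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch


def report_free_memory() -> None:
    if not torch.cuda.is_available():
        print("CUDA is not available in this environment.")
        return
    for idx in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(idx)  # values are in bytes
        print(
            f"cuda:{idx} {torch.cuda.get_device_name(idx)}: "
            f"{free / 2**30:.1f} GiB free / {total / 2**30:.1f} GiB total"
        )


if __name__ == "__main__":
    report_free_memory()
```

For the actual training run, exporting CUDA_LAUNCH_BLOCKING=1 in the shell before `accelerate launch` has the same effect as the os.environ line above, so the reported stack trace should point at the allocation that runs out of memory.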