Reminder
[X] I have read the README and searched the existing issues.
System Info
latest llamafactory version
Reproduction
I'm using the latest llamafactory version to run SFT (QLoRA + FSDP) for Llama 3.1 70B on 8xA100. Training works with the default optimizer, but with 8-bit Adam I get the following error:
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
rank0: return inner_training_loop(
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/trainer.py", line 2341, in _inner_training_loop
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/accelerate/optimizer.py", line 172, in step
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
rank0: return func.__get__(opt, opt.__class__)(*args, **kwargs)
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
rank0: out = func(*args, **kwargs)
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 287, in step
rank0: self.update_step(group, p, gindex, pindex)
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 546, in update_step
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1774, in optimizer_update_8bit_blockwise
rank0: prev_device = pre_call(g.device)
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/bitsandbytes/functional.py", line 463, in pre_call
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/cuda/__init__.py", line 418, in set_device
rank0: device = _get_device_index(device)
rank0: File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/cuda/_utils.py", line 34, in _get_device_index
rank0: raise ValueError(f"Expected a cuda device, but got: {device}")
rank0: ValueError: Expected a cuda device, but got: cpu
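The last frames show bitsandbytes calling `pre_call(g.device)` with a gradient that lives on the CPU, while the 8-bit blockwise update expects a CUDA tensor. To narrow down which tensors are affected, a small diagnostic like the one below could be run just before the optimizer step. This is a rough sketch, not part of llamafactory or bitsandbytes: the helper name `find_non_cuda_tensors` is hypothetical, and it only assumes objects that expose `.device.type`, `.grad`, and `.requires_grad` the way PyTorch parameters do. Whether the CPU placement comes from FSDP CPU offloading or something else is not established by the traceback.

```python
from typing import Any, Iterable, List, Tuple


def find_non_cuda_tensors(
    named_params: Iterable[Tuple[str, Any]],
) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: list trainable params or grads that are not on CUDA.

    Returns (name, kind, device_type) tuples, where kind is "param" or "grad".
    """
    offenders = []
    for name, param in named_params:
        # Frozen weights never reach the optimizer update, so skip them.
        if not getattr(param, "requires_grad", True):
            continue
        if param.device.type != "cuda":
            offenders.append((name, "param", param.device.type))
        grad = getattr(param, "grad", None)
        # The traceback fails on the gradient's device, so check it separately.
        if grad is not None and grad.device.type != "cuda":
            offenders.append((name, "grad", grad.device.type))
    return offenders
```

With a PyTorch model this would be called as `find_non_cuda_tensors(model.named_parameters())`; a non-empty result on rank 0 would confirm that the optimizer is being handed CPU tensors.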
Expected behavior
No response
Others
No response