@Aillian Based on the screenshot, it looks to be stuck downloading the Llama model. Did the download ever complete?
Yes, based on the progress bar output from `AutoModelForCausalLM.from_pretrained`, the model download completed...
Update: when I train on a single GPU I get: "AttributeError: 'NoneType' object has no attribute 'backward'"
Maybe that would help in debugging the issue.
I am facing the same issue! Help please
I have encountered the same issue with no luck fixing it, any help would be appreciated.
It gets past the stuck point on a single GPU?
Same problem here. I tried `NCCL_P2P_DISABLE=1` as well, but nothing changed; debugging shows where it hangs. Any help would be appreciated. PS: I am using the Qwen project for 2-node, 4-GPU training. torch 1.13.1, tqdm 4.66.1, transformers 4.32.0, deepspeed 0.11.1, accelerate 0.24.0
I just started encountering this same issue. It only happens when I try to finetune a Llama2 derivative but not when I finetune Mistral or Zephyr. Hmmm.
```
Traceback (most recent call last):
  File "/home/matt/topics/finetune.py", line 23, in <module>
    trainer.train()
  File "/home/matt/lora/lora_trainer.py", line 115, in train
    self.trainer.train()
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 290, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/accelerate/utils/memory.py", line 136, in decorator
    return function(batch_size, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in training_step
    self.accelerator.backward(loss)
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/accelerate/accelerator.py", line 1980, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'backward'
```
Python 3.11.4
CUDA 12.1.1
torch 2.1.1
transformers 4.35.2
peft 0.6.2
trl 0.7.4
accelerate 0.24.1
deepspeed 0.12.3
A workaround appears to be disabling `auto_find_batch_size`. 🙂
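For anyone else hitting this, here is a minimal sketch of that workaround; the output directory, batch size, and gradient-accumulation values are placeholder assumptions rather than values from this thread, and `ds_config.json` stands for whatever DeepSpeed config you are already passing:

```python
# Hedged sketch: disable accelerate's automatic batch-size search (the retry
# path through accelerate/utils/memory.py is where the traceback above ends in
# the 'NoneType' backward error) and set an explicit batch size instead.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    auto_find_batch_size=False,        # the workaround: no batch-size finder
    per_device_train_batch_size=4,     # choose a size that fits in memory yourself
    gradient_accumulation_steps=8,     # placeholder
    deepspeed="ds_config.json",        # same DeepSpeed config as before
)
```

With the finder off, out-of-memory errors surface directly instead of triggering a retry, so the batch size may need some manual tuning.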
Could anyone provide an update or a solution to the issue at hand?
Any updates regarding this issue?
Very poor support for this repo, very disappointed...
Based on my experience, I suggest deleting the folder `.cache/torch_extensions/py310_cu116` and then reinstalling deepspeed.
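Roughly the same cleanup expressed in Python, in case it is useful; the cache path is the torch default (`~/.cache/torch_extensions`), the `py310_cu116` subfolder name depends on your Python/CUDA build, and `--force-reinstall` is just one way to redo the deepspeed install:

```python
# Hedged sketch of the cleanup above: wipe the JIT-built DeepSpeed op cache
# (cpu_adam and friends get rebuilt on the next run), then reinstall deepspeed.
# Adjust the path if TORCH_EXTENSIONS_DIR points somewhere else.
import shutil
import subprocess
import sys
from pathlib import Path

ext_cache = Path.home() / ".cache" / "torch_extensions"  # contains e.g. py310_cu116
shutil.rmtree(ext_cache, ignore_errors=True)

subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "--force-reinstall", "deepspeed"]
)
```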
Same issue here. It only works when I set ZeRO stage 0; ZeRO-2 and ZeRO-3 both fail.
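For context, the stage being switched here is the `zero_optimization.stage` field of the DeepSpeed config; a minimal sketch of that file, where everything except the stage value is an illustrative assumption:

```python
# Hedged sketch: only zero_optimization.stage is the point of comparison above.
# Stage 0 disables ZeRO partitioning; stage 2 partitions optimizer state and
# gradients; stage 3 also partitions parameters.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 0,   # reported to work; switching to 2 or 3 reproduces the failure
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```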
First, I deleted `~/.cache/torch_extensions/py310_cu116`. It worked for a moment and then hung at another place. Then I reinstalled deepspeed, and now it works.
Same here, my training got stuck after 4 steps, showing this info:
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
The bug: the script just hangs when it starts training the model, right after loading the cpu_adam op...
I have noticed that the same issue is happening to many people, and I have tried many solutions...
- #2176 suggests setting `NCCL_P2P_DISABLE=1`
- #3416 suggests `rm -rf /home/ga2530/.cache/torch_extensions/py310_cu116`
- #4285 suggests changing the `TORCH_EXTENSIONS_DIR` environment variable

but nothing works... (a sketch of how these variables are typically set is shown below)
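For completeness, a sketch of setting those variables before torch/deepspeed are imported; `NCCL_P2P_DISABLE` and `TORCH_EXTENSIONS_DIR` are standard environment variables, but the `/tmp/torch_ext` path is an assumption:

```python
# Hedged sketch: export the two variables from the suggestions above before
# anything imports torch/deepspeed, so NCCL peer-to-peer transport is disabled
# and JIT-built ops (cpu_adam etc.) are compiled into a fresh cache directory.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"                   # suggestion from #2176
os.environ["TORCH_EXTENSIONS_DIR"] = "/tmp/torch_ext"  # fresh cache, per #4285

import torch      # noqa: E402 -- imported only after the variables are set
import deepspeed  # noqa: E402
```

Exporting them in the shell before invoking the `deepspeed` or `accelerate` launcher should have the same effect for a single-node run.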
Here is the script output:

Here is the `deepspeed_config.yaml` file:

Here is the `ds_config.json` file:

Here is my code:

Here is my launcher:

OR

Expected behavior: the model starts to train.

`ds_report` output:

System info:

Launcher: both the `deepspeed` and `accelerate` launchers