unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Gradient norm is zero when training Qwen2.5-0.5B-Instruct in unsloth=="2024.11.6" #1282

Open joe32140 opened 5 days ago

joe32140 commented 5 days ago

Hi,

I encountered an issue after updating to unsloth=="2024.11.6". When training the Qwen2.5-0.5B-Instruct model without PEFT, I observed that the model's gradient norm is 0, resulting in no weight updates.
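For context, a minimal sketch of the kind of setup being described. This is not the reporter's exact script: the toy dataset and training arguments are placeholders, it assumes "without PEFT" means simply not calling `FastLanguageModel.get_peft_model`, and the exact `trl` trainer arguments depend on the installed `trl` version.

```python
# Hypothetical repro sketch, not the reporter's exact script.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import Dataset

# Load the base model WITHOUT wrapping it in a LoRA/PEFT adapter,
# i.e. FastLanguageModel.get_peft_model is intentionally not called.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-0.5B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = False,   # full finetuning, so no 4-bit quantization
)

# Placeholder toy data; the original report used a real dataset.
dataset = Dataset.from_dict({"text": ["Hello, this is a training example."] * 64})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        max_seq_length = 2048,
        per_device_train_batch_size = 2,
        max_steps = 30,
        logging_steps = 1,          # grad_norm is printed once per logging step
        output_dir = "outputs",
    ),
)

# Reported symptom: grad_norm stays at 0.0, so the weights never update.
trainer.train()
```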

I also noticed a discrepancy in the number of trainable parameters.

This difference in trainable parameters might be related to the training issue.
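One quick way to compare the trainable-parameter count across unsloth versions is plain PyTorch, nothing unsloth-specific (assuming `model` is the loaded model from the sketch above):

```python
# Count trainable vs. total parameters; run under both unsloth versions and compare.
def count_params(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,}")

count_params(model)
```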

danielhanchen commented 4 days ago

Oh wait, without PEFT? Hmm, would it be possible for you to run training with `with torch.autograd.set_detect_anomaly(True): trainer.train()`?
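For reference, a minimal sketch of that suggestion, assuming `trainer` is the trainer instance from the setup above:

```python
import torch

# Enable autograd anomaly detection so any backward pass that produces NaN or
# otherwise invalid gradients raises an error pointing at the offending op.
# This slows training down noticeably, so use it only while debugging.
with torch.autograd.set_detect_anomaly(True):
    trainer.train()
```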