Can you share more details including the error message?
I ran the script on a single node using the following command. The training loss does not decrease, and the evaluated gradient norm is zero.
lightning run model \
--node-rank=0 \
--accelerator=cuda \
--devices=1 \
--num-nodes=1 \
pretrain/tinyllama.py --devices 1 --train_data_dir data/slim_star --val_data_dir data/slim_star
Gradient norm:
if not is_accumulating:
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip)
    # Print the gradient norm
    print("Gradient norm: ", total_norm)
    fabric.clip_gradients(model, optimizer, max_norm=grad_clip)
lit-gpt also has this issue: https://github.com/Lightning-AI/lit-gpt/issues/689. It is probably an issue with Lightning Fabric, or some toggle needs to be enabled. (Or should I even call it an issue, since Fabric is designed for multi-GPU?)
Update: @ChaosCodes managed to reproduce the error on a single GPU.
with fabric.no_backward_sync(model, enabled=is_accumulating):
    logits = model(input_ids)
The logits produced by this line are always 0.
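One quick way to narrow this down is to check whether the model parameters were ever materialized with real values; if they are all zero, the logits will be zero as well. A minimal sanity check (assuming the model variable from pretrain/tinyllama.py after fabric.setup) might look like:

    import torch

    # If the weights were never initialized, the logits will also be zero,
    # so count how many parameter tensors are entirely zero.
    with torch.no_grad():
        zero_params = [name for name, p in model.named_parameters()
                       if torch.count_nonzero(p) == 0]

    print(f"{len(zero_params)} parameter tensors are all zeros")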
Hi, if you want to train with a single GPU, you may need to set empty_init=False
here. But I have no idea yet why it leads to the gradient problem.
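For context, here is a minimal sketch of what that change looks like with Lightning Fabric; the GPT/Config classes follow lit-gpt, and the exact config name and call site are assumptions (the real one is in pretrain/tinyllama.py):

    import lightning as L
    from lit_gpt.model import GPT, Config  # model classes used by the pretrain script

    fabric = L.Fabric(accelerator="cuda", devices=1)
    fabric.launch()

    config = Config.from_name("tiny_LLaMA_1b")  # name is illustrative; check lit_gpt/config.py

    # empty_init=True skips materializing the weights and relies on a later load
    # or the distributed strategy to fill them; on a single device nothing fills
    # them, so pass False so the usual random initialization actually runs.
    with fabric.init_module(empty_init=False):
        model = GPT(config)

    model = fabric.setup(model)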
This should fix the problem for now https://github.com/jzhang38/TinyLlama/commit/782f1824dd6dae05adb4dcf1d784259006a9b1f4
First, thanks for all the effort you have put into the TinyLlama project - it's awesome!
Recently I ran into a strange problem: when training on a single GPU, the gradients vanish. Is this expected? Is training the model on a single GPU currently not supported?
Thanks!