Closed: K-Mistele closed this issue 3 months ago.
Hi @K-Mistele! This is actually a known issue that we recently debugged, and it is not specific to Ludwig! The best way to solve it is to set `bnb_4bit_compute_dtype` in the quantization section of the Ludwig config to `bfloat16` instead of `float16`, since batch sizes > 1 with Mistral in particular lead to numeric overflows during training, resulting in a NaN loss on the first backward pass of the training loop.
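For reference, here is a minimal sketch of what that might look like in the Ludwig config (the base model and surrounding fields are illustrative placeholders, not taken from your setup; only the quantization block is the actual change):

```yaml
# Sketch of the relevant portion of a Ludwig LLM fine-tuning config.
model_type: llm
base_model: mistralai/Mistral-7B-v0.1  # placeholder base model

quantization:
  bits: 4
  bnb_4bit_compute_dtype: bfloat16     # the fix: bfloat16 instead of float16

adapter:
  type: lora

trainer:
  type: finetune
  batch_size: auto
```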
However, I notice you're training on a V100, and I don't think bfloat16 is supported there, since it only works on Ampere architectures and above. Is there any chance you can use a newer NVIDIA GPU?
The only NVIDIA GPU that supports bfloat16 is the A100, which I do not have access to. My V100 is an owned GPU, not a rented/cloud one, so I try to stick with it whenever possible since I'm not paying by the hour.
@K-Mistele that makes sense! Actually, the entire A series uses Ampere, so you could consider an A5000 from AWS, which is pretty cheap. I might also suggest giving the Predibase free trial a try, since we have A5000s/A6000s, etc. (A10Gs) for fine-tuning, and we offer $25 in free trial credits!
I am planning to; I just want to make sure I can use the tool locally first. Is there no workaround for a V100?
Unfortunately, not to my knowledge with Mistral. Do you want to test Llama-2-7B instead? The issue doesn't show up there with larger batch sizes!
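If you do try it, the only change should be the base model in your config, something like the sketch below (the exact Hugging Face identifier is an assumption on my part; adjust it to the checkpoint you want to use):

```yaml
# Swap the base model to Llama-2-7B; everything else in the config can stay the same.
base_model: meta-llama/Llama-2-7b-hf  # assumed Hugging Face model identifier
```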
yeah I can try it
@K-Mistele let me know how it goes!
Do you know if Zephyr has the same problem, @arnavgarg1?
@K-Mistele not to my knowledge!
@K-Mistele Did the fix work?
Closing since a workaround has been suggested. In summary: `bnb_4bit_compute_dtype` must be set to `bfloat16`; however, that requires the Ampere NVIDIA GPU architecture (e.g., A100) or newer.
Describe the bug

When I set a `trainer.batch_size` of > 1 or `auto`, my loss value is always `NaN`, and training fails and exits at the end of the first epoch. Setting `batch_size` to 1 fixes the issue, but results in very inefficient GPU utilization on more powerful GPUs.

To Reproduce

Steps to reproduce the behavior:
Do LoRA training with a `trainer.batch_size` of `auto` or > 1, for example with the config sketched below.
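This is a minimal sketch of the kind of config that triggers the issue for me (feature names and base model are placeholders rather than my exact setup):

```yaml
# Minimal Ludwig LLM LoRA fine-tuning config that reproduces the NaN loss.
# With 4-bit quantization left at its default compute dtype (float16) and
# batch_size set to auto (or any value > 1), the loss becomes NaN.
model_type: llm
base_model: mistralai/Mistral-7B-v0.1  # placeholder

input_features:
  - name: prompt                       # placeholder feature name
    type: text

output_features:
  - name: response                     # placeholder feature name
    type: text

adapter:
  type: lora

quantization:
  bits: 4                              # compute dtype left at the default (float16)

trainer:
  type: finetune
  batch_size: auto                     # or any value > 1
```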
Expected behavior

I would expect a non-`NaN` loss value.

Screenshots
Environment (please complete the following information):
Additional context

GPU: 1x Tesla V100 32GB