unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Loss drops and stays at zero #712

Closed · BoboMuller closed this issue 2 months ago

BoboMuller commented 3 months ago

Currently I am working on a medium-sized project where I need to fine-tune several models. Right now I am training DeepSeekCoder 6.7b, and while training I noticed a strange behavior: the training loss drops straight to 0 and stays there, no matter how long training is continued.

The dataset used contains 10k samples. Each sample results in exactly 1024 tokens, without the need for padding. The dataset was deduplicated using MinHash with a threshold of 0.7.

I use rsLoRA on all linear layers except the head, because the goal is to improve performance on another, lesser-known programming language. The settings (sketched below):

- Quantization: 4-bit
- Learning rate: 2e-5
- LR scheduler: tried constant and cosine
- Rank: 126 with an alpha of 16 (more would not fit into VRAM)
- Batch size: tried 16 and 32
- LoRA dropout: 0.1
- Weight decay: 0.01
- Max grad norm: 0.3 (which might be overblown, now that I think about it)
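Roughly, my configuration looks like the following simplified sketch (not the exact script; the model id is assumed and the argument names follow unsloth's `FastLanguageModel` API):

```python
# Simplified sketch of the setup described above (not the exact script);
# the model id is an assumption, argument names follow unsloth's FastLanguageModel API.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="deepseek-ai/deepseek-coder-6.7b-base",  # assumed HF id for DeepSeekCoder 6.7b
    max_seq_length=1024,   # each sample is exactly 1024 tokens
    load_in_4bit=True,     # 4-bit quantization
)

model = FastLanguageModel.get_peft_model(
    model,
    r=126,                 # rank used in the experiments
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # all linear layers, no head
    use_rslora=True,       # rank-stabilized LoRA
)
```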

When using a batch size of 16, it takes around 114 steps to go from a loss of 1.2 to 0.8, and then it drops to 0.0. When using a batch size of 32, it takes around 57 steps to go from a loss of 1.2 to 1.0, and then it drops to 0.0 again.

In another experiment I lowered the rank to 64; the first 200 steps did not produce a loss of 0.0, but with that approach the loss no longer improved significantly.

Might this be a problem with the loss calculation of the library, or am I using it wrong?

danielhanchen commented 3 months ago

Try adding a validation dataset to see if the val loss is actually decreasing - a loss of 0 could mean NaN gradients (try decreasing the learning rate to 2e-6, e.g.), or it could mean overfitting.
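Something along these lines (a minimal sketch assuming a TRL `SFTTrainer` setup; the split size, step counts, and exact keyword names depend on your script and TRL version):

```python
# Minimal sketch: hold out a validation split and lower the learning rate.
# Assumes `model`, `tokenizer`, and `dataset` from the existing setup;
# the 5% split and eval_steps=50 are placeholder values.
from trl import SFTTrainer
from transformers import TrainingArguments

split = dataset["train"].train_test_split(test_size=0.05, seed=42)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=split["train"],
    eval_dataset=split["test"],           # validation set so eval_loss is reported
    dataset_text_field="content",
    max_seq_length=1024,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=16,
        learning_rate=2e-6,               # lowered from 2e-5
        evaluation_strategy="steps",      # evaluate periodically during training
        eval_steps=50,
        logging_steps=1,
    ),
)
trainer.train()
```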

BoboMuller commented 3 months ago

Hi, thanks for your response. I ran several experiments and indeed, with a learning rate of 2e-5, at some point the grad_norm metric increases very quickly. With a lower learning rate of 4e-6 (and the default max_grad_norm) I encounter the same behaviour, but now the situation is even stranger.

The actual values look like this. At step 268 the train loss spikes down and the grad_norm suddenly becomes NaN.

| Step | Loss | grad_norm |
|------|------|-----------|
| 263 | 0.9913 | 0.160921 |
| 264 | (evaluation step, loss: 1.015229) | |
| 265 | 1.0015 | 0.146931 |
| 266 | 1.0656 | 0.186978 |
| 267 | 0.9506 | 0.152463 |
| 268 | 0.8927 | NaN |
| 269 | (evaluation step, loss: NaN) | |
| 270 | 0.000 | NaN |

danielhanchen commented 2 months ago

Oh my, if the eval loss goes to NaN, it means training diverged. Set `max_grad_norm = 0.3` in the training args.
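That is, something like this (a sketch; the other arguments stay as in the existing setup):

```python
# Sketch: clip gradients via max_grad_norm in the Hugging Face TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    learning_rate=4e-6,
    max_grad_norm=0.3,   # clip the gradient norm to 0.3 to tame divergence
    # ... other arguments unchanged
)
```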

BoboMuller commented 2 months ago

I was under the assumption that the eval_loss goes to NaN because the grad_norm went to NaN in the step before. I will give it a try, after first trying to increase the weight_decay, since that might be another cause of divergence. Thank you for your response.

BoboMuller commented 2 months ago

Somehow this situation got even stranger. I noticed that there is a special token which I do not want to train on, so I remove it using this code:

```python
def remove_name_tags(example):
    # replace the unwanted special token (the token string is not shown here)
    example["content"] = example["content"].replace("", "")
    return example

data["train"] = data["train"].map(remove_name_tags)
```

(This train split is later divided into train and test.)

Now the gradient often reaches NaN after the first step, and if it does not, the first validation is guaranteed to be NaN. I reduced the model complexity further and lowered the learning rate, to no avail.

BoboMuller commented 2 months ago

In the past two weeks I haven't had much time to work on this issue. Yesterday, however, I noticed that I had forgotten to set the fp16 variable to True. The reason for the training instability therefore can't be more obvious. Thank you for this awesome project and your help.
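The fix amounts to enabling mixed precision in the training arguments, roughly like this (a sketch; `is_bfloat16_supported` is the helper used in unsloth's example notebooks):

```python
# Sketch of enabling mixed precision in TrainingArguments (the missing piece).
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported  # helper from unsloth's example notebooks

args = TrainingArguments(
    output_dir="outputs",
    fp16=not is_bfloat16_supported(),  # fp16 on GPUs without bf16 support
    bf16=is_bfloat16_supported(),      # prefer bf16 where the hardware supports it
    # ... other arguments unchanged
)
```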

danielhanchen commented 2 months ago

Oh great you solved it!