microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Actor loss nan and Resizing model embedding #922

Open ouyanmei opened 2 months ago

ouyanmei commented 2 months ago

The model I use is GPT-2 124M. When I resize the model embeddings during SFT and reward-model (RW) training, I often get generated answers that consist entirely of zeros, which makes both the log probabilities and the actor loss become NaN (Not a Number). I have noticed that resizing the embeddings can lead to generated token IDs that exceed the vocabulary size, and I suspect this contributes to the problem. However, when I train SFT and RW without resizing the model's embeddings, the issue does not occur during RLHF training. I don't know why. A minimal way to check for the out-of-range IDs is sketched below.
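Here is a rough reproduction sketch, assuming the standard Hugging Face `transformers` API; the prompt, the resize amount (`+ 8`), and the generation settings are arbitrary choices for illustration, not the original training setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Resize the embedding matrix beyond the tokenizer's 50257 entries
# (as happens when extra/pad tokens are added during SFT).
model.resize_token_embeddings(len(tokenizer) + 8)  # +8 is an arbitrary example

inputs = tokenizer("Human: hello\nAssistant:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=True)

# Any generated ID >= tokenizer.vocab_size (50257) has no real token behind it.
bad = out[out >= tokenizer.vocab_size]
print(f"out-of-vocab ids generated: {bad.tolist()}")
```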

ouyanmei commented 2 months ago

Because the vocabulary size of GPT-2 124M is 50257, resizing the model's embedding layer adds rows beyond the original vocabulary range, so the model can generate token IDs above 50256. During Reinforcement Learning from Human Feedback (RLHF), the log probabilities of these out-of-range tokens can be extremely small, appearing as outliers. This can cause numerical instability during training and potentially produce NaN values. Clipping the log probabilities may help mitigate the issue (I have run some experiments, but this has not yet been widely validated).
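A minimal sketch of that clipping idea (not the DeepSpeed-Chat code): clamp the gathered per-token log probabilities so that near `-inf` values from out-of-range tokens cannot blow up the actor loss. The floor value `LOGPROB_FLOOR` is an assumed hyperparameter, not a validated one:

```python
import torch

LOGPROB_FLOOR = -20.0  # assumed floor; tune per experiment

def gather_clipped_log_probs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq, vocab]; labels: [batch, seq] token ids."""
    log_probs = torch.log_softmax(logits, dim=-1)
    per_token = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Clip extreme outliers: tokens hitting untrained embedding rows can
    # otherwise yield log probs small enough to turn the loss into NaN.
    return per_token.clamp(min=LOGPROB_FLOOR)
```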