microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Actor loss nan and Resizing model embedding #922

Open ouyanmei opened 2 months ago

ouyanmei commented 2 months ago

The model I use is GPT-2 124M. When I resize the model embeddings during SFT and reward-model (RW) training, I often get generated answers that consist entirely of zeros, which makes both the log probabilities and the actor loss become NaN (Not a Number). I have noticed that resizing the embeddings can lead to generated token IDs that exceed the vocabulary size, and I suspect this contributes to the problem. However, when I train SFT and RW without resizing the model's embeddings, the issue does not occur during RLHF training. I don't know why. A minimal way to check for the out-of-range IDs is sketched below.
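Here is a rough reproduction sketch, assuming the standard Hugging Face `transformers` API; the prompt, the resize amount (`+ 8`), and the generation settings are arbitrary choices for illustration, not the original training setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Resize the embedding matrix beyond the tokenizer's 50257 entries
# (as happens when extra/pad tokens are added during SFT).
model.resize_token_embeddings(len(tokenizer) + 8)  # +8 is an arbitrary example

inputs = tokenizer("Human: hello\nAssistant:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=True)

# Any generated ID >= tokenizer.vocab_size (50257) has no real token behind it.
bad = out[out >= tokenizer.vocab_size]
print(f"out-of-vocab ids generated: {bad.tolist()}")
```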

ouyanmei commented 2 months ago

Because the vocabulary size of GPT-2 124M is 50257, resizing the model's embedding layer adds rows beyond the original vocabulary range, so the model can generate token IDs above 50256. During Reinforcement Learning from Human Feedback (RLHF), the log probabilities of these out-of-range tokens can be extremely small, appearing as outliers. This can cause numerical instability during training and potentially produce NaN values. Clipping the log probabilities may help mitigate the issue (I have run some experiments, but this has not yet been widely validated).
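A minimal sketch of that clipping idea (not the DeepSpeed-Chat code): clamp the gathered per-token log probabilities so that near `-inf` values from out-of-range tokens cannot blow up the actor loss. The floor value `LOGPROB_FLOOR` is an assumed hyperparameter, not a validated one:

```python
import torch

LOGPROB_FLOOR = -20.0  # assumed floor; tune per experiment

def gather_clipped_log_probs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq, vocab]; labels: [batch, seq] token ids."""
    log_probs = torch.log_softmax(logits, dim=-1)
    per_token = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Clip extreme outliers: tokens hitting untrained embedding rows can
    # otherwise yield log probs small enough to turn the loss into NaN.
    return per_token.clamp(min=LOGPROB_FLOOR)
```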