Hi @zhyang2226 ,
The left padding in RLHF is needed for batched online sampling from the model. In theory, since the RoPE embedding is position-agnostic, the padding side should not affect the model's performance. In practice, we do observe slightly different results when running the model with right padding, perhaps due to numerical issues, but they are not qualitatively different.
You may refer to this issue https://github.com/llava-rlhf/LLaVA-RLHF/issues/29 for more details.
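To make the batched-sampling point concrete, here is a minimal sketch (not the LLaVA-RLHF code; the checkpoint name and prompts are placeholders) of left-padded batched generation with Hugging Face transformers. With left padding, every prompt ends at the last position of the batch, so `generate()` appends new tokens directly after each prompt:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # pad on the left so all prompts end at the same position

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["Describe the image.", "What objects are on the table?"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# Because the rows are right-aligned, generation continues from real prompt
# tokens; with right padding, new tokens would be appended after the pad tokens.
outputs = model.generate(**batch, max_new_tokens=64, do_sample=True)
print(tokenizer.batch_decode(
    outputs[:, batch["input_ids"].shape[1]:], skip_special_tokens=True
))
```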
The "right" padding in SFT and RL initialization is because it doesn't matter whether to use "left" or "right" padding. So we just follow the common practice such as in LLaVA to use the "left" padding.
The "left" padding in RL and reward modeling is because RL needs to collect policy rollouts, while reward modeling needs to do a regression on the last token, both of which cases require the model to aligned on the right side (new token side).
Best, Zhiqing
Dear Authors,
Firstly, I would like to extend my gratitude for open-sourcing your work and congratulate you on your novel and inspiring contributions to the field of LMMs.
I have a question regarding the training settings of your proposed LLaVA-RLHF framework. I noticed that during the SFT process, the classical setting is followed, using right-padding for prompts in batched inputs, as is typical for LLaMA models. However, during the RLHF process, there is a shift to left-padding for prompts. Could you please clarify the motivation behind this change in padding strategy? I am concerned that left-padding might lead to suboptimal evaluation results for the SFT model, which is trained with right-padding.
Thank you for your time and consideration. I look forward to your response and appreciate your efforts in advancing the research community.