huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl

RewardTrainer fails with FSDP #1195

Closed mgerstgrasser closed 10 months ago

mgerstgrasser commented 10 months ago

I've just run into an odd issue with FSDP & RewardTrainer. It seems that when using FSDP, the output of the (sequence classification) model's forward function isn't as expected. Normally, it returns a SequenceClassifierOutputWithPast where logits contains a tensor with the logits, and loss is empty or contains some sort of generator object. When using FSDP, I'm getting a dict inside the loss field (and oddly enough that dict again contains a single key logits, although that's not the issue).

Not sure why this happens, but the net effect is that when the RewardTrainer tries to get the logits through model(...)[0] (see here), in the non-FSDP case it gets the logits, while in the FSDP case it gets the dict from the now non-empty loss field, and then fails a few lines later.
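
For context on why the positional index changes meaning: transformers' ModelOutput classes only expose the fields that are actually set when indexed by position, so out[0] is the first non-None field, not necessarily logits. A minimal sketch illustrating that (the loss value below just mimics the FSDP output pasted further down, it isn't what FSDP literally produces internally):

```python
import torch
from transformers.modeling_outputs import SequenceClassifierOutputWithPast

logits = torch.randn(2, 1)

# With loss unset, positional indexing skips it, so [0] is the logits.
out = SequenceClassifierOutputWithPast(logits=logits)
print(out[0] is out.logits)         # True

# Once something populates loss (as seems to happen under FSDP here), it
# becomes the first set field and [0] no longer points at the logits.
out = SequenceClassifierOutputWithPast(loss={"logits": logits}, logits=logits)
print(out[0] is out.logits)         # False: out[0] is the loss dict
print(out["logits"] is out.logits)  # True: key-based access is unaffected
```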

Two questions:

  1. This is easily fixed by doing model(...)["logits"] instead (see the sketch after this list). Any problem with doing that?
  2. Purely out of curiosity, does anyone know why this behaves differently with FSDP?
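
Roughly what I have in mind for the fix, sketched against a compute_loss-style method (a sketch only; the signature and input keys here are my assumptions for illustration, not TRL's exact code):

```python
import torch
import torch.nn.functional as F

def compute_loss(model, inputs):
    # Index the output by key instead of position, so we get the logits
    # tensor whether or not the (FSDP-wrapped) forward also fills in `loss`.
    rewards_chosen = model(
        input_ids=inputs["input_ids_chosen"],
        attention_mask=inputs["attention_mask_chosen"],
    )["logits"]
    rewards_rejected = model(
        input_ids=inputs["input_ids_rejected"],
        attention_mask=inputs["attention_mask_rejected"],
    )["logits"]
    # Usual pairwise reward-model objective: chosen should score above rejected.
    loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
    return loss
```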

To reproduce: Run examples/scripts/reward_modeling.py with accelerate + FSDP.

forward output in a single process:

SequenceClassifierOutputWithPast(loss=<generator object gather.<locals>.gather_map.<locals>.<genexpr> at 0x15360f993040>, logits=tensor([[...]], device='cuda:0', grad_fn=<GatherBackward>), past_key_values=None, hidden_states=None, attentions=None)

And in FSDP:

SequenceClassifierOutputWithPast(loss={'logits': tensor([[...]], device='cuda:1', grad_fn=<ToCopyBackward0>)}, logits=tensor([[...]], device='cuda:1', grad_fn=<ToCopyBackward0>), past_key_values=None, hidden_states=None, attentions=None)
younesbelkada commented 10 months ago

thanks for the deep dive, I left a suggestion on the PR, lmk what you think