nebuly-ai / optimate

A collection of libraries to optimise AI model performances
https://www.nebuly.com/
Apache License 2.0

[ChatLLaMA] RLHF Training: dimension mismatch #312

Open · BigRoddy opened this issue 1 year ago

BigRoddy commented 1 year ago

I am getting the following error when doing RLHF training:

Traceback (most recent call last):
  File "/code/main.py", in rlhf_trainer.train()
  File "/code/trainer.py", in train
    self.learn(memories)
  File "/code/trainer.py", in learn
    surr1 = advantages * ratios
RuntimeError: The size of tensor a (29) must match the size of tensor b (38) at non-singleton dimension 1

And I printed the shapes of some of the tensors:

rewards shape:               torch.Size([1, 29])
old_values shape:            torch.Size([1, 29])
actions_logits shape:        torch.Size([1, 38, 50272])
old_actions_log_probs shape: torch.Size([1, 38])
ratios shape:                torch.Size([1, 38])
advantages shape:            torch.Size([1, 29])

Could this be because my actor and critic use models from different families (opt-125m and gpt2)?
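For context, the same text generally encodes to a different number of tokens under the OPT and GPT-2 tokenizers (OPT also prepends a BOS token), which is consistent with the 29 vs 38 sizes above. A quick check, assuming the Hugging Face hub IDs facebook/opt-125m and gpt2:

from transformers import AutoTokenizer

# Hub IDs assumed from the models mentioned above.
actor_tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
critic_tok = AutoTokenizer.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy dog."
# The token counts usually differ: OPT prepends a BOS token and the two
# vocabularies are not identical, so the same actions end up with
# different lengths under each tokenizer.
print(len(actor_tok(text)["input_ids"]))
print(len(critic_tok(text)["input_ids"]))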

BigRoddy commented 1 year ago

https://github.com/nebuly-ai/nebullvm/blob/f8796c25aa6b5428a16c4929cdcfe7ea9f5b3f27/apps/accelerate/chatllama/chatllama/rlhf/trainer.py#L323 https://github.com/nebuly-ai/nebullvm/blob/f8796c25aa6b5428a16c4929cdcfe7ea9f5b3f27/apps/accelerate/chatllama/chatllama/rlhf/trainer.py#L348

It looks like these two pieces of code are not computing the same thing. The intent is to compute the length of the actions under the actor and critic tokenizers respectively. The actor side does compute the length of the actions and assigns it to action_len_actor, but the critic side computes the length of the input states_critic and assigns it to action_len_critic, which leads to a mismatch between the dimensions of the logits and the values.
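To see why this breaks, the PPO surrogate multiplies the per-token advantages (computed from the critic's values) by the per-token probability ratios (computed from the actor's logits), so the two tensors must share the action-length dimension. A minimal reproduction in plain PyTorch with the shapes reported above:

import torch

# advantages comes from the critic side (29 action tokens),
# ratios comes from the actor side (38 action tokens).
advantages = torch.randn(1, 29)
ratios = torch.randn(1, 38)

try:
    surr1 = advantages * ratios
except RuntimeError as err:
    # The size of tensor a (29) must match the size of tensor b (38)
    # at non-singleton dimension 1
    print(err)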

PierpaoloSorbellini commented 1 year ago

Hi @BigRoddy, thanks for reaching out! Yes, I think this could be a bug in the code; we haven't fully tested training with different tokenizers, even though we have started to implement it. Thanks for the issue and the detailed analysis. If you want to contribute to the repo, feel free to open a PR with the fix!

PierpaoloSorbellini commented 1 year ago

Hi @BigRoddy, this should fix your issue if I understand correctly. You can find the full source code in PR #306:

if self.use_same_tokenizer:
    sequences_critic = sequences_actor
    sequences_mask_critic = sequences_mask_actor
    action_len_critic = action_len_actor
else:
    encoded_critic = change_tokenization(
        sequences_actor,
        self.actor.tokenizer,
        self.critic.tokenizer,
    )
    # split encoded_critic into tokens and masks
    sequences_critic = encoded_critic["input_ids"].to(
        sequences_actor.device,
    )
    sequences_mask_critic = (
        encoded_critic["attention_mask"]
        .to(sequences_actor.device)
        .long()
        .detach()
    )

    # compute the length of the actions under the critic tokenizer
    # (states_critic is computed earlier in the PR and is not shown in this snippet)
    action_len_critic = states_critic.shape[1]
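For reference, a helper like change_tokenization presumably just decodes the sequences with the actor tokenizer and re-encodes them with the critic tokenizer; the real implementation is in PR #306, so the sketch below is only an assumption about its behaviour:

import torch
from transformers import PreTrainedTokenizerBase

def change_tokenization(
    sequences_actor: torch.Tensor,
    actor_tokenizer: PreTrainedTokenizerBase,
    critic_tokenizer: PreTrainedTokenizerBase,
):
    # Decode the actor token ids back to text, dropping special tokens.
    texts = actor_tokenizer.batch_decode(sequences_actor, skip_special_tokens=True)
    # GPT-2 has no pad token by default; reuse EOS so the batch can be padded.
    if critic_tokenizer.pad_token is None:
        critic_tokenizer.pad_token = critic_tokenizer.eos_token
    # Re-encode with the critic tokenizer; the returned batch contains
    # "input_ids" and "attention_mask", as used in the snippet above.
    return critic_tokenizer(texts, return_tensors="pt", padding=True)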