llava-rlhf / LLaVA-RLHF

Aligning LMMs with Factually Augmented RLHF
https://llava-rlhf.github.io/
GNU General Public License v3.0

RuntimeError: The size of tensor a (577) must match the size of tensor b (257) at non-singleton dimension 1 #11

Closed: HarrySSH closed this issue 1 year ago

HarrySSH commented 1 year ago

Thanks for the previous explanation. I encountered another bug here.

File "/.../anaconda3/envs/rlhf/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 200, in forward
    embeddings = embeddings + self.position_embedding(self.position_ids)
RuntimeError: The size of tensor a (577) must match the size of tensor b (257) at non-singleton dimension 1

I used the standard dataset and model to train the RL pipeline, just to make sure I could get it running. What do you think could be causing this bug?

Many thanks

Edward-Sun commented 1 year ago

Hi,

The problem seems to be a mismatch between the CLIP model configuration and the checkpoint. Which script are you using?
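
One quick way to diagnose such a mismatch is to compare the vision tower recorded in the checkpoint's config.json with the one the script passes in. A minimal sketch; the path is a placeholder, and mm_vision_tower is assumed to be the key LLaVA-style configs use to record the vision tower:

```python
import json

# Placeholder path: point this at the policy / reward checkpoint directory.
with open("/path/to/checkpoint/config.json") as f:
    cfg = json.load(f)

# "mm_vision_tower" is assumed to be the key LLaVA-style configs use to
# record the CLIP vision tower; it should match the script's VISION_TOWER.
print(cfg.get("mm_vision_tower"))
```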

HarrySSH commented 1 year ago

Thanks for your help! I am using
train_rl_model.sh

Edward-Sun commented 1 year ago

Did you run the first two steps of RLHF, i.e., reward model training and RL model initialization?

HarrySSH commented 1 year ago

> Did you run the first two steps of RLHF, i.e., reward model training and RL model initialization?

Yes! That's why I was quite confused. :) The reward model is in the 13b folder, the model init is in the 7b folder, and the final train_rl_model.sh is in the 7b folder.

Edward-Sun commented 1 year ago

Note: For both 7b and 13b policy models, we use the same 13b reward model.

If you have done the 7b initialization, I will double-check the script.

HarrySSH commented 1 year ago

For your information, I tried applying the position embedding directly based on the dimension of the embedding features instead of the input_ids. Then I got this error:

embeddings = embeddings + position_ids
RuntimeError: The size of tensor a (1024) must match the size of tensor b (577) at non-singleton dimension 2

Not sure if this is useful information, though.
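
For reference, a shape-only sketch (assuming the ViT-L/14 @ 336px geometry: 577 positions, hidden size 1024) of how CLIP adds position embeddings, and why adding the raw position_ids fails to broadcast:

```python
# Shape-only sketch; the numbers assume ViT-L/14 @ 336px.
import torch

batch, num_positions, hidden = 1, 577, 1024
embeddings = torch.randn(batch, num_positions, hidden)          # CLS + patch embeddings
position_embedding = torch.nn.Embedding(num_positions, hidden)  # learned table
position_ids = torch.arange(num_positions).unsqueeze(0)         # shape (1, 577)

# Correct: look up the table, then add; the result broadcasts over the batch.
out = embeddings + position_embedding(position_ids)
print(out.shape)  # torch.Size([1, 577, 1024])

# Incorrect (what the hack above effectively does): adding the raw indices
# tries to broadcast (1, 577) against (1, 577, 1024) and fails at the last
# dimension, which is exactly the "(1024) must match (577)" error.
# embeddings + position_ids  # RuntimeError
```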

HarrySSH commented 1 year ago

> Note: For both 7b and 13b policy models, we use the same 13b reward model.
>
> If you have done the 7b initialization, I will double-check the script.

Yes. You made that very clear in the README. :) I think I followed that part.

Edward-Sun commented 1 year ago

> RuntimeError: The size of tensor a (577) must match the size of tensor b (257) at non-singleton dimension 1

I think you don't need to change the code. 577 = 24^2 + 1, and 257 = 16^2 + 1.

clip-vit-large-patch14-336 uses 336 / 14 = 24 patches along each dimension, while clip-vit-large-patch14 uses 224 / 14 = 16. So the problem is that the loaded checkpoint does not match the one specified in the configuration.
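
For reference, a minimal sketch that reproduces this arithmetic from the two published configs (assumes the transformers package is installed and the Hugging Face Hub is reachable):

```python
# Derive the position-embedding length for each CLIP variant from its config.
from transformers import CLIPVisionConfig

for name in ("openai/clip-vit-large-patch14", "openai/clip-vit-large-patch14-336"):
    cfg = CLIPVisionConfig.from_pretrained(name)
    num_patches = (cfg.image_size // cfg.patch_size) ** 2
    num_positions = num_patches + 1  # +1 for the class (CLS) token
    print(f"{name}: ({cfg.image_size} / {cfg.patch_size})^2 + 1 = {num_positions}")

# openai/clip-vit-large-patch14:     (224 / 14)^2 + 1 = 257
# openai/clip-vit-large-patch14-336: (336 / 14)^2 + 1 = 577
```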

HarrySSH commented 1 year ago

Ah, thank you so much!

HarrySSH commented 1 year ago

I found the issue. The 7b initialization script uses openai/clip-vit-large-patch14, but the reward model script in the 13b folder uses VISION_TOWER=openai/clip-vit-large-patch14-336.

I guess that's where the bug happened.

Edward-Sun commented 1 year ago

Hi, the use of different vision modules is by design. I will rerun the script and see whether I can reproduce the bug. Thank you for your patience.

HarrySSH commented 1 year ago

Ah! I am not saying what you wrote is incorrect; I just mean the two scripts are inconsistent with each other, and that's entirely me misinterpreting your code! I am retraining the reward model with the other CLIP model and will let you know if my issue is resolved.

Edward-Sun commented 1 year ago

Hi Harry,

I have checked the code, and the issue seems to be that in the cleaned script, the wrong base models were used for the policy and the reward model. The correct version should be:

POLICY_BASE_MODEL_NAME=LLaVA-RLHF-7b-v1.5-224/sft_model
RM_BASE_MODEL_NAME=LLaVA-RLHF-13b-v1.5-336/sft_model

The script is updated.

Also, one issue I found while re-running our public codebase is that transformers==4.31.0 should be used instead of transformers==4.33.0; otherwise it causes a length-mismatch issue due to an inconsistency in the tokenizer implementation between the two versions of the transformers package.
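
As a minimal sketch, a quick sanity check that the pinned version is active before relaunching the scripts:

```python
# Verify the pinned transformers version (the repo expects 4.31.0).
import transformers

assert transformers.__version__ == "4.31.0", (
    f"expected transformers==4.31.0, got {transformers.__version__}"
)
```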

Sorry for the inconvenience.