LlavaNextVideo always assumes left padding when batch size is 1

System Info

Unrelated to this issue

Who can help?

@zucchini-nlp @Narsil

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

At the moment I don't have a concise script for reproducing, but I think the problem should be clear given the below description.

Basically I'm trying to finetune the new LlavaNextVideo model. The current implementation (see code below) will always assume left padding when the batch size is 1. This will cause a "index out of bound" error when training with batch size being 1, due to the unexpected left padding (instead of right padding) for training.

https://github.com/huggingface/transformers/blob/0fdea8607d7e01eb0e38a1ebeb7feee30a22f0cf/src/transformers/models/llava_next_video/modeling_llava_next_video.py#L565-L576

I can try to put up an example with concrete shape/dims to better show the behavior, but just on a high level I believe the desired behavior is that as long as we are in the training mode (determined by the _left_padding and _right_padding), we should set left_padding to False, regardless of whether batch size is 1.

Expected behavior

https://github.com/huggingface/transformers/blob/0fdea8607d7e01eb0e38a1ebeb7feee30a22f0cf/src/transformers/models/llava_next_video/modeling_llava_next_video.py#L565-L576

As long as we are in the training mode (determined by the _left_padding and _right_padding), we should set left_padding to False, regardless of whether batch size is 1.

huggingface / transformers