huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Loading LLaVA-NeXT from AutoProcessor recently started causing errors in the forward function #34894

Open 7AtAri opened 1 day ago

7AtAri commented 1 day ago

System Info

Container built on 24th of October, including:

pip install tqdm
pip install torch
pip install torchvision
pip install transformers
pip install deepspeed==0.15.2
pip install accelerate
pip install wandb
pip install lightning
pip install optuna
pip install ray[tune]
pip install pyarrow
pip install nltk
pip install pandas
pip install numpy
pip install matplotlib
pip install scipy
pip install scikit-learn
pip install bitsandbytes
pip install peft
pip install pillow
pip install flash-attn --no-build-isolation

Who can help?

@zucchini-nlp @arthur

Information

Tasks

Reproduction

Using a batch size of 8 in a train_collate function, with the AutoProcessor as well as the LlavaNextProcessor:

processor = AutoProcessor.from_pretrained(cfg.MODEL_ID_LLAVA_NEXT) # this used to work 
# processor = LlavaNextProcessor.from_pretrained(cfg.MODEL_ID_LLAVA_NEXT)  # same result if I try this instead

def train_collate_fn(examples, processor):
  processor.tokenizer.padding_side = "right"  # during training pad on the right side
  # collect images, texts and true coordinates of the examples batch
  images = [example[0] for example in examples]
  #print("image", images[0])
  texts = [example[1][0]["value"] for example in examples]

  batch = processor(images=images,                # images to be processed
                    text=texts,                   # text to be tokenized
                    padding=True,                 # pad the sequences if not long enough
                    truncation=True,              # truncate sequences longer than max_length
                    max_length=cfg.MAX_LENGTH,    # set the maximum output token length
                    return_tensors="pt")          # return a pytorch tensor
  return batch["input_ids"], batch["attention_mask"], batch["pixel_values"], batch["image_sizes"]
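
For context, roughly how this collate function is hooked into the DataLoader (a minimal sketch; train_dataset is a placeholder name for the actual dataset of (image, conversation) examples):

from functools import partial
from torch.utils.data import DataLoader

# Sketch only: train_dataset stands in for the real dataset object;
# batch_size=8 matches the description above.
train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    collate_fn=partial(train_collate_fn, processor=processor),
)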

In the forward pass:

    outputs = self.base_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
        image_sizes=image_sizes,
        # vision_feature_layer=-1,
        # vision_feature_select_strategy="full",
        output_hidden_states=True,  # output the hidden states of all layers
        return_dict=True,           # return a dict-like output that can be accessed with the . operator
    )
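
For reference, a small sketch of how the dict-style output can then be read (assuming self.base_model is a LlavaNextForConditionalGeneration, as in the demo further down):

# Sketch: with return_dict=True the result is a model-output object whose fields
# can be accessed with the . operator; output_hidden_states=True adds hidden_states.
logits = outputs.logits                        # (batch, seq_len, vocab_size)
last_hidden_state = outputs.hidden_states[-1]  # hidden states of the final layer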

All of this worked two weeks ago, but now I get this error:

    return self.model.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llava_next/modeling_llava_next.py", line 873, in forward
    inputs_embeds, attention_mask, position_ids, labels, _ = self._merge_input_ids_with_image_features(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llava_next/modeling_llava_next.py", line 551, in _merge_input_ids_with_image_features
    raise ValueError(
ValueError: Number of image tokens in input_ids (2040) different from num_images (8).

Expected behavior

The same code did not throw this error before.

zucchini-nlp commented 4 hours ago

@7AtAri hey, which version of transformers are you using? Can you try updating to the latest v4.46 with pip install transformers==4.46? It works for me on the latest version.
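
A minimal sketch for double-checking which transformers version the running environment actually imports:

# Print the version of transformers that the current interpreter actually imports
import transformers
print(transformers.__version__)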

7AtAri commented 1 hour ago

With transformers==4.46 the same error persists.

zucchini-nlp commented 51 minutes ago

@7AtAri it works for me with transformers 4.46 using the demo code from the hub. If you are using a Jupyter notebook, make sure you restart the kernel and that the transformers version being imported is indeed v4.46. If the error persists, please share your env with transformers-cli env.

The code should never go into the _merge_input_ids_with_image_features path, tbh.

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="cuda:0") 

# prepare image and text prompt, using the appropriate prompt template
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image") 
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
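
As a rough sanity check (a sketch; the "<image>" placeholder string is an assumption based on the llava-hf checkpoints), one can compare how many image placeholder tokens the processor writes into input_ids with the number of images passed in:

# Sketch: count the image placeholder tokens produced by the processor.
# A processor that expands placeholders into patch tokens yields far more than
# one token per image; the legacy merging path expects exactly one per image.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
print((inputs["input_ids"] == image_token_id).sum().item(), "image tokens for 1 image")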