huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Image processing for mllama is broken for Wx1 (i.e. height == 1) image sizes #34029

Open Pernekhan opened 2 days ago

Pernekhan commented 2 days ago

System Info

When an image of size 1x1 or Wx1 (i.e. height == 1) is passed, the normalize() method crashes with the following error:

File "/usr/local/lib/python3.12/dist-packages/transformers/models/mllama/image_processing_mllama.py", line 711, in preprocess
ERROR 10-07 06:31:28 engine.py:157]     image = self.normalize(
ERROR 10-07 06:31:28 engine.py:157]             ^^^^^^^^^^^^^^^
ERROR 10-07 06:31:28 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py", line 111, in normalize
ERROR 10-07 06:31:28 engine.py:157]     return normalize(
ERROR 10-07 06:31:28 engine.py:157]            ^^^^^^^^^^
ERROR 10-07 06:31:28 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/transformers/image_transforms.py", line 392, in normalize
ERROR 10-07 06:31:28 engine.py:157]     raise ValueError(f"mean must have {num_channels} elements if it is an iterable, got {len(mean)}")
ERROR 10-07 06:31:28 engine.py:157] ValueError: mean must have 1 elements if it is an iterable, got 3
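
For context, the "1 elements" in that error can be traced back to before normalize(). The sketch below is my own diagnostic, not part of the original report, and assumes the processor guesses the channel axis from the array shape via transformers.image_utils.infer_channel_dimension_format; for a 1x1 or Wx1 RGB image the size-1 spatial axis also looks like a channel axis, so the guess can resolve to channels-first and downstream code then sees a single channel:

import numpy as np
from PIL import Image
from transformers.image_utils import infer_channel_dimension_format

# Diagnostic sketch: how the channel axis may be guessed for a 1x1 RGB image.
array = np.array(Image.new("RGB", (1, 1)))  # shape (1, 1, 3): H=1, W=1, C=3
print(array.shape)

# With a size-1 spatial axis, both the first and the last axis look like a
# valid channel axis, so the guess may fall back to channels-first; the
# "channel" axis then has size 1, while the normalization mean has 3 entries.
print(infer_channel_dimension_format(array))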

Who can help?

@amyeroberts, @qubvel

Reproduction

from transformers import AutoImageProcessor
from PIL import Image

if __name__ == "__main__":
    image_processor = AutoImageProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

    # A 1x1 RGB image (width == height == 1) is enough to trigger the crash
    data = Image.new("RGB", (1, 1))

    data = image_processor.preprocess(data, return_tensors="pt").data

    print(data)

Expected behavior

It shouldn't crash
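
An untested workaround sketch (not part of the original report): if the crash indeed comes from ambiguous channel-dimension inference, explicitly passing input_data_format might avoid it. input_data_format is the standard preprocess argument for declaring the input layout, though I have not verified that mllama's preprocess forwards it to every step:

from transformers import AutoImageProcessor
from transformers.image_utils import ChannelDimension
from PIL import Image

image_processor = AutoImageProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
data = Image.new("RGB", (1, 1))

# Declare the layout explicitly so the processor does not have to guess the
# channel axis for the 1x1 image (PIL-derived arrays are channels-last).
out = image_processor.preprocess(data, return_tensors="pt", input_data_format=ChannelDimension.LAST)
print(out.data.keys())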

qubvel commented 2 days ago

Hi @Pernekhan, thanks for opening an issue! Just curious, what is the use case for such a size?

Pernekhan commented 2 days ago

Hi @Pernekhan, thanks for opening an issue! Just curious, what is the use case for such a size?

I personally don't know what the use case is, but we see some people passing such images in our system.

qubvel commented 2 days ago

Actually, I'm not sure mllama is intended to be used with such an image size, so it's probably fine that it crashes rather than silently producing some weird result.

Pernekhan commented 2 days ago

Actually, mllama is smart enough to work with such small images. For example, when you pass a 1x2 image, Llama-3.2-11B-Vision-Instruct responds with something like this: "The image is a solid black background with no discernible objects or features. There is no title or description provided for the image. The image appears to be a blank or empty space, with no visual elements or details to suggest what it might represent. It could be a placeholder image, a test image, or simply a blank slate for further content or design. Without more context, it's difficult to provide a more specific interpretation of the image."

1x1 and Wx1 images are also valid images, and mllama should be able to work with them (it already works with 1xH and 2x2 images).

qubvel commented 2 days ago

Thanks for the clarification! I still think it might be an error on the part of whoever is making the request, and I can't imagine a meaningful use case, but I might be missing something 🙂

If the fix is simple and does not bloat the code with special-case conditions, we would appreciate it. Would you have the bandwidth to submit a fix?

Pernekhan commented 1 day ago

Yes, it's hard to imagine a use-case for it, but we see cases in production.

My plate is full at the moment, but maybe someone else can pick it up?

Also, it's similar to this issue: https://github.com/huggingface/transformers/issues/21638

elsatch commented 1 day ago

Hi everyone! We decided to work on this issue during the PyData Madrid Hacktoberfest meetup (we had around 90-100 minutes to work on it). Some things we discovered: the problem does not appear to be specific to mllama; the same 1x1 image goes through the same preprocessing path with other image processors, for example ViT:

from transformers import AutoImageProcessor
from PIL import Image

if __name__ == "__main__":
    # Same 1x1 image, but with a different image processor (ViT)
    image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    data = Image.new("RGB", (1, 1))
    data = image_processor.preprocess(data, return_tensors="pt").data
    print(data)

Our current assumption is that, for images with a size-1 dimension, the channel axis is inferred incorrectly before the values are passed down to normalize(), which makes normalization fail with the mismatch described in this issue (1 inferred channel vs. a 3-element mean). Given that this preprocessing code is shared by most vision models, the issue should apply to all of them too.
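
If it helps whoever picks this up, a quick way to check that assumption (again a sketch of my own, assuming the layout is guessed with infer_channel_dimension_format) is to look at raw arrays with height == 1; the ambiguity depends only on the shape, not on the model:

import numpy as np
from transformers.image_utils import infer_channel_dimension_format

for width in (1, 2, 5):
    hwc = np.zeros((1, width, 3), dtype=np.uint8)  # height == 1, RGB, channels last
    # The size-1 first axis also looks like a channel axis, so the guess may
    # resolve to channels-first and downstream code then sees 1 channel.
    print(hwc.shape, infer_channel_dimension_format(hwc))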


P.S. To be really meta, I asked the Meta Llama Vision Free endpoint at Together about the usage of 1x1 pixel images, asking whether these could be tracking pixels. Here are the prompt and the response:

Prompt: "I scraped the content of a website, after obtaining permission from the owner. Then I processed the images present and discovered a 1x1 pixel image that was barely visible on the page. What could this 1x1 pixel image be for? Is there any use for such images in web pages? Could this be a tracking pixel? Reply in English."

Response: "The infamous 1x1 pixel image!

Yes, it's not uncommon to find tiny images like this on web pages. Here are some possible explanations:

  1. Tracking pixel: As you suspected, a 1x1 pixel image can be used as a tracking pixel. It's a small, invisible image that loads when a user visits a webpage, allowing the website owner to track visitor behavior, such as page views, clicks, or even user interactions. This pixel can be used to collect data about the user's browser, device, or location.
  2. Analytics and monitoring: Similar to tracking pixels, these small images can be used to monitor website performance, such as page load times, errors, or other technical issues. They can also help website owners track the effectiveness of their content or advertising campaigns.
  3. Advertising and retargeting: Advertisers might use 1x1 pixel images to track users who have interacted with their ads or content. This allows them to serve targeted ads or retarget users who have abandoned their shopping carts.
  4. Social media tracking: Social media platforms, like Facebook or Twitter, might use 1x1 pixel images to track user interactions on websites, such as likes, shares, or comments.
  5. Content optimization: These small images can be used to test different content variations, such as images or ads, to see which ones perform better.
  6. Security and anti-spam measures: Some websites might use 1x1 pixel images as a security measure to detect and prevent spam bots or malicious traffic.
  7. SEO and crawling: Search engines like Google might use 1x1 pixel images to help crawl and index websites more efficiently.

To determine the purpose of the 1x1 pixel image you found, you can try the following:

Keep in mind that not all 1x1 pixel images are malicious or invasive. Many are used for legitimate purposes, such as analytics or advertising. However, if you're concerned about user privacy or security, it's always a good idea to investigate further."