huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Image processing for mllama is broken for Wx1 (i.e. height == 1) image sizes #34029

Open Pernekhan opened 2 days ago

Pernekhan commented 2 days ago

System Info

When an image of size 1x1 or Wx1 (i.e. height == 1) is passed, the normalize() method crashes with the following error:

File "/usr/local/lib/python3.12/dist-packages/transformers/models/mllama/image_processing_mllama.py", line 711, in preprocess
ERROR 10-07 06:31:28 engine.py:157]     image = self.normalize(
ERROR 10-07 06:31:28 engine.py:157]             ^^^^^^^^^^^^^^^
ERROR 10-07 06:31:28 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py", line 111, in normalize
ERROR 10-07 06:31:28 engine.py:157]     return normalize(
ERROR 10-07 06:31:28 engine.py:157]            ^^^^^^^^^^
ERROR 10-07 06:31:28 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/transformers/image_transforms.py", line 392, in normalize
ERROR 10-07 06:31:28 engine.py:157]     raise ValueError(f"mean must have {num_channels} elements if it is an iterable, got {len(mean)}")
ERROR 10-07 06:31:28 engine.py:157] ValueError: mean must have 1 elements if it is an iterable, got 3
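
For context, the "1 elements" in that error can be traced back to before normalize(). The sketch below is my own diagnostic, not part of the original report, and assumes the processor guesses the channel axis from the array shape via transformers.image_utils.infer_channel_dimension_format; for a 1x1 or Wx1 RGB image the size-1 spatial axis also looks like a channel axis, so the guess can resolve to channels-first and downstream code then sees a single channel:

import numpy as np
from PIL import Image
from transformers.image_utils import infer_channel_dimension_format

# Diagnostic sketch: how the channel axis may be guessed for a 1x1 RGB image.
array = np.array(Image.new("RGB", (1, 1)))  # shape (1, 1, 3): H=1, W=1, C=3
print(array.shape)

# With a size-1 spatial axis, both the first and the last axis look like a
# valid channel axis, so the guess may fall back to channels-first; the
# "channel" axis then has size 1, while the normalization mean has 3 entries.
print(infer_channel_dimension_format(array))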

Who can help?

@amyeroberts, @qubvel

Reproduction

from transformers import AutoImageProcessor
from PIL import Image

if __name__ == "__main__":
    image_processor = AutoImageProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

    # A 1x1 RGB image (width == height == 1) is enough to trigger the crash
    data = Image.new("RGB", (1, 1))

    data = image_processor.preprocess(data, return_tensors="pt").data

    print(data)

Expected behavior

It shouldn't crash
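
An untested workaround sketch (not part of the original report): if the crash indeed comes from ambiguous channel-dimension inference, explicitly passing input_data_format might avoid it. input_data_format is the standard preprocess argument for declaring the input layout, though I have not verified that mllama's preprocess forwards it to every step:

from transformers import AutoImageProcessor
from transformers.image_utils import ChannelDimension
from PIL import Image

image_processor = AutoImageProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
data = Image.new("RGB", (1, 1))

# Declare the layout explicitly so the processor does not have to guess the
# channel axis for the 1x1 image (PIL-derived arrays are channels-last).
out = image_processor.preprocess(data, return_tensors="pt", input_data_format=ChannelDimension.LAST)
print(out.data.keys())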

qubvel commented 2 days ago

Hi @Pernekhan, thanks for opening an issue! Just curious, what is the use case for such a size?

Pernekhan commented 2 days ago

Hi @Pernekhan, thanks for opening an issue! Just curious, what is the use case for such a size?

I personally don't know what the use case is, but we see some people passing such images in our system.

qubvel commented 2 days ago

Actually, I'm not sure mllama is intended to be used with such an image size, so it's probably fine that it crashes rather than silently producing some weird result.

Pernekhan commented 2 days ago

Actually, mllama is smart enough to work with such small images. For example, when you pass a 1x2 image, Llama-3.2-11B-Vision-Instruct responds with something like this: "The image is a solid black background with no discernible objects or features. There is no title or description provided for the image. The image appears to be a blank or empty space, with no visual elements or details to suggest what it might represent. It could be a placeholder image, a test image, or simply a blank slate for further content or design. Without more context, it's difficult to provide a more specific interpretation of the image."

1x1 and Wx1 images are also valid images, and mllama should be able to work with them (it already works with 1xH and 2x2 images).

qubvel commented 2 days ago

Thanks for the clarification! I still think it might be an error on the part of whoever is making the request, and I can't imagine a meaningful use case, but I might be missing something 🙂

If the fix is simple and does not bloat the code with special-case conditions, we would appreciate it. Would you have the bandwidth to submit a fix?

Pernekhan commented 1 day ago

Yes, it's hard to imagine a use-case for it, but we see cases in production.

My plate is full at the moment, but maybe someone else can pick it up?

Also, it's similar to this issue: https://github.com/huggingface/transformers/issues/21638

elsatch commented 1 day ago

Hi everyone! We decided to work on this issue during the PyData Madrid Hacktoberfest meetup (we had around 90-100 minutes to work on it). Some things we discovered: the problem does not appear to be specific to mllama; the same 1x1 image goes through the same preprocessing path with other image processors, for example ViT:

from transformers import AutoImageProcessor
from PIL import Image

if __name__ == "__main__":
    # Same 1x1 image, but with a different image processor (ViT)
    image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    data = Image.new("RGB", (1, 1))
    data = image_processor.preprocess(data, return_tensors="pt").data
    print(data)

Our current assumption is that, for images with a size-1 dimension, the channel axis is inferred incorrectly before the values are passed down to normalize(), which makes normalization fail with the mismatch described in this issue (1 inferred channel vs. a 3-element mean). Given that this preprocessing code is shared by most vision models, the issue should apply to all of them too.
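
If it helps whoever picks this up, a quick way to check that assumption (again a sketch of my own, assuming the layout is guessed with infer_channel_dimension_format) is to look at raw arrays with height == 1; the ambiguity depends only on the shape, not on the model:

import numpy as np
from transformers.image_utils import infer_channel_dimension_format

for width in (1, 2, 5):
    hwc = np.zeros((1, width, 3), dtype=np.uint8)  # height == 1, RGB, channels last
    # The size-1 first axis also looks like a channel axis, so the guess may
    # resolve to channels-first and downstream code then sees 1 channel.
    print(hwc.shape, infer_channel_dimension_format(hwc))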


P.S. To be really meta, I asked the Meta Llama Vision Free endpoint at Together about the usage of 1x1 pixel images, asking whether these could be tracking pixels. Here are the prompt and the response:

Prompt: "I scraped the content of a website, after obtaining permission from the owner. Then I processed the images present and discovered a 1x1 pixel image that was barely visible on the page. What could this 1x1 pixel image be for? Is there any use for such images in web pages? Could this be a tracking pixel? Reply in English."

Response: "The infamous 1x1 pixel image!

Yes, it's not uncommon to find tiny images like this on web pages. Here are some possible explanations:

  1. Tracking pixel: As you suspected, a 1x1 pixel image can be used as a tracking pixel. It's a small, invisible image that loads when a user visits a webpage, allowing the website owner to track visitor behavior, such as page views, clicks, or even user interactions. This pixel can be used to collect data about the user's browser, device, or location.
  2. Analytics and monitoring: Similar to tracking pixels, these small images can be used to monitor website performance, such as page load times, errors, or other technical issues. They can also help website owners track the effectiveness of their content or advertising campaigns.
  3. Advertising and retargeting: Advertisers might use 1x1 pixel images to track users who have interacted with their ads or content. This allows them to serve targeted ads or retarget users who have abandoned their shopping carts.
  4. Social media tracking: Social media platforms, like Facebook or Twitter, might use 1x1 pixel images to track user interactions on websites, such as likes, shares, or comments.
  5. Content optimization: These small images can be used to test different content variations, such as images or ads, to see which ones perform better.
  6. Security and anti-spam measures: Some websites might use 1x1 pixel images as a security measure to detect and prevent spam bots or malicious traffic.
  7. SEO and crawling: Search engines like Google might use 1x1 pixel images to help crawl and index websites more efficiently.

To determine the purpose of the 1x1 pixel image you found, you can try the following:

Keep in mind that not all 1x1 pixel images are malicious or invasive. Many are used for legitimate purposes, such as analytics or advertising. However, if you're concerned about user privacy or security, it's always a good idea to investigate further."