huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Everything CLIP related seems to break starting from transformers 4.28.0 #24857

Closed andreaferretti closed 12 months ago

andreaferretti commented 1 year ago

System Info

Who can help?

@amyeroberts

Information

Tasks

Reproduction

It seems to me that there is some regression starting from transformers 4.28.0 that affects the CLIP vision model and everything related to it.

In particular, I am having issues with

CLIPSeg

For CLIPSeg, I am able to get the expected masks by essentially following the example here:

from transformers import AutoProcessor, CLIPSegForImageSegmentation
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a cat", "a remote", "a blanket"]
inputs = processor(text=texts, images=[image] * len(texts), padding=True, return_tensors="pt")

outputs = model(**inputs)

logits = outputs.logits
print(logits.shape)

Then logits contains the values from which I can obtain a mask with something like

import torch

mask = torch.exp(logits)
mask /= mask.max()

I tested this and it works reliably until transformers 4.27.4. But with transformers 4.28.0, I get masks that are completely black regardless of the input image.

CLIPVisionModel

This is harder to describe, since it relies on an internal model. I have trained a model for custom subject generation that makes use of the image embeddings generated by CLIPVisionModel. Everything works well up to transformers 4.27.4. If I switch to 4.28.0, the generated image changes completely, and the only change is installing 4.28.0.

In fact, if I save the embeddings generated by CLIPVisionModel for any random image with the two versions, I see that they are different. To be sure, this is how I generate image embeddings:

clip = CLIPModel.from_pretrained(...)
preprocessor = CLIPProcessor.from_pretrained(...)
...
encoded_data = preprocessor(
    text=prompts,
    images=images,
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)
clip_output = clip(
    input_ids=encoded_data.input_ids,
    pixel_values=encoded_data.pixel_values,
)
image_embeds = clip.visual_projection(
    clip_output.vision_model_output.last_hidden_state
)

For reference, I am using clip-vit-large-patch14.

Expected behavior

I would expect CLIPVisionModel to give the same result on the same image in both 4.27.4 and 4.28.0.

andreaferretti commented 1 year ago

Just to make this reproducible, here we can see that the output of CLIPProcessor changes. I run the following script:

from PIL import Image
import requests

import transformers
from torchvision.transforms.functional import to_tensor
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
reference = to_tensor(image)

encoded_data = processor(
    text=[""],
    images=[reference],
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)

print(transformers.__version__)
print(encoded_data.pixel_values.mean())

With 4.27.4 I get

4.27.4
tensor(0.2463)

With 4.28.0 I get

4.28.0
tensor(-1.6673)

andreaferretti commented 1 year ago

I figured out the issue: the CLIPProcessor expects tensors in the range [0, 255], but only starting from transformers 4.28.0. This seems like a pretty breaking change to me! If I multiply my tensor by 255, I get the right results.
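
For concreteness, the same workaround applied to the reproduction script above (reference is the [0, 1] tensor produced by to_tensor) would look like this, and should give back the 4.27.4 mean:

encoded_data = processor(
    text=[""],
    images=[reference * 255],  # scale the [0, 1] tensor back to [0, 255]
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)
print(encoded_data.pixel_values.mean())  # should be ~0.2463 again on 4.28.0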

NielsRogge commented 1 year ago

Hi,

Thanks for reporting. This seems related to https://github.com/huggingface/transformers/issues/23096 and may be caused by https://github.com/huggingface/transformers/pull/22458. cc @amyeroberts

amyeroberts commented 1 year ago

Hi @andreaferretti, thanks for raising this issue!

What's being observed is actually a resolution of inconsistent behaviour in the previous CLIP feature extractors. I'll explain:

In the previous behaviour, images kept their upscaled values after resizing. Currently, if an image was upscaled during resizing, the pixel values are downscaled back, e.g. to between 0-1. This ensures that the user can set do_resize to True or False and the only difference in the output image is its size (and interpolated pixels). Previously, if you set do_resize=False, your image pixel values were never upscaled: they remained between 0-1 and were then rescaled by 1/255 anyway, which is exactly what happens now regardless of do_resize.
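
As a minimal sketch of that consistency (assuming CLIPImageProcessor and a random float array standing in for an image; not taken from the issue):

import numpy as np
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# A float image whose values are already in [0, 1], like the output of to_tensor.
image_0_1 = np.random.rand(224, 224, 3).astype(np.float32)

# With the current behaviour both calls apply the same rescale (1/255) and
# normalization, so the statistics match and only the spatial size can differ.
resized = image_processor(images=image_0_1, return_tensors="pt").pixel_values
not_resized = image_processor(images=image_0_1, do_resize=False, do_center_crop=False, return_tensors="pt").pixel_values
print(resized.mean(), not_resized.mean())  # roughly equal; both off unless do_rescale=False is also passed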

Rather than trying to infer the processor behaviour from the inputs, we keep the processing behaviour consistent and let the user control it explicitly. If you wish to input images whose pixel values have already been downscaled, you just need to tell the image processor not to do any additional scaling using the do_rescale flag:

outputs = image_processor(images, do_rescale=False)

Alternatively, you could pass in the images without calling to_tensor.
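
To make that concrete with the checkpoint and image from the reproduction script above (a sketch, not taken verbatim from the issue), both suggested options should give back the expected statistics on 4.28.0:

from PIL import Image
import requests
from torchvision.transforms.functional import to_tensor
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Option 1: pass the PIL image directly and let the processor do the rescaling.
pil_pixels = image_processor(images=image, return_tensors="pt").pixel_values

# Option 2: keep the [0, 1] float tensor from to_tensor, but skip the extra rescale.
tensor_pixels = image_processor(images=to_tensor(image), do_rescale=False, return_tensors="pt").pixel_values

print(pil_pixels.mean())     # should be ~0.2463, matching the 4.27.4 output above
print(tensor_pixels.mean())  # should match the PIL path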

In the issues linked by @NielsRogge, this is also explained: https://github.com/huggingface/transformers/issues/23096#issuecomment-1557699476

However, this is the second time a similar issue has been raised, indicating that the behaviour is unexpected. I'll think about how best to address this with documentation or a possible warning within the code.
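
For illustration only, one shape such a heuristic warning could take (a hypothetical helper, not actual transformers code):

import warnings
import numpy as np

def maybe_warn_about_double_rescaling(image: np.ndarray, do_rescale: bool) -> None:
    # Hypothetical check: float images whose values already sit in [0, 1] are
    # probably pre-scaled, so dividing them by 255 again is likely unintended.
    if do_rescale and np.issubdtype(image.dtype, np.floating) and image.max() <= 1.0:
        warnings.warn(
            "Input image values are already in [0, 1] but do_rescale=True, so they will "
            "be rescaled by 1/255 again. Pass do_rescale=False if the images are already scaled."
        )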

andreaferretti commented 1 year ago

Yeah, it would be useful to add a warning mentioning do_rescale, as well as to mention this change in the documentation of CLIP and related models.

sayakpaul commented 7 months ago

I am still getting widely different results between the JAX implementation of CLIP in Scenic and the one we have in transformers (PyTorch).

from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
import torch
from PIL import Image 

import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip

def _clip_preprocess(images, size):
  target_shape = images.shape[:-3] + (size, size, images.shape[-1])
  images = jax.image.resize(images, shape=target_shape, method='bicubic')
  images = clip.normalize_image(images)

  return images

def get_image_in_format(image, size, format="pt"):
    images = np.array(image) / 255.
    images = np.expand_dims(images, 0)
    pp_images = _clip_preprocess(images, size)

    if format == "pt":
        inputs = {}
        inputs["pixel_values"] = torch.from_numpy(np.array(pp_images))
        inputs["pixel_values"] = inputs["pixel_values"].permute(0, 3, 1, 2)
        return inputs 

    inputs = pp_images
    return inputs

# Comes from https://huggingface.co/datasets/diffusers/docs-images/blob/main/amused/glowing_512_2.png
image = Image.open("glowing_512_2.png")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14-336").eval()

inputs = get_image_in_format(image, processor.crop_size["height"], format="pt")
with torch.no_grad():
    output = model(**inputs)

temp = output.image_embeds[0, :4].numpy().flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
print("=====Printing JAX model=====")

_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip.load_model_vars(_CLIP_MODEL_NAME)
input_image_size = clip.IMAGE_RESOLUTION[_CLIP_MODEL_NAME]

images = get_image_in_format(image, size=input_image_size, format="jax")
image_embs, _ = _model.apply(_model_vars, images, None)
temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

Gives:

-0.0898, 0.1304, 0.2402, -0.0378
=====Printing JAX model=====
-0.0046, 0.0068, 0.0124, -0.0020

These results seem quite different for the exact same input.

@sanchit-gandhi would you have a clue about it?

NielsRogge commented 7 months ago

Hi,

Not sure if you're comparing apples to apples. When comparing the original CLIP repository to the Transformers one, they match: https://colab.research.google.com/drive/15ZhC32ovBKAU5JqC-kcIOntW_oU-JrkB?usp=sharing.

Scenic is not the original implementation of CLIP, so there might be some differences. I would first check whether the Scenic implementation outputs the same logits as the OpenAI CLIP repository.

sayakpaul commented 7 months ago

You are right:

import clip
import torch 
import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip_scenic

inputs = np.random.randn(1, 336, 336, 3).astype(np.float32)  # float32 so the fp32 PyTorch model and JAX see the same values
model, preprocess = clip.load("ViT-L/14@336px", device="cpu")

with torch.no_grad():
    image = torch.from_numpy(inputs.transpose(0, 3, 1, 2))
    image_features = model.encode_image(image).numpy()
    print(image_features.shape)

temp = image_features[0, :4].flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
print("=====Printing JAX model=====")

_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip_scenic.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip_scenic.load_model_vars(_CLIP_MODEL_NAME)

images = jax.numpy.array(inputs)
image_embs, _ = _model.apply(_model_vars, images, None)
print(image_embs.shape)
temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

Gives:

(1, 768)
-0.1827, 0.7319, 0.8779, 0.4829
=====Printing JAX model=====
(1, 768)
-0.0107, 0.0429, 0.0514, 0.0283

Sorry for the false alarm here. Have raised an issue: https://github.com/google-research/scenic/issues/991.