illuin-tech / colpali

The code used to train and run inference with the ColPali architecture.
https://huggingface.co/vidore
MIT License

Example code is not working #107

Open maxjeblick opened 1 day ago

maxjeblick commented 1 day ago

When I run the example from the README, I encounter the following error:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same
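
For context, this is PyTorch's generic dtype-mismatch error: the checkpoint's conv weights are in bfloat16 while the processor returns float32 pixel values. A minimal standalone illustration (my own sketch, not ColPali-specific):

import torch

# A bare conv layer with bfloat16 weights, fed a default float32 input,
# raises the same RuntimeError: input and weight dtypes must match.
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(dtype=torch.bfloat16)
x = torch.randn(1, 3, 32, 32)  # float32 by default
conv(x)  # RuntimeError: Input type ... and weight type ... should be the same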

The code works if I cast pixel_values explicitly:

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

batch_images["pixel_values"] = batch_images["pixel_values"].to(torch.bfloat16)
# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)
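
A more general variant of the same workaround (my own sketch, not part of the repo) casts every floating-point tensor in the processor output to the model's parameter dtype, so it also works if the checkpoint is loaded in float16:

import torch

def cast_batch_to_model_dtype(batch, model):
    # Hypothetical helper: align all floating-point tensors (e.g. pixel_values)
    # with the model's parameter dtype; integer tensors such as input_ids
    # are left untouched.
    dtype = next(model.parameters()).dtype
    return {k: v.to(dtype) if torch.is_floating_point(v) else v
            for k, v in batch.items()}

batch_images = cast_batch_to_model_dtype(batch_images, model)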

Stack trace:

/home/max/PycharmProjects/colpali/venv/bin/python /home/max/PycharmProjects/colpali/example.py 
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.37it/s]
You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.
Traceback (most recent call last):
  File "/home/max/PycharmProjects/colpali/example.py", line 32, in <module>
    image_embeddings = model(**batch_images)
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/colpali_engine/models/paligemma/colpali/modeling_colpali.py", line 38, in forward
    outputs = self.model(*args, output_hidden_states=True, **kwargs)  # (batch_size, sequence_length, hidden_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/transformers/models/paligemma/modeling_paligemma.py", line 496, in forward
    image_features = self.get_image_features(pixel_values)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/transformers/models/paligemma/modeling_paligemma.py", line 405, in get_image_features
    image_outputs = self.vision_tower(pixel_values)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/transformers/models/siglip/modeling_siglip.py", line 1190, in forward
    return self.vision_model(
           ^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/transformers/models/siglip/modeling_siglip.py", line 1089, in forward
    hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/transformers/models/siglip/modeling_siglip.py", line 311, in forward
    patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/max/PycharmProjects/colpali/venv/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same

Process finished with exit code 1
ManuelFay commented 1 day ago

Interesting! Not sure what changed; I recently ran into this with another model as well. Maybe a recent update on the processor side disabled autocasting or changed the output dtype. I'll update the example, but I want to look into it first. Thanks!
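
For anyone hitting this in the meantime, another option (a sketch I have not verified against this exact setup) is to run the forward pass under autocast, which converts conv inputs to the autocast dtype inside the model:

# Alternative workaround: let autocast handle the float32 -> bfloat16 conversion
# (assumes a CUDA device and bfloat16 weights, as in the report above)
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)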