Hello! Thanks for the catch! For PaliGemma, the tokenizer should be set with `padding_side="right"`. I am pushing an update to force that behaviour and will push updated checkpoints. In practice, as is, with a mock image, it just introduces extra noise but should still work!
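For reference, a minimal sketch of what forcing right padding looks like (this is just an illustration using the standard Hugging Face processor API, not the actual colpali_engine patch, and the checkpoint name is only an example):

```python
# Illustration only (not the actual colpali_engine patch): force the PaliGemma
# tokenizer to pad on the right so pad tokens are appended after the text.
from transformers import AutoProcessor

# Example checkpoint name; substitute the base model you actually use.
processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-448")
processor.tokenizer.padding_side = "right"

batch = processor.tokenizer(
    ["What is shown in the figure?", "Summarize the table in the image"],
    padding="longest",
    return_tensors="pt",
)
# Shorter queries now end in pad tokens instead of starting with them, so the
# image placeholder tokens prepended by the processor stay at a fixed position
# at the front of every row.
```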
Does the model require that extra noise? If this was done during training as well, the padding tokens might lead to a degradation in performance, since the model has fewer tokens available to represent the text and the attention mask does not exclude the image tokens in colpali_engine/models/paligemma_colbert_architecture.py, line 83.
In practice, the tokens corresponding to the input_ids that should be replaced by the image soft tokens currently get included in the query. Since such a token is associated with nothing (and was never trained, because it is otherwise replaced), it acts as a learned padding token (but one that is attended to). It should not hurt performance much and might even act as extra "buffer tokens".
I am retraining checkpoints with the fix (since it happens during training as well), which I will release once I push the update, along with the benchmark results!
So everything should be fixed!
Would be awesome if you can confirm using this new model: https://huggingface.co/vidore/colpali-v1.1
and the code in branch: https://github.com/illuin-tech/colpali/tree/hard-negs
The base model version is fixed and the padding side is set to right, so the issue should be resolved @hannah348
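A quick way to sanity-check the fix (this assumes the checkpoint, or its base model, ships the processor/tokenizer files; otherwise load them from the base model repo instead):

```python
# Sanity check (assumption: the repo ships tokenizer/processor files).
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("vidore/colpali-v1.1")
print(processor.tokenizer.padding_side)  # should print "right" after the fix
```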
I was using the `process_queries` method and I realized that the processor uses left padding. As a result, the line `batch_query["input_ids"] = batch_query["input_ids"][..., processor.image_seq_length:]` cuts off the padding tokens rather than the image tokens, and if there were padding tokens in the sequence, the end of the image sequence ends up at the beginning of the query. Is that intentional? If so, what is the reasoning behind that?
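To make the behaviour concrete, here is a toy illustration (made-up token ids, with image_seq_length shortened to 4) of what that slice does under left vs. right padding:

```python
import torch

image_seq_length = 4            # stands in for processor.image_seq_length
IMG, PAD = 99, 0                # toy ids for the image placeholder and pad token
short_query, long_query = [11, 12], [11, 12, 13, 14]

# Left padding: pads are prepended, so the short row is
# [PAD, PAD, IMG, IMG, IMG, IMG, 11, 12].
left = torch.tensor([
    [PAD, PAD] + [IMG] * image_seq_length + short_query,
    [IMG] * image_seq_length + long_query,
])
print(left[..., image_seq_length:])
# -> tensor([[99, 99, 11, 12], [11, 12, 13, 14]])
# The slice removed the pads plus only part of the image sequence, so the tail
# of the image tokens ends up at the start of the "query-only" ids.

# Right padding: pads are appended, so the short row is
# [IMG, IMG, IMG, IMG, 11, 12, PAD, PAD].
right = torch.tensor([
    [IMG] * image_seq_length + short_query + [PAD, PAD],
    [IMG] * image_seq_length + long_query,
])
print(right[..., image_seq_length:])
# -> tensor([[11, 12,  0,  0], [11, 12, 13, 14]])
# Only query tokens (plus trailing pads) remain, which is the intended behaviour.
```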