illuin-tech / colpali

The code used to train and run inference with the ColPali architecture.
https://huggingface.co/vidore
MIT License

Multilingual Colpali Adapter Training #19

Closed: contrebande-labs closed this issue 4 weeks ago

contrebande-labs commented 1 month ago

Hi!

First of all, congrats and thanks for open-sourcing ColPali.

I see everywhere in the ColPali model cards that it has been trained only in English and only on single-page, A4-sized PDF documents.

So I was wondering if I could get some advice on how to train a ColPali adapter that is multilingual, multi-page, and independent of image ratio and type (for pictures and documents, in all the languages the underlying VLM can support).

Is PaliGemma still the best choice? There aren't tons of open-source VLMs, but there are Idefics, Florence (I know you tried these), LLaVA-Mistral, etc.

Maybe there could also be a ViDoRe benchmark for multilingual retrieval?

If I can get some guidance, I'm up for contributing these features to your repositories.

Thanks!

tonywu71 commented 1 month ago

Hi @contrebande-labs, thanks for the kind message! 👋🏼

To be as clear as possible, I'll answer your questions in order.

So I was wondering if I could get some advice on how to train a ColPali adapter that is multilingual, multi-page, and independent of image ratio and type (for pictures and documents, in all the languages the underlying VLM can support).

For multilingual support, we demonstrated in our paper that ColPali can easily be fine-tuned on non-English languages (see Section 6: Ablation study, subsection "Can the model adapt to new tasks?"). For your use case, you'll simply need a new dataset covering the languages of interest, and you can then further train our ColPali checkpoint. Even though the model is somewhat regularized when you use a LoRA adapter, I would recommend keeping some English documents in the training mix to prevent catastrophic forgetting.
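
To make that data mixing concrete, here's a rough sketch with the datasets library (the multilingual dataset name is a placeholder, and the mixing ratio is just an example):

```python
from datasets import load_dataset, interleave_datasets

# Keep a share of English training data alongside the new multilingual data
# to limit catastrophic forgetting during LoRA fine-tuning.
english = load_dataset("vidore/colpali_train_set", split="train")               # English ColPali training data
multilingual = load_dataset("my-org/multilingual-doc-queries", split="train")   # placeholder dataset name

mixed = interleave_datasets(
    [english, multilingual],
    probabilities=[0.3, 0.7],  # e.g. keep ~30% English examples
    seed=42,
)
```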

Our proposed ColPali architecture treats each page as a document. Thus, you won't be able to train a multi-page version of our model, at least not without first modifying how pages are ingested.

The image ratio is handled differently depending on the vision backbone of the VLM. For ColPali, which is based on PaliGemma, the images are simply reshaped to a square format. Note that the PaliGemma authors tested this and found that not preserving the aspect ratio does not hurt their VLM's performance. If you really want to preserve ratios, you should use our ColIdefics model, as it uses Idefics-2 (which uses the SPHINX ratio method). And if you feel like training your own ColPali-like retriever, I'd recommend using the new Idefics-3 model that Hugging Face released a few days ago.
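
To make the square-reshaping point explicit, here is a minimal sketch (the PaliGemma processor already does this internally; 448 px is its common input resolution):

```python
from PIL import Image

def to_square_input(page: Image.Image, size: int = 448) -> Image.Image:
    # PaliGemma-style preprocessing: the page is resized to a fixed square,
    # so the original aspect ratio is not preserved.
    return page.resize((size, size), Image.Resampling.BICUBIC)
```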

Is PaliGemma still the best choice? There aren't tons of open-source VLMs, but there are Idefics, Florence (I know you tried these), LLaVA-Mistral, etc.

For the choice of the VLM backbone, it is hard to predict which model will give the best performance without proper testing. However, I'd suggest:

  1. Looking at the VLM performance on DocVQA: while it's not a perfect benchmark, it still gives a sense of how the VLM would help in document retrieval.
  2. Picking a VLM with a high input resolution. Similar to what Beyer et al. state in their PaliGemma paper, we observed that you need a resolution of at least 448 pixels to embed your documents efficiently.
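
As an illustration of that resolution constraint, here's a small sketch using pdf2image (a library choice for the example only, not something our codebase requires):

```python
from pdf2image import convert_from_path  # requires poppler to be installed

MIN_SIDE = 448  # minimum useful resolution mentioned above

def render_pages(pdf_path: str, dpi: int = 150):
    pages = convert_from_path(pdf_path, dpi=dpi)
    out = []
    for page in pages:
        if min(page.size) < MIN_SIDE:
            # Upscale proportionally so the shorter side reaches 448 px.
            scale = MIN_SIDE / min(page.size)
            page = page.resize((round(page.width * scale), round(page.height * scale)))
        out.append(page)
    return out
```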

Maybe there could also be a ViDoRe benchmark for multilingual retrieval?

My co-authors and I are working on an updated version of the ViDoRe benchmark. Follow us on Twitter to get early updates on this work!

If I can get some guidance, I'm up for contributing these features to your repositories.

Sure, we are super excited about anyone contributing to our vision of retrieval in vision space! 🫶🏼

Regards,

Tony

contrebande-labs commented 1 month ago

Hi @tonywu71 ! Thanks for the exhaustive reply.

It's not so much that I want an index in another (single) language than English. It's that I want to select a genuinely (and not just incidentally) multilingual LLM backbone and keep its multilingual capabilities intact throughout the multiple training passes towards the "visual ColBERT" adapter.

For instance, I'm hesitating between the Idefics-3 and LLaVA-OneVision VLM adaptation methods on mistral-nemo-instruct-2407. Maybe I should do both and pick the better of the two based on a multilingual ViDoRe-like benchmark.

I'm using Cassandra/DataStax AstraDB for the vector index backend, and I have LOTS of multilingual scanned documents from many European and North American national libraries. Our end application is content generation (images/text) for the perfumery, perfume making, olfactory arts, organic chemistry, VOC synthesis patent, and chemoreception domains. I understand the catastrophic forgetting and regularization concerns, and I think I have already adopted sane mitigation practices to avoid such regressions. I think I'd like to work on the benchmark first to be able to guide the training.

But I would appreciate some guidance on specifically that. Would the best way be to create dataset "add-ons" to the ones you already published? Have all the paper datasets been published yet? Are you planning on merging them into one single ViDoRe benchmark dataset?

tonywu71 commented 1 month ago

It's not so much that I want an index in another (single) language than English. It's that I want to select a genuinely (and not just incidentally) multilingual LLM backbone and keep its multilingual capabilities intact throughout the multiple training passes towards the "visual ColBERT" adapter.

I understand your use case now, thanks for the clarification!

For instance, I'm hesitating between the Idefics-3 and LLaVA-OneVision VLM adaptation methods on mistral-nemo-instruct-2407. Maybe I should do both and pick the better of the two based on a multilingual ViDoRe-like benchmark.

For the choice of the model, I agree with you: it’d be better to try both. My gut feeling is that multilingual capabilities are unlocked by the LLM backbone. Here, Llama 3 (used by Idefics-3) was trained on "over 30 languages," and Qwen 2 (used by LLaVA-OV) on 29 languages, so I believe both models are relevant to your use case.

One extra interesting question though: will your RAG use case cover cross-language questions? If yes, the Qwen2 blog post mentions that "significant effort [was made] to address code-switching," so maybe LLaVA-OV will work better on these examples.

I'm using Cassandra/DataStax AstraDB for the vector index backend, and I have LOTS of multilingual scanned documents from many European and North American national libraries. Our end application is content generation (images/text) for the perfumery, perfume making, olfactory arts, organic chemistry, VOC synthesis patent, and chemoreception domains. I understand the catastrophic forgetting and regularization concerns, and I think I have already adopted sane mitigation practices to avoid such regressions. I think I'd like to work on the benchmark first to be able to guide the training.

Since you've mentioned scanned documents, please note that our synthetic training and eval sets only contain clean PDF files. For our next dataset iteration, we are considering using image augmentation on these synthetic datasets to make them more similar to scans.
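
For illustration, a scan-style augmentation could look roughly like this (one possible recipe with PIL/NumPy, not our actual pipeline):

```python
import numpy as np
from PIL import Image, ImageFilter

def scan_like(page: Image.Image, seed: int = 0) -> Image.Image:
    # Approximate scanner artefacts on a clean rendered page: slight rotation,
    # mild blur, and additive sensor-like noise.
    rng = np.random.default_rng(seed)
    out = page.convert("RGB")
    out = out.rotate(rng.uniform(-1.5, 1.5), expand=True, fillcolor=(255, 255, 255))
    out = out.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.3, 1.0)))
    arr = np.asarray(out).astype(np.float32)
    arr += rng.normal(0, 6.0, arr.shape)  # additive Gaussian noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```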

But I would appreciate some guidance on specifically that. Would the best way be to create dataset "add-ons" to the ones you already published? Have all the paper datasets been published yet? Are you planning on merging them into one single ViDoRe benchmark dataset?

I'll have to first discuss this with the rest of the team (they are taking some days off at the moment). We'll publish the new dataset(s) once they're ready!

As far as I’m concerned, I would suggest creating your dataset on your personal Hugging Face org. Once finalized and if relevant, we can add it to our ViDoRe collection (at this time, we prefer to work with separate datasets). This will automatically include your new dataset in our evaluation pipeline from our other vidore-benchmark repository (see this section in particular).
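
A minimal sketch of that workflow (the repo id and column names below are illustrative, not a required ViDoRe schema):

```python
from datasets import Dataset, Image

# Build a small image/query dataset and publish it under your own HF org
# first; the repo id below is hypothetical.
ds = Dataset.from_dict({
    "image": ["pages/page_001.png", "pages/page_002.png"],  # rendered page files
    "query": ["Quelle est la formule du linalol ?", "Synthesis route for vanillin"],
}).cast_column("image", Image())

ds.push_to_hub("my-org/multilingual-doc-retrieval-eval", private=True)
```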

ManuelFay commented 1 month ago

Hello, thanks for the interest!

For multilinguality, note that PaliGemma is massively multilingual already, and we have shown it has good zero-shot capabilities in non-English languages. Although ColPali was trained only on English, we see good performance in French on ViDoRe, for example, so this may already be enough for your use case!

For scanned documents, DocVQA contains a ton of scans, so this should already work quite well! There's also a version on the leaderboard called "docmatix-only / with-docmatic" with an even greater ratio of training data containing scanned PDFs spanning multiple domains!

ManuelFay commented 1 month ago

All the evaluation datasets from the paper are published, and we will release the training set as a single big dataset in less than two weeks!

contrebande-labs commented 1 month ago

Bonjour @ManuelFay !

I'm still researching my baseline. I'm currently using your fine-tunes, datasets, and scripts to reproduce your benchmarks first; I'm not quite done yet. After that, here are the objectives I will be aiming for:

  1. Replace the index/query infrastructure to match what we have in production (Cassandra 5 vector search), as sketched after this list;
  2. Create a test benchmark corpus (or contribute to ViDoRe) for our use cases and domains: organic chemistry patents, papers, scanned books and ebooks (ePUB, PDF, etc.) from the mid-eighteenth century onward in French and English (then German, Dutch, Italian and Spanish) to retrieve compound monographs, synthesis pathways and formulas;
  3. Create a ColPali adapter for an existing SOTA open-source VLM (which changes every week at the moment) to top the ViDoRe leaderboard.
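
Regarding point 1, here is a rough sketch of the layout I have in mind, assuming Cassandra 5 vector search (SAI) and a driver version with vector-type support; host, keyspace, and table names are made up, and the final MaxSim scoring would happen client-side:

```python
from cassandra.cluster import Cluster

# ColPali emits many 128-d patch embeddings per page, so store one row per
# patch and leave the late-interaction (MaxSim) re-scoring to the client.
session = Cluster(["127.0.0.1"]).connect("retrieval")

session.execute("""
    CREATE TABLE IF NOT EXISTS page_embeddings (
        doc_id text, page int, patch int,
        embedding vector<float, 128>,
        PRIMARY KEY ((doc_id, page), patch)
    )
""")
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS page_embeddings_ann
    ON page_embeddings (embedding) USING 'StorageAttachedIndex'
    WITH OPTIONS = {'similarity_function': 'cosine'}
""")

# One ANN lookup per query-token embedding; the candidate pages returned here
# are then re-scored with the full MaxSim score outside the database.
query_token_embedding = [0.0] * 128  # placeholder vector
rows = session.execute(
    "SELECT doc_id, page FROM page_embeddings ORDER BY embedding ANN OF %s LIMIT 20",
    (query_token_embedding,),
)
```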

I'm going to report here on the progress.

ManuelFay commented 4 weeks ago

Awesome! Very eager to see the progress, don't hesitate to tag me! The bottleneck for adapting other VLMs is often the required VRAM; your best bet is to use mined-negative training with very small batch sizes :)
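
To make the mined-negative idea concrete, here's a bare-bones sketch of a late-interaction loss with one mined hard negative per query (plain PyTorch, not the repo's actual training code):

```python
import torch
import torch.nn.functional as F

def maxsim_score(q, d):
    # ColBERT-style late interaction: q is (n_query_tokens, dim) and
    # d is (n_doc_tokens, dim); each query token keeps its best-matching
    # document token, and the matches are summed.
    return (q @ d.T).max(dim=1).values.sum()

def mined_negative_loss(q, pos_doc, neg_doc):
    # Cross-entropy over (positive, mined hard negative) for a single query;
    # this works even at batch size 1, which keeps VRAM requirements low.
    scores = torch.stack([maxsim_score(q, pos_doc), maxsim_score(q, neg_doc)])
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```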

ManuelFay commented 2 weeks ago

Here's the dataset: https://huggingface.co/datasets/vidore/colpali_train_set
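
For anyone landing here, loading it looks like this:

```python
from datasets import load_dataset

# Stream the released training set to avoid downloading everything at once.
train_set = load_dataset("vidore/colpali_train_set", split="train", streaming=True)
print(next(iter(train_set)).keys())
```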

contrebande-labs commented 2 weeks ago

Great, thanks!

efenocchi commented 2 weeks ago

Hi everyone @ManuelFay @contrebande-labs @tonywu71, sorry to bother you again.

I'm trying to swap the backbone to Idefics3, but I'm encountering some errors. I'm unable to get batch_query from the text because of an error at sample += image_prompt_string + split_sample[i + 1] in transformers/src/transformers/models/idefics3/processing_idefics3.py (line 324). I tried using the forward_vision_pali approach by passing the image, then deleting pixel_values and following the steps provided in that function:

# Drop the image inputs and strip the image-placeholder token positions so that
# only the text tokens go through the model (mirroring forward_vision_pali).
del batch_query["pixel_values"]
batch_query["input_ids"] = batch_query["input_ids"][..., self.processor.image_seq_len :]
batch_query["attention_mask"] = batch_query["attention_mask"][..., self.processor.image_seq_len :]
...

However, this approach gives an error in the compute_loss function when I call the model:

query_outputs = model(input_ids=inputs["query_input_ids"], attention_mask=inputs["query_attention_mask"])

I get 'NoneType' object has no attribute 'get_seq_length' in past_seen_tokens = past_key_values.get_seq_length() in transformers/src/transformers/models/idefics3/modeling_idefics3.py.

Has anyone tried this model?

I followed the same steps used for Idefics2 but swapped in the corresponding Idefics3 classes taken from this PR.

ManuelFay commented 2 weeks ago

Hey! Definitely a model we want to support, and I'm going to look into it! It's important to update the collator function to make the forward pass compatible with the Idefics3 structure; maybe it requires a template, etc. If you look into our code, we support Idefics2 already. Adapting to Idefics3 might require a bit more work, but it's crucial to make sure the input batch is given the way Idefics3 expects it (reading the docs + a debugger is often the way!).
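
In the meantime, an untested sketch of a query-only collation that sidesteps the image-prompt splitting (it goes through the processor's underlying tokenizer directly; the key names follow the compute_loss call shown above):

```python
def collate_queries(queries, processor, max_length=256):
    # Untested sketch: queries carry no images, so tokenize them with the
    # processor's underlying tokenizer and skip the Idefics3 image-prompt
    # splitting code path that raised the error above.
    batch = processor.tokenizer(
        queries,
        padding="longest",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    return {
        "query_input_ids": batch["input_ids"],
        "query_attention_mask": batch["attention_mask"],
    }
```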

I'm planning to look into it in the coming days, if you can wait a bit.