illuin-tech / colpali

The code used to train and run inference with the ColPali architecture.
https://huggingface.co/vidore
MIT License

Question about multiple page retrieval #130

Closed kubni closed 5 days ago

kubni commented 1 week ago

Hello. I use ColQwen2 for document retrieval. Let's say that I have an 80-page document and that one of those pages has a title on it that says "Institutional agencies". Under the title, various agencies are listed. Then, on the next 2 pages, more agencies are listed.

However, when I run a query like "List me all of the institutional agencies", ColQwen2 accurately retrieves the first page I mentioned but doesn't retrieve the other 2, probably because they don't mention "Institutional agencies" explicitly.

How can I approach this problem? I could manually send a couple of pages after the retrieved one along with it to the LLM, but that would just be hardcoding a fix for this specific instance of the problem.

ManuelFay commented 5 days ago

That's a common problem in information retrieval... I am looking into multi-page embeddings. In text, it's common to concatenate some metadata (e.g. the table of contents) to the paragraph, but here it's less trivial. You might be able to modify the text prompt of the image embedding and add some metadata, but I'm not sure it will work well if you don't retrain.

TL;DR: working on it, but it's non-trivial as is.
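
For the text-only case mentioned above, here is a minimal sketch of concatenating metadata to each paragraph before embedding it. It assumes the paragraphs arrive as `(title, text)` pairs where `title` is `None` on continuation paragraphs; the function name `prepend_section_titles` and the `[Section: ...]` prefix format are hypothetical choices for illustration, not part of ColPali or any library.

```python
from typing import List, Optional, Tuple


def prepend_section_titles(
    paragraphs: List[Tuple[Optional[str], str]],
) -> List[str]:
    """Carry the most recent section title forward onto untitled paragraphs.

    Each input item is (title, text); title is None when the paragraph
    continues the previous section. The output strings are what would be
    passed to a text embedding model.
    """
    current_title: Optional[str] = None
    enriched: List[str] = []
    for title, text in paragraphs:
        if title is not None:
            current_title = title  # a new section starts here
        if current_title is None:
            enriched.append(text)  # no title seen yet, leave text unchanged
        else:
            enriched.append(f"[Section: {current_title}]\n{text}")
    return enriched


if __name__ == "__main__":
    paragraphs = [
        ("Institutional agencies", "Agency A, Agency B"),
        (None, "Agency C, Agency D"),
        (None, "Agency E"),
    ]
    for chunk in prepend_section_titles(paragraphs):
        print(chunk)
        print("---")
```

With this, the continuation paragraphs also carry the "Institutional agencies" title, so a query on the title can match them. As said above, the same trick does not transfer directly to page images.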

kubni commented 4 days ago

@ManuelFay I found a paper from a couple of days ago called "M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding". arXiv link: Link. It uses ColPali. Do you think it could be used to solve this problem? The code isn't available yet, but their site says it's coming soon.

ManuelFay commented 4 days ago

Yeah, saw it! I mean, you don't really need the code; it's basically just sending the top pages to a VLM. What you could do in your case is always send the page detected by ColPali plus the 3 pages before and after it to give context, and send everything to a VLM. That would work. Check out the notebooks on my GitHub (tutorials).
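
A minimal sketch of that neighbor-page expansion, assuming the retriever returns 0-based page indices and the total page count is known. The function name `expand_with_neighbors` and its parameters are hypothetical; nothing here is part of the ColPali codebase, and the retrieval and VLM calls themselves are left out.

```python
from typing import Iterable, List


def expand_with_neighbors(
    hit_pages: Iterable[int],
    num_pages: int,
    before: int = 3,
    after: int = 3,
) -> List[int]:
    """Return sorted, de-duplicated page indices to send to the VLM.

    hit_pages: 0-based indices of the pages returned by the retriever.
    num_pages: total number of pages in the document.
    before/after: how many neighboring pages to add on each side.
    """
    if before < 0 or after < 0:
        raise ValueError("before and after must be non-negative")
    selected = set()
    for page in hit_pages:
        if not 0 <= page < num_pages:
            raise IndexError(f"page {page} is outside a {num_pages}-page document")
        start = max(0, page - before)            # clamp at the first page
        end = min(num_pages - 1, page + after)   # clamp at the last page
        selected.update(range(start, end + 1))
    return sorted(selected)


if __name__ == "__main__":
    # One hit on page index 10 of an 80-page document.
    print(expand_with_neighbors([10], num_pages=80))
    # Only the following pages, no preceding ones.
    print(expand_with_neighbors([10], num_pages=80, before=0, after=3))
```

Windows around nearby hits are merged, so each page is sent once and in document order. Setting `before=0` keeps only the pages that follow the hit.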

kubni commented 3 days ago

Yeah, that's the way I'm doing it right now. So far, though, I've found it sufficient to send just the 3 pages after the page retrieved by ColPali / ColQwen.