Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models — pyvespa documentation

"Vespa 🤝 ColPali: Efficient Document Retrieval with Vision Language Models

This notebook demonstrates how to represent ColPali in Vespa. ColPali is a powerful vision language model that can generate embeddings for images and text. In this notebook, we use ColPali to generate embeddings for images of PDF pages and store them in Vespa. We also store the base64-encoded image of each PDF page and some metadata, such as title and URL. We then demonstrate how to retrieve the PDF pages using the embeddings generated by ColPali.

ColPali: Efficient Document Retrieval with Vision Language Models, by Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo

ColPali is a combination of ColBERT and PaliGemma:

ColPali is enabled by the latest advances in Vision Language Models, notably the PaliGemma model from the Google Zürich team, and leverages multi-vector retrieval through late interaction mechanisms as proposed in ColBERT by Omar Khattab.

Quote from ColPali: Efficient Document Retrieval with Vision Language Models 👀

The ColPali model achieves remarkable retrieval performance on the ViDoRe (Visual Document Retrieval) Benchmark, beating complex pipelines with a single model.

The TLDR of this notebook:

Generate an image per PDF page using pdf2image and also extract the text using pypdf. For each page image, use ColPali to obtain the visual multi-vector embeddings. We then store the ColBERT-style embeddings in Vespa using the long-context variant, where the embeddings of each document are represented by a tensor of type tensor(page{}, patch{}, v[128]). This enables us to use the PDF as the document (retrievable unit), storing the embeddings of all its pages in the same document.
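
To make the tensor layout concrete, here is a minimal sketch of how per-page patch embeddings could be arranged into Vespa's JSON "blocks" form for a mixed tensor of type tensor(page{}, patch{}, v[128]). The helper name to_page_patch_blocks and the toy input are hypothetical, and the notebook itself may build the feed differently (for example with another cell precision or feed format).

```python
from typing import Dict, List

Vector = List[float]


def to_page_patch_blocks(pages: List[List[Vector]], dim: int = 128) -> Dict[str, list]:
    """Arrange per-page patch vectors as mixed-tensor blocks.

    pages[p][i] is the embedding of patch i on page p. Each block addresses
    one (page, patch) pair in the mapped dimensions and carries the dense
    v[dim] values for that pair.
    """
    blocks = []
    for page_idx, patches in enumerate(pages):
        for patch_idx, vector in enumerate(patches):
            if len(vector) != dim:
                raise ValueError(f"expected {dim} values, got {len(vector)}")
            blocks.append(
                {
                    # Mapped-dimension labels are strings in the JSON form.
                    "address": {"page": str(page_idx), "patch": str(patch_idx)},
                    "values": [float(x) for x in vector],
                }
            )
    return {"blocks": blocks}


# Toy input (assumption): a two-page PDF with 2 and 1 patches.
# Real ColPali output has far more patches per page.
toy_pages = [
    [[0.1] * 128, [0.2] * 128],
    [[0.3] * 128],
]
embedding_field = to_page_patch_blocks(toy_pages)
```

Because page and patch are both mapped (sparse) dimensions, PDFs with different page counts fit the same field type without padding.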

We also store the base64-encoded image and page metadata, such as title and URL, so that we can display them on the result page and also use them for RAG with powerful LLMs that have vision capabilities.
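
As a rough illustration of the stored fields, the sketch below base64-encodes the raw bytes of a page image and bundles them with the title and URL. The function name build_page_fields and the field names title, url and image are assumptions made for this example, not necessarily the schema used in the notebook.

```python
import base64
import json


def build_page_fields(image_bytes: bytes, title: str, url: str) -> dict:
    """Bundle page metadata with a base64-encoded image (hypothetical field names)."""
    return {
        "title": title,
        "url": url,
        # base64 output is ASCII, so it can be embedded in a JSON document.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }


# Stand-in bytes (assumption): a PNG signature plus padding, not a valid image.
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
fields = build_page_fields(fake_png, "Example PDF", "https://example.com/doc.pdf")
payload = json.dumps(fields)  # serializes cleanly because every value is a str
```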

At query time, we retrieve using BM25 over all the text from all pages, then use the ColPali embeddings to rerank the results using the max page score."
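
The reranking step can be sketched in plain Python, assuming the usual ColBERT-style late interaction (MaxSim): for each query token vector, take its best dot product over a page's patches, sum those maxima to get the page score, and score the PDF by its best page. The function names and the tiny 2-dimensional vectors are illustrative only; in Vespa this computation would be written as a rank expression rather than application code, and the BM25 first phase is not shown.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def page_max_sim(query_tokens: List[Vector], patches: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching patch similarity on one page."""
    return sum(max(dot(q, p) for p in patches) for q in query_tokens)


def document_score(query_tokens: List[Vector], pages: List[List[Vector]]) -> float:
    """Score a PDF by its highest-scoring page (the max page score)."""
    return max(page_max_sim(query_tokens, patches) for patches in pages)


# Toy data (assumption): 2-dimensional vectors instead of 128-dimensional ones.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[1.0, 0.0], [0.5, 0.5]],  # page 0
    [[0.0, 1.0], [1.0, 0.0]],  # page 1
]
page_scores = [page_max_sim(query, patches) for patches in pages]
best = document_score(query, pages)
```

Taking the maximum over pages means a PDF ranks highly when any single page matches the query well, regardless of how many other pages it contains.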
