illuin-tech / colpali

The code used to train and run inference with the ColPali architecture.
https://huggingface.co/vidore
MIT License
851 stars 75 forks

A question related to the dataset refactor #70

XMHZZ2018 closed 3 weeks ago

XMHZZ2018 commented 3 weeks ago

Thanks team for the great work! I have a quick question about refactoring Document VQA into a retrieval task. For queries like "What is the table number?" or "What is plotted along the x-axis?", how can we transform these into a meaningful retrieval setup? I might be missing something here. Thanks!

ManuelFay commented 3 weeks ago

Hey! Yeah, these questions in DocVQA will be fairly impossible to get right for sure (although a good model will make sure the top retrieved documents at least contain a plot with an x-axis and are relevant to the question, for instance). In the majority of cases, questions are much less ambiguous, so we thought there was still a clear and interesting signal in this type of repurposed dataset. One thing you can do to filter out questions that are too ambiguous is to check whether a given doc is within the top 100 retrieved contexts for its question; if it's not, you can consider the question impossible and filter the pair out.
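For anyone who wants to try the top-100 filter described above, here is a minimal sketch in plain Python. It is not code from the colpali repository. The names `gold_rank` and `keep_answerable`, the score-matrix layout (one row of query-document scores per question), and the tie handling are all assumptions made for illustration. In practice the scores would come from whatever retriever you use.

```python
from typing import List, Sequence


def gold_rank(scores: Sequence[float], gold_idx: int) -> int:
    """Return the 0-based rank of the gold document (0 = best score).

    Assumption: higher score means more relevant. Ties are resolved in
    favour of the gold document, since only strictly higher scores count.
    """
    gold_score = scores[gold_idx]
    return sum(1 for s in scores if s > gold_score)


def keep_answerable(
    score_matrix: Sequence[Sequence[float]],
    gold_indices: Sequence[int],
    k: int = 100,
) -> List[int]:
    """Return the indices of questions whose gold document is in the top k.

    score_matrix[q][d] is the (hypothetical) retriever score of document d
    for question q; gold_indices[q] is the document the question came from.
    """
    kept = []
    for q_idx, (scores, gold_idx) in enumerate(zip(score_matrix, gold_indices)):
        if gold_rank(scores, gold_idx) < k:
            kept.append(q_idx)
    return kept


# Toy example with 3 questions, 3 documents, and k=2 instead of 100.
toy_scores = [
    [0.9, 0.1, 0.3],  # gold doc 0 is ranked first: kept
    [0.2, 0.8, 0.5],  # gold doc 0 is ranked last: filtered out
    [0.1, 0.2, 0.7],  # gold doc 2 is ranked first: kept
]
toy_gold = [0, 0, 2]
kept_questions = keep_answerable(toy_scores, toy_gold, k=2)
```

Questions whose index is not returned would be treated as too ambiguous and dropped along with their document pair. The cutoff `k` is a judgment call; 100 is simply the value suggested in the comment above.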