illuin-tech / vidore-benchmark

Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.
https://huggingface.co/vidore
MIT License
Reproducing the Results in Table 2 #51

Open roipony opened 2 weeks ago

roipony commented 2 weeks ago

The provided datasets come in four variants, each serving a specific purpose and each handling the text_description field differently, as described below. For example, for the government reports dataset:

  1. syntheticDocQA_government_reports_test – No text_description
  2. syntheticDocQA_government_reports_test_tesseract – Full-page OCR (including textual elements, figures, images, and tables)
  3. syntheticDocQA_government_reports_test_ocr_chunk – OCR/programmatic text extraction for textual elements and OCR applied to visuals
  4. syntheticDocQA_government_reports_test_captioning – OCR/programmatic text extraction for textual elements with VLM captioning on visuals

In Table 2, the Unstructured type experiments are described as:
"In our simplest Unstructured configuration (text-only), only textual elements are retained, while figures, images, and tables are treated as noise and filtered out" (Section 3.2).
However, the tesseract dataset variant performs full-page OCR, including both textual and visual elements.
The dataset variant most aligned with the Unstructured description is ocr_chunk, as it preserves only the text in the chunk_type column.
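To make that reading of the text-only configuration concrete, here is a minimal sketch of the filtering I have in mind. It assumes each row of the ocr_chunk variant behaves like a dict with chunk_type and text_description keys; the label value "text" and the sample rows are hypothetical and may not match the actual dataset contents.

```python
def keep_text_chunks(rows, text_label="text"):
    """Keep only rows whose chunk_type equals the (assumed) text label."""
    return [row for row in rows if row.get("chunk_type") == text_label]


# Hypothetical rows mimicking the columns mentioned above.
sample_rows = [
    {"chunk_type": "text", "text_description": "Annual budget overview."},
    {"chunk_type": "table", "text_description": "OCR of a table."},
    {"chunk_type": "image", "text_description": "OCR of a figure."},
]

text_only = keep_text_chunks(sample_rows)
print([row["text_description"] for row in text_only])
```

If the real labels differ, only the text_label argument would need to change.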

Can you please clarify how to reproduce the experiments reported in Table 2, specifically for the Unstructured and Unstructured+ types?

For example, I ran the following command, following the README.md ("If you want to evaluate a retriever that relies on pure-text retrieval (no visual embeddings), you should use the datasets from the..."):

vidore-benchmark evaluate-retriever \
    --model-name BAAI/bge-m3 \
    --dataset-name vidore/syntheticDocQA_government_reports_test_tesseract \
    --split test

This returns: NDCG@5 for BAAI/bge-m3 on vidore/syntheticDocQA_government_reports_test_tesseract: 0.82917

However, Table 2 in the paper reports a value of 77.7.
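To put the two numbers on the same scale, here is a small sketch of the comparison. It assumes the value in Table 2 is NDCG@5 multiplied by 100, which is my reading of the paper rather than something the tool states; the helper name to_percent is my own.

```python
def to_percent(score):
    """Convert a 0-1 score to a 0-100 scale, rounded to one decimal."""
    return round(score * 100, 1)


cli_ndcg_at_5 = 0.82917   # value printed by the command above
paper_value = 77.7        # value reported in Table 2 (assumed 0-100 scale)

cli_percent = to_percent(cli_ndcg_at_5)
gap = round(cli_percent - paper_value, 1)
print(f"CLI: {cli_percent}, paper: {paper_value}, gap: {gap}")
```

Under that assumption, the tesseract run scores about 5 points higher than the reported figure.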

I assume this discrepancy occurs because the tesseract variant uses OCR for the entire page, including visual elements, whereas the Unstructured configuration processes only textual elements.

Thank you.

ManuelFay commented 4 days ago

@HuguesSib ?