The provided datasets have four variants, each serving a specific purpose, and contain a `text_description` column, as described below (e.g., for the government reports dataset):

- `syntheticDocQA_government_reports_test_ocr_chunk` – OCR/programmatic text extraction for textual elements and OCR applied to visuals
- `syntheticDocQA_government_reports_test_captioning` – OCR/programmatic text extraction for textual elements with VLM captioning on visuals
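
For reference, a quick way to check what a given variant actually contains is to load it and print its columns. This is only a minimal sketch using the Hugging Face `datasets` library; the dataset id is composed from the names above, and the column names (`text_description`, `chunk_type`) come from this description rather than a verified schema:

```python
from datasets import load_dataset

# Sketch: inspect the ocr_chunk variant of the government reports test set.
# Dataset id and column names are assumptions based on the description above.
ds = load_dataset("vidore/syntheticDocQA_government_reports_test_ocr_chunk", split="test")

print(ds.column_names)               # which columns are actually present
row = ds[0]
print(row.get("text_description"))   # extracted page text (assumed column name)
print(row.get("chunk_type"))         # element-type metadata (assumed column name)
```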
In Table 2, the Unstructured type experiments are described as: "In our simplest Unstructured configuration (text-only), only textual elements are retained, while figures, images, and tables are treated as noise and filtered out" (Section 3.2).
However, the `tesseract` dataset variant performs full-page OCR, including both textual and visual elements.
The dataset variant most aligned with the Unstructured description is `ocr_chunk`, as it preserves only the text in the `chunk_type` column.
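
If the `chunk_type` metadata does distinguish textual from visual elements, one could in principle approximate the Unstructured (text-only) setting by dropping visual chunks before indexing. The sketch below only illustrates the idea; the per-chunk structure and the type labels are hypothetical, not the dataset's confirmed schema:

```python
# Hypothetical per-chunk structure: a list of {"chunk_type": ..., "text": ...} dicts.
VISUAL_TYPES = {"Image", "Figure", "Table"}

def text_only_passage(chunks):
    """Keep textual chunks and drop visual ones, mirroring the Unstructured setup."""
    kept = [c["text"] for c in chunks if c.get("chunk_type") not in VISUAL_TYPES]
    return "\n".join(kept)

# Example with made-up chunks:
page_chunks = [
    {"chunk_type": "Title", "text": "Government Report 2021"},
    {"chunk_type": "Text", "text": "Spending increased by 4% year over year."},
    {"chunk_type": "Image", "text": "OCR output from an embedded chart"},
]
print(text_only_passage(page_chunks))
```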
Can you please clarify how to reproduce the experiments reported in Table 2, specifically for the Unstructured and Unstructured+ types?
For example, running the evaluation command described in the `README.md` ("If you want to evaluate a retriever that relies on pure-text retrieval (no visual embeddings), you should use the datasets from the...") returns:
NDCG@5 for BAAI/bge-m3 on vidore/syntheticDocQA_government_reports_test_tesseract: 0.82917
However, Table 2 in the paper reports a value of 77.7 (i.e., 0.777 on the same 0–1 scale).
I assume this discrepancy occurs because the `tesseract` variant uses OCR for the entire page, including visual elements, whereas the Unstructured configuration processes only textual elements.
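
For concreteness, a text-only NDCG@5 evaluation along these lines might look like the sketch below. It is not the vidore-benchmark CLI itself; the column names (`query`, `text_description`) and the assumption that each query has exactly one relevant passage are guesses about the dataset schema, and BGE-M3 is used dense-only via sentence-transformers:

```python
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the text-only variant and the BGE-M3 model (dense embeddings only).
ds = load_dataset("vidore/syntheticDocQA_government_reports_test_ocr_chunk", split="test")
model = SentenceTransformer("BAAI/bge-m3")

queries = [row["query"] for row in ds]              # assumed column name
passages = [row["text_description"] for row in ds]  # assumed column name

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = q_emb @ p_emb.T  # cosine similarities (embeddings are L2-normalized)

# Binary-relevance NDCG@5, assuming query i's only relevant passage is passage i.
ndcg = []
for i in range(len(queries)):
    top5 = np.argsort(-scores[i])[:5]
    hit = np.where(top5 == i)[0]
    ndcg.append(1.0 / np.log2(hit[0] + 2) if hit.size else 0.0)

print(f"NDCG@5: {np.mean(ndcg):.5f}")
```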
Thank you.