Normal body text parsed as headers

pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0

285 stars, 55 forks

Normal body text parsed as headers #143

Closed tanchangsheng closed 1 day ago

tanchangsheng commented 1 day ago

Normal body text has been parsed as headers.

file: example.pdf

# IT Inventory System Tender Specifications

### Organization: XYZ Solutions Inc. Date: April 23, 2024

JorjMcKie commented 1 day ago

This is not a bug. The header identification algorithm determines the most frequent font size and treats it as the body text size. Everything smaller is also treated as body text. The largest font sizes, up to six of them, are treated as headers h1 to h6. Any font size larger than body text but smaller than the h6 font size is treated like h6. We all know that this algorithm is only an approximation of any document's truth. Use your own logic if you cannot agree with this approach.
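To make the heuristic described above concrete, here is a minimal pure-Python sketch. It is not the library's actual code: the function names `header_levels` and `prefix_for`, the rounding of font sizes, and the sample font-size counts are all assumptions made for illustration. It only mirrors the rules stated in the comment: the most frequent size is body text, anything smaller is body text, larger sizes map to h1 to h6 from largest down, and any extra sizes collapse into h6.

```python
from collections import Counter


def header_levels(font_sizes, max_levels=6):
    """Return (body_size, mapping) for a sequence of font sizes.

    font_sizes: one entry per piece of text (hypothetical input format).
    mapping: font size -> Markdown header prefix such as "## ".
    """
    # Round so that e.g. 11.8 and 12.1 count as the same size (an assumption).
    counts = Counter(round(size) for size in font_sizes)
    if not counts:
        return None, {}

    # The most frequent font size is taken to be the body text size.
    body_size = counts.most_common(1)[0][0]

    # Only sizes larger than body text can become headers, largest first.
    larger = sorted((s for s in counts if s > body_size), reverse=True)

    mapping = {}
    for index, size in enumerate(larger):
        # Sizes beyond the first max_levels are treated like the last level.
        level = min(index + 1, max_levels)
        mapping[size] = "#" * level + " "
    return body_size, mapping


def prefix_for(size, mapping):
    """Header prefix for a font size, or "" if it counts as body text."""
    return mapping.get(round(size), "")


# Made-up distribution: lots of 10pt text (say, table cells), some 12pt
# paragraphs, a few 14pt lines and a single 20pt title.
sizes = [10] * 50 + [12] * 20 + [14] * 5 + [20]
body, mapping = header_levels(sizes)
print(body, mapping)
```

With this made-up input, the 12pt paragraphs sit above the most frequent size (10pt), so they receive a header prefix even though a human reader would call them body text. That is one way output like the one reported here can arise, though the actual cause in `example.pdf` is not confirmed by this sketch.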

tanchangsheng commented 21 hours ago

Thanks for the clarification!