pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0
539 stars 82 forks

Multiple lines parsed as single line #144

Closed: tanchangsheng closed this issue 2 months ago

tanchangsheng commented 2 months ago

Text that sits on two separate lines in the PDF is parsed as a single line.

file: example.pdf

Expected output:

**Organization:** XYZ Solutions Inc.  
**Date:** April 23, 2024

Actual output:

### Organization: XYZ Solutions Inc. Date: April 23, 2024
JorjMcKie commented 2 months ago

Not a bug. There is no indication whatsoever from which to conclude a paragraph break. So the logic behaves correctly in assuming continuous text.

JorjMcKie commented 2 months ago

Also, that font size evidently fell within the range of header font sizes. If you do not like that logic, either switch it off altogether (hdr_info=False) or supply your own. In any case, the two lines follow each other with no extraneous spacing, so no extra line break is generated.
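
For readers landing on this issue, here is a minimal sketch of the two options mentioned above: disabling header detection with hdr_info=False, or passing your own callable. It assumes the callable form of hdr_info, which receives a span dictionary and a `page` keyword and returns either an empty string or a Markdown header prefix. The function names and the font-size thresholds (20 and 16 points) are hypothetical and only illustrate the idea.

```python
def size_based_header(span, page=None):
    """Hypothetical header rule: map a span's font size to a Markdown prefix.

    `span` is a text span dictionary; only its "size" key is used here.
    Returns "" for body text, otherwise '#' characters followed by one space.
    """
    size = span["size"]
    if size >= 20:  # assumed threshold for top-level headers
        return "# "
    if size >= 16:  # assumed threshold for second-level headers
        return "## "
    return ""       # everything else is treated as body text


def pdf_to_markdown(path, detect_headers=True):
    """Convert a PDF to Markdown, with custom or disabled header detection."""
    import pymupdf4llm  # imported here so the rule above is usable on its own

    # hdr_info=False switches header detection off altogether.
    hdr_info = size_based_header if detect_headers else False
    return pymupdf4llm.to_markdown(path, hdr_info=hdr_info)
```

Calling `pdf_to_markdown("example.pdf", detect_headers=False)` should stop the text from being promoted to a `###` header. It does not by itself introduce a line break between the two lines, since that depends on the spacing in the PDF.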

tanchangsheng commented 2 months ago

Thanks for the clarification!