pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Issue with headings since version 0.0.5 #30

Closed papipsycho closed 5 months ago

papipsycho commented 5 months ago

Hello,

We noticed an issue: the headings of the PDF were extracted correctly in version 0.0.3, but since version 0.0.5 the output no longer contains any headings.

JorjMcKie commented 5 months ago

The to_markdown method supports a new parameter, margins, which defaults to (0, 50, 0, 50). This default assumes a top and bottom page border of 50 points each, so text inside those borders is skipped. If you use margins=0, you should get the previous behavior.
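To make the effect of such a margin concrete, here is a minimal pure-Python sketch. It does not use pymupdf4llm, and it is not the library's implementation. It assumes the tuple order is (left, top, right, bottom), which matches the "top and bottom page border of 50 points" description, that coordinates start at the top-left corner of the page, and that the page is 612 x 792 points. The helper names and the bounding boxes are hypothetical.

```python
from typing import Sequence, Tuple, Union

Margins = Union[int, float, Sequence[float]]
Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def normalize_margins(margins: Margins) -> Rect:
    """Expand a single number to four equal margins (assumed order: left, top, right, bottom)."""
    if isinstance(margins, (int, float)):
        return (float(margins),) * 4
    if len(margins) != 4:
        raise ValueError("expected a single number or four values")
    return tuple(float(m) for m in margins)


def content_rect(page_width: float, page_height: float, margins: Margins) -> Rect:
    """Return the page area that remains after the margins are cut away."""
    left, top, right, bottom = normalize_margins(margins)
    return (left, top, page_width - right, page_height - bottom)


def is_inside(bbox: Rect, rect: Rect) -> bool:
    """True if bbox lies completely within rect."""
    return (
        bbox[0] >= rect[0]
        and bbox[1] >= rect[1]
        and bbox[2] <= rect[2]
        and bbox[3] <= rect[3]
    )


# Hypothetical text boxes: a page header near the top edge and a body line.
header_bbox = (72.0, 20.0, 540.0, 40.0)
body_bbox = (72.0, 80.0, 540.0, 100.0)

default_area = content_rect(612, 792, (0, 50, 0, 50))
full_area = content_rect(612, 792, 0)

print(is_inside(header_bbox, default_area))  # header lies in the top 50 points
print(is_inside(header_bbox, full_area))
print(is_inside(body_bbox, default_area))
```

Under these assumptions, a heading placed in the top 50 points of the page falls outside the default area, which would explain why it disappears unless margins=0 is passed.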

papipsycho commented 5 months ago

I've tested it, but unfortunately the issue is the same.

I will try to create a test PDF for you.

JorjMcKie commented 5 months ago

Could not reproduce the problem. Here is an example demonstrating successful use. The PDF looks like this:

image

Default extraction output (i.e. margins=(0, 50, 0, 50)) omits the page header:

import pymupdf4llm

data = pymupdf4llm.to_markdown("v110-changes.pdf")
print(data[:500])

# Pixmap

The alpha channel is now optional. Its presence is controlled by a new boolean parameter (called `alpha` ). This
has the following consequences:

Setting margins to 0 delivers the full page:

data = pymupdf4llm.to_markdown("v110-changes.pdf", margins=0)
print(data[:500])

**MuPDF v1.10 Changes and their Implications for PyMuPDF**

# Pixmap

The alpha channel is now optional. Its presence is controlled by a new boolean parameter (called `alpha` ). This
has the following consequences:

JorjMcKie commented 5 months ago

Closed for lack of response over an extended period of time.

Elehiggle commented 5 months ago

I was wondering why my code wasn't functioning as expected anymore. I had test PDFs that contain text and images, and all of a sudden no text was extracted anymore. margins=0 fixed it. I understand that this is a very early package, but maybe semver or better documentation of breaking changes would be a good idea 👍 Test PDF: hey_image(1).pdf
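For anyone hitting the same surprise, one way to make code less sensitive to a changed default is to log the installed version and pass margins explicitly. The following is only a sketch: installed_version and to_markdown_full_page are hypothetical helper names, and the wrapper assumes a pymupdf4llm version that already accepts the margins parameter (an older release without it would presumably reject the keyword).

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def to_markdown_full_page(pdf_path: str) -> str:
    """Extract Markdown with margins=0 so page headers and footers are kept.

    Hypothetical wrapper: the import is deferred so that this module still
    loads on machines where pymupdf4llm is not installed.
    """
    import pymupdf4llm  # deferred import, only needed when the function is called

    return pymupdf4llm.to_markdown(pdf_path, margins=0)


if __name__ == "__main__":
    # Printing the version makes it easier to relate a behavior change to an upgrade.
    print("pymupdf4llm version:", installed_version("pymupdf4llm"))
```

Pinning the exact package version in the project's requirements is a complementary option until the package documents its breaking changes.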