issue with Heading since version 0.0.5

papipsycho commented 5 months ago

Hello,

We noticed an issue: the heading of the PDF was extracted correctly in version 0.0.3, but since version 0.0.5 the output no longer contains any heading.

JorjMcKie commented 5 months ago

There is a new parameter, margins=(0, 50, 0, 50), supported by the to_markdown method. Its default assumes a top and bottom page border of 50 points each, so text inside those borders is skipped. If you use margins=0, you should get the previous behavior.
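To make the effect of the default concrete, here is a minimal sketch of how such margins shrink the area that text is read from. It is not the library's internal code. The helper names `content_box` and `inside` are hypothetical, and the sketch assumes the four values are ordered (left, top, right, bottom), that a single number applies to all four sides, and that y grows downward from the top edge of the page.

```python
def content_box(page_width, page_height, margins=(0, 50, 0, 50)):
    """Return (x0, y0, x1, y1) of the page area left after removing margins.

    Assumption: margins are ordered (left, top, right, bottom); a single
    number is applied to all four sides.
    """
    if isinstance(margins, (int, float)):
        margins = (margins,) * 4
    left, top, right, bottom = margins
    return (left, top, page_width - right, page_height - bottom)


def inside(box, x, y):
    """Check whether the point (x, y) lies within the box."""
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


# A US Letter page measures 612 x 792 points.
default_box = content_box(612, 792)           # (0, 50, 612, 742)
full_box = content_box(612, 792, margins=0)   # (0, 0, 612, 792)

# Hypothetical header position: 30 points below the top edge.
header_pos = (72, 30)
print(inside(default_box, *header_pos))  # False: falls in the top border
print(inside(full_box, *header_pos))     # True: margins=0 keeps the full page
```

Under these assumptions, a header printed within the top 50 points of the page lies outside the default box, which would explain why it disappears unless margins=0 is passed.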

papipsycho commented 5 months ago

I've tested it, but unfortunately the issue is the same.

I will try to create a test PDF for you.

JorjMcKie commented 5 months ago

Could not reproduce the problem. Here is an example demonstrating successful use.

Default extraction output (i.e. margins=(0, 50, 0, 50)) omits the page header:

import pymupdf4llm

data = pymupdf4llm.to_markdown("v110-changes.pdf")
print(data[:500])

# Pixmap

The alpha channel is now optional. Its presence is controlled by a new boolean parameter (called `alpha` ). This
has the following consequences:

Setting margins to 0 delivers the full page:

data = pymupdf4llm.to_markdown("v110-changes.pdf", margins=0)
print(data[:500])

**MuPDF v1.10 Changes and their Implications for PyMuPDF**

# Pixmap

The alpha channel is now optional. Its presence is controlled by a new boolean parameter (called `alpha` ). This
has the following consequences:
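To check exactly which text the default margins remove from a document, the two outputs can be compared line by line. The following is a minimal sketch using only the standard library. The helper name `lines_dropped_by_margins` is hypothetical, and the two sample strings are shortened versions of the outputs shown above.

```python
def lines_dropped_by_margins(cropped_md, full_md):
    """Return non-empty lines that occur in full_md but not in cropped_md."""
    kept = {line.strip() for line in cropped_md.splitlines()}
    return [
        line.strip()
        for line in full_md.splitlines()
        if line.strip() and line.strip() not in kept
    ]


# Shortened stand-ins for the default output and the margins=0 output.
cropped = "# Pixmap\n\nThe alpha channel is now optional."
full = (
    "**MuPDF v1.10 Changes and their Implications for PyMuPDF**\n\n"
    "# Pixmap\n\nThe alpha channel is now optional."
)

print(lines_dropped_by_margins(cropped, full))
# ['**MuPDF v1.10 Changes and their Implications for PyMuPDF**']
```

With a real file, the two arguments would be the results of pymupdf4llm.to_markdown(path) and pymupdf4llm.to_markdown(path, margins=0). Line wrapping can differ between the two runs, so an exact line comparison like this one is only a rough check.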

JorjMcKie commented 5 months ago

Closed for lack of response over an extended period of time.

Elehiggle commented 5 months ago

I was wondering why my code wasn't functioning as expected anymore. I had test PDFs that contain text and images, and all of a sudden it would not extract the text anymore. margins=0 fixed it. I understand that this is a very early package, but maybe semver or better documentation on breaking changes could be a good idea 👍

Test PDF: hey_image(1).pdf

pymupdf / RAG

issue with Heading since version 0.0.5 #30