pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
4.65k stars 460 forks

`Story.write_with_links` will ignore everything after the first "page break" in the HTML. #2753

Closed. Julynx closed this 7 months ago

Julynx commented 9 months ago

Describe the bug

Story.write and Story.write_with_links ignore everything after the first <p style="page-break-after: always;"></p> in the HTML.

To Reproduce

Given the following sample HTML content:

<p>Before</p>

<p style="page-break-after: always;"></p>

<p>After</p>
  1. Create a story from the HTML content

    story = fitz.Story(html=html_content,
                   archive=".")
  2. Write the story to the document and save it to a PDF file

    document = story.write_with_links(rectfn)
    document.save(file_path)
    document.close()
  3. The PDF contains only:

Page 1/1

Before

Expected behavior

Either two pages:

Page 1/2

Before

Page 2/2

After

Or, at the very least, a single page containing both paragraphs:

Page 1/1

Before
After

In no case should it just drop everything after the page break if the HTML happens to have one, which is the behavior I observe.

Screenshots

Notice how the PDF contains only "Before". There is no "After" text and only one page, despite the HTML having the "After" paragraph after the page break. The code used is the one detailed in "To Reproduce".

Your configuration

julian-smith-artifex-com commented 8 months ago

Interestingly, the "After" text actually appears to be in the PDF (e.g. it is returned by Page.get_text()), but it is not displayed.

julian-smith-artifex-com commented 8 months ago

I have created a MuPDF bug for this: https://bugs.ghostscript.com/show_bug.cgi?id=707323

julian-smith-artifex-com commented 8 months ago

This is now fixed in MuPDF master.

julian-smith-artifex-com commented 8 months ago

tests/test_story.py:test_2753() checks that this bug is fixed in PyMuPDF-1.23.7, so I am marking this as fixed in the next release.

Julynx commented 8 months ago

That's awesome. Thanks, guys!

julian-smith-artifex-com commented 7 months ago

Fixed in 1.23.7.