pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

`Page.get_text` results in `AssertionError` for epub files #3687

Closed arun-mani-j closed 1 month ago

arun-mani-j commented 1 month ago

Description of the bug

`Page.get_text` raises `AssertionError` on epub files for every option except "blocks" and "words". However, calling the corresponding `TextPage` methods directly works fine.

I think this happens only in 1.24.7. My distribution's package of 1.23.7 does not raise this error.

How to reproduce the bug

  1. Download an epub file. For context, I was able to reproduce the bug with https://www.gutenberg.org/ebooks/73987.
  2. Run the following code.
    
    import pymupdf

    doc = pymupdf.open("/home/arun-mani-j/Downloads/test.epub")
    p = doc[0]
    p.get_text("text")

    AssertionError                            Traceback (most recent call last)
    ----> 1 p.get_text("text")

    ~/Projects/aayra/lib/python3.12/site-packages/pymupdf/utils.py in ?(page, option, clip, flags, textpage, sort, delimiters)
        794     if clip is not None:
        795         clip = pymupdf.Rect(clip)
        796         cb = None
        797     elif type(page) is pymupdf.Page:
    --> 798         cb = page.cropbox
        799     # pymupdf.TextPage with or without images
        800     tp = textpage
        801     #pymupdf.exception_info()

    ~/Projects/aayra/lib/python3.12/site-packages/pymupdf/__init__.py in ?(self)
       8531     @property
       8532     def cropbox(self):
       8533         """The CropBox."""
       8534         CheckParent(self)
    -> 8535         page = self._pdf_page()
       8536         if not page.m_internal:
       8537             val = mupdf.fz_bound_page(self.this)
       8538         else:

    ~/Projects/aayra/lib/python3.12/site-packages/pymupdf/__init__.py in ?(self)
       8050     def _pdf_page(self):
    -> 8051         return _as_pdf_page(self.this)

    ~/Projects/aayra/lib/python3.12/site-packages/pymupdf/__init__.py in ?(page, required)
        333         return page
        334     elif isinstance(page, mupdf.FzPage):
        335         ret = mupdf.pdf_page_from_fz_page(page)
        336         if required:
    --> 337             assert ret.m_internal
        338         return ret
        339     elif page is None:
        340         assert 0, f'page is None'

    AssertionError:

  3. Using `TextPage` methods directly works fine.

    tp = p.get_textpage()
    tp.extractText()  # No errors raised

  4. Using "words" or "blocks" works fine.

    p.get_text("words")
    p.get_text("blocks")
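
For anyone stuck on an affected release, step 3 suggests a possible workaround: try `get_text("text")` first and fall back to an explicit `TextPage` if the assertion fires. The following is only a sketch under that assumption. `get_plain_text` is a hypothetical helper name, not part of PyMuPDF, and it uses only the `get_textpage()` and `extractText()` calls shown above. The fallback output may differ slightly from what `get_text` would return, since no extraction flags are matched here.

```python
def get_plain_text(page):
    """Return a page's plain text, falling back to TextPage on AssertionError.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    try:
        return page.get_text("text")
    except AssertionError:
        # Path taken on the affected release for epub pages:
        # build the TextPage explicitly, as in step 3 of the report.
        return page.get_textpage().extractText()


# Usage (assumes pymupdf is installed and test.epub exists):
#   import pymupdf
#   doc = pymupdf.open("test.epub")
#   print(get_plain_text(doc[0]))
```

Catching `AssertionError` this broadly could also hide unrelated failures, so this is best kept as a temporary shim until an upgrade is possible.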

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.12

julian-smith-artifex-com commented 1 month ago

Thanks for the report.

I think this bug has already been fixed in git, but I'll check on Monday.

julian-smith-artifex-com commented 1 month ago

I have just confirmed that this is working in git, so I'll mark this issue as being fixed in the next release.

arun-mani-j commented 1 month ago

Thanks!

julian-smith-artifex-com commented 1 month ago

Fixed in 1.24.8.
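
A rough way to check whether an installed release already includes the fix is to compare its version against 1.24.8. This is a minimal sketch: `includes_fix` and `parse_version` are hypothetical helper names, and the parser assumes plain dotted numeric versions such as "1.24.7". Note that it only tests for "1.24.8 or later"; older releases like 1.23.7, which the report says were not affected, also return False.

```python
FIXED_IN = (1, 24, 8)  # release named in the maintainer's comment above


def parse_version(version):
    """Turn '1.24.7' into (1, 24, 7); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split("."))


def includes_fix(version):
    """True if the given version string is 1.24.8 or later."""
    return parse_version(version) >= FIXED_IN


# Usage (reads the installed version through the standard library):
#   from importlib.metadata import version
#   print(includes_fix(version("PyMuPDF")))
```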