`TypeError` raised by `extract_text` method with compressed PDF file

jbpenrath commented 1 year ago

Bug report

Description

I'm generating PDF documents with WeasyPrint. Since version 59.0 of that package, I have not been able to extract text from the generated compressed PDF files with the `pdfminer.high_level.extract_text` function: it raises `TypeError: invalid length: 6`. The exception is raised from a utility function called `nunpack`.

I first opened an issue on the WeasyPrint repository, but it appears that the problem may actually come from pdfminer itself.

You can take a look at the WeasyPrint maintainer's answer to understand how pdfminer is involved in this problem.

Steps to reproduce

from io import BytesIO
from pdfminer.high_level import extract_text
from weasyprint import HTML

html = HTML(string='<h1>Hello world</h1>')
document = html.write_pdf()
extract_text(BytesIO(document)) # 💥 TypeError: invalid length: 6

liZe commented 1 year ago

Here’s a simple and uncompressed PDF to reproduce the problem, in case you’d like to avoid installing another tool 😄: hello.pdf

The error is caused by the XRef table with /W [1 4 6]. The third field is encoded using 6 bytes, and it’s decoded here using nunpack, which is not designed to handle all integer sizes.
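To make the /W [1 4 6] layout concrete, here is a minimal sketch of how one cross-reference stream entry breaks down into raw fields. The helper name `split_xref_entry` and the sample entry are hypothetical and only illustrate the layout; this is not pdfminer's code.

```python
from typing import List, Sequence


def split_xref_entry(entry: bytes, widths: Sequence[int]) -> List[bytes]:
    """Split one cross-reference stream entry into raw fields.

    `widths` plays the role of the /W array, e.g. (1, 4, 6).
    Hypothetical helper for illustration, not part of pdfminer.
    """
    if len(entry) != sum(widths):
        raise ValueError(
            f"entry is {len(entry)} bytes, expected {sum(widths)}"
        )
    fields = []
    offset = 0
    for width in widths:
        fields.append(entry[offset:offset + width])
        offset += width
    return fields


# Made-up entry: type 1, byte offset 1234, third field 0 stored on 6 bytes.
entry = bytes([1]) + (1234).to_bytes(4, "big") + (0).to_bytes(6, "big")
raw_type, raw_offset, raw_third = split_xref_entry(entry, (1, 4, 6))
```

With these widths every entry is 11 bytes long, and the third field is a 6-byte integer, which is the size reported in the `invalid length: 6` message.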

Instead of using struct.unpack in nunpack, it may be useful to use int.from_bytes, which works for any integer size.
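A minimal sketch of that suggestion, assuming the fields are unsigned big-endian integers and that an empty field should fall back to a default value. The name `unpack_uint` is hypothetical, and this is not necessarily how the fix was implemented in pdfminer.

```python
def unpack_uint(data: bytes, default: int = 0) -> int:
    """Decode an unsigned big-endian integer of any byte length.

    Hypothetical stand-in for a nunpack-like helper. An empty input
    returns `default` (an assumption about the desired behavior).
    """
    if not data:
        return default
    return int.from_bytes(data, byteorder="big", signed=False)


# Works for sizes that have no struct format code, such as 3 or 6 bytes.
three_bytes = unpack_uint(b"\x01\x00\x00")
six_bytes = unpack_uint(b"\x00\x00\x00\x00\x00\x2a")
```

Because `int.from_bytes` accepts a byte string of any length, no per-size branching is needed.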

dhdaines commented 1 month ago

fixed in #1029 (and thank you for weasyprint, it is very nice software!)

pdfminer / pdfminer.six

`TypeError` raised by `extract_text` method with compressed PDF file #886