py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Text visitor example in docs does not work #2881

Open lucasgadams opened 1 month ago

lucasgadams commented 1 month ago

I am trying to figure out how to extract text based on line coordinates, using the example from https://github.com/py-pdf/pypdf/blob/main/docs/user/extract-text.md#example-1-ignore-header-and-footer with the example document. However, that does not seem to work. The y coordinates passed to the visitor don't seem correct at all, or at least I don't understand what they mean. Is the example provided no longer how the code works, or is something broken? The actual extracted text looks correct to me, but the visitor output does not.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
macOS-14.5-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.0.0, crypt_provider=('cryptography', '43.0.1'), PIL=10.4.0

Code + PDF

In [7]: from pypdf import PdfReader
   ...:
   ...: reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
   ...: page = reader.pages[3]
   ...:
   ...: parts = []
   ...:
   ...:
   ...: def visitor_body(text, cm, tm, font_dict, font_size):
   ...:     y = cm[5]
   ...:     if 50 < y < 720:
   ...:         parts.append(text)
   ...:         print(f"Adding text within coordinates: {text}")
   ...:     else:
   ...:         print(f"Skipping text out of range: {y}")
   ...:
   ...:
   ...: extracted_text = page.extract_text(visitor_text=visitor_body)
   ...: text_body = "".join(parts)
   ...: print(f"Size extracted text: {len(extracted_text)}")
   ...: print(f"Size visited text: {len(text_body)}")
   ...:
   ...:
   ...:
   ...:
Skipping text out of range: 0.0
Skipping text out of range: 0.0
Skipping text out of range: 0.0
[... same line repeated, 316 lines in total ...]
Size extracted text: 1814
Size visited text: 0

The PDF used is the one in the example, https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf

stefan6419846 commented 1 month ago

Thanks for the report. This is not specific to version 5.0; the example has apparently been broken since version 3. Using tm instead of cm fixes it in this case.

lucasgadams commented 1 month ago

Great, thanks for looking into it. Can you briefly explain to me how these matrices should be used? I've read the docs, but I am honestly still a bit confused. The docs here say "It is recommended to use the user_matrix as it takes into all transformations." (user_matrix seems to also be called cm). Then a bit later it says:

If you want to get the full transformation from text to user space, you can use the mult function (available in global import) as follows: txt2user = mult(tm, cm)). The font size is the raw text size and affected by the user_matrix.

And then here you are suggesting that we should actually be using the tm matrix and not the cm matrix at all?

My goal is to extract text from a PDF and know its bounding box coordinates in PDF user space. For example, pymupdf has a method for getting text blocks which returns bbox coordinates. What would be the equivalent in pypdf?

stefan6419846 commented 1 month ago

In this specific case (for the PDF given), mult(tm, cm) should be equivalent to tm as far as I remember. Thus using tm in this case would work, but mult(tm, cm) is better.
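
To make the difference concrete, here is a minimal sketch of a visitor that filters on the y coordinate of the combined matrix instead of `cm[5]`. The helper names `compose`, `make_body_visitor` and `extract_body` are made up for illustration; `compose` is a hand-written product of two 6-element PDF matrices that is assumed to behave like the `mult(tm, cm)` call quoted from the docs above, and the 50/720 limits are simply the ones from the docs example.

```python
from typing import Callable, List, Sequence

Matrix = Sequence[float]  # PDF matrix in the form [a, b, c, d, e, f]


def compose(tm: Matrix, cm: Matrix) -> List[float]:
    """Multiply two 6-element PDF matrices: apply tm first, then cm.

    Hypothetical stand-in for mult(tm, cm); not taken from pypdf itself.
    """
    return [
        tm[0] * cm[0] + tm[1] * cm[2],
        tm[0] * cm[1] + tm[1] * cm[3],
        tm[2] * cm[0] + tm[3] * cm[2],
        tm[2] * cm[1] + tm[3] * cm[3],
        tm[4] * cm[0] + tm[5] * cm[2] + cm[4],
        tm[4] * cm[1] + tm[5] * cm[3] + cm[5],
    ]


def make_body_visitor(
    parts: List[str], y_min: float = 50, y_max: float = 720
) -> Callable:
    """Build a visitor that keeps text whose user-space y lies in (y_min, y_max)."""

    def visitor_body(text, cm, tm, font_dict, font_size):
        # Index 5 of the combined matrix is the vertical translation.
        y = compose(tm, cm)[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


def extract_body(pdf_path: str, page_index: int) -> str:
    """Run the visitor on one page; requires pypdf and an existing PDF file."""
    from pypdf import PdfReader  # imported here so the helpers above run without pypdf

    parts: List[str] = []
    page = PdfReader(pdf_path).pages[page_index]
    page.extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

With an identity `cm`, `compose(tm, cm)` equals `tm`, which matches the observation that plain `tm` happens to work for this PDF. Calling `extract_body("GeoBase_NHNC1_Data_Model_UML_EN.pdf", 3)` would correspond to the page used in the report.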

AFAIK there is no way to get the bounding boxes at the moment, just the "reference position" from the visitors. To get full bounding boxes, you would have to further work with the font properties.
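
As a very rough illustration of what "working with the font properties" could mean, the sketch below turns a reference position into an approximate box. Everything here is an assumption for illustration: `estimate_bbox` is a hypothetical helper, the average glyph width of 0.5 em is a guess rather than a real font metric, rotation and descenders are ignored, and `font_size` is assumed to be in text-space units that still need scaling by the combined matrix. Real bounding boxes would need the per-glyph widths from the font.

```python
from typing import Sequence, Tuple


def estimate_bbox(
    text: str,
    matrix: Sequence[float],
    font_size: float,
    avg_glyph_width: float = 0.5,
) -> Tuple[float, float, float, float]:
    """Approximate (x0, y0, x1, y1) for a text fragment.

    matrix is the combined text-to-user-space matrix [a, b, c, d, e, f].
    Assumes unrotated text (b == c == 0) and a guessed average glyph width
    expressed as a fraction of the font size.
    """
    x0, y0 = matrix[4], matrix[5]
    # Horizontal extent: character count times guessed glyph width, scaled by a.
    width = len(text) * font_size * avg_glyph_width * matrix[0]
    # Vertical extent: one font size above the baseline, scaled by d.
    height = font_size * matrix[3]
    return (x0, y0, x0 + width, y0 + height)
```

Inside a visitor, the `matrix` argument would be the combined `tm`/`cm` matrix and `font_size` the value passed to the visitor; the result is only good enough for coarse region checks.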

lucasgadams commented 1 month ago

Got it, sounds like this library is not a good fit for my use case, and pdfminer might be better. Just for my knowledge, where would you say pypdf excels vs other open source python pdf libraries? What is the intended use case?

stefan6419846 commented 1 month ago

I consider pypdf the liberally licensed PDF library written in pure Python for reading, modifying and writing PDF files. This includes handling metadata, doing (basic) text extraction, extracting images, filling forms, adding watermarks and backgrounds to pages, removing or adding pages (including merging), transforming pages, ...

Depending on your use case, other libraries might be a better fit at the moment, which I am not going to deny. Working with signed PDF files as in pyhanko, extracting character-level data as in pdfminer.six or the MuPDF CLI, and rendering pages to images as with poppler are not supported; these are common cases where I rely on other tools as well.