jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Multiple letters extracted on PDF table by using extract_text #1155

Closed: ervinwirth closed this issue 3 months ago

ervinwirth commented 3 months ago

Describe the bug

I have this PDF with a table on the third page (in Hungarian): [image]

The extracted text: SZÁMLARÉSZLETEZŐ Elszámolt mennyiség: ivóvíz szolgáltatás: 0 m3, szennyvízelvezetés- és tisztítás: 0 m3 Árszabás: Közületi TTéétteell mmeeggnneevveezzééssee EEllsszzáámmoolltt iiddőősszzaakk MMéérrőőáállllááss EEllsszzáámmoolltt NNeettttóó eeggyyssééggáárr ééss NNeettttóó ddííjj ((FFtt)) ÁÁFFAA BBrruuttttóó ddííjj ((FFtt)) ((iinndduullóó,, zzáárróó)) mmeennnnyyiisséégg ééss mméérrttéékkeeggyysséégg ((%%)) mméérrttéékkeeggyysséégg IIvvóóvvíízz--sszzoollggáállttaattááss aallaappddííjj vvaaggyy 22002233..1122..0011--22002233..1122..3311 11 hhóó 440000,,0000 FFtt//hhóó 440000 2277 550088 ááttaalláánnyy AA110099992255 sszzáámmúú vvíízzmméérrőőnn mméérrtt 22002233..1122..0011--22002233..1122..3311 00 mm33 443300,,0000 FFtt//mm33 00 2277 00 iivvóóvvíízz ffooggyyaasszzttáássssaall aarráánnyyooss ddííjj -- ÁÁttllaaggffooggyyaasszzttááss SSzzeennnnyyvvíízzeellvveezzeettééss ééss ttiisszzttííttááss aallaappddííjj 22002233..1122..0011--22002233..1122..3311 11 hhóó 440000,,0000 FFtt//hhóó 440000 2277 550088 vvaaggyy ááttaalláánnyy EEllvveezzeetteetttt mmeennnnyyiissééggggeell aarráánnyyooss 22002233..1122..0011--22002233..1122..3311 00 mm33 339955,,0000 FFtt//mm33 00 2277 00 sszzeennnnyyvvíízzddííjj -- AA110099992255 sszzáámmúú mméérrőő -- ÁÁttllaaggffooggyyaasszzttááss ÁÁtthháárrííttootttt vvíízztteerrhheellééssii ddííjj -- AA110099992255 22002233..1122..0011--22002233..1122..3311 00 mm33 88,,0000 FFtt//mm33 00 2277 00 sszzáámmúú mméérrőő --ÁÁttllaaggffooggyyaasszzttááss Kerekítés (Ft) 0 Bruttó számlaérték összesen**: 1 016 Fizetendő összeg: 1 016 ÁFA összesítő (Ft) Nettó (Ft) ÁFA (%) ÁFA (Ft) Bruttó (Ft) 800 27 216 1 016

As you can see, each letter in the table is duplicated: "mmeeggnneevveezzééssee" instead of "megnevezése".

Have you tried repairing the PDF?

I have tried it, installed Ghostscript, but I got an error:

    2024-06-17 11:10:52,550 - ERROR - An unexpected error occurred: GPL Ghostscript 10.03.1 (2024-05-02)
    Copyright (C) 2024 Artifex Software, Inc. All rights reserved.
    This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
    see the file COPYING for details.
    Error: /undefined in endobj
    Operand stack:
    Execution stack:
       %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1949 1 3 %oparray_pop 1948 1 3 %oparray_pop 1933 1 3 %oparray_pop 1803 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval--
    Dictionary stack:
       --dict:750/1123(ro)(G)-- --dict:0/20(G)-- --dict:85/200(L)--
    Current allocation mode is local
    Last OS error: No such file or directory
    GPL Ghostscript 10.03.1: Unrecoverable error, exit code 1
    Traceback (most recent call last):
      File "C:\Users\wirth.ervin\Repos\pdf-processing-app\pdf_processing_project\pdf_processing_app\views.py", line 197, in upload_pdf
        with pdfplumber.open(pdf_file, repair=True, gs_path=gs_path) as pdf:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wirth.ervin\Repos\pdf-processing-app\venv\Lib\site-packages\pdfplumber\pdf.py", line 80, in open
        stream = _repair(path_or_fp, password=password, gs_path=gs_path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\wirth.ervin\Repos\pdf-processing-app\venv\Lib\site-packages\pdfplumber\repair.py", line 55, in _repair
        raise Exception(f"{stderr.decode('utf-8')}")
    Exception: GPL Ghostscript 10.03.1 (2024-05-02)
    Copyright (C) 2024 Artifex Software, Inc. All rights reserved.
    This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
    see the file COPYING for details.
    Error: /undefined in endobj
    Operand stack:
    Execution stack:
       %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1949 1 3 %oparray_pop 1948 1 3 %oparray_pop 1933 1 3 %oparray_pop 1803 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval--
    Dictionary stack:
       --dict:750/1123(ro)(G)-- --dict:0/20(G)-- --dict:85/200(L)--
    Current allocation mode is local
    Last OS error: No such file or directory
    GPL Ghostscript 10.03.1: Unrecoverable error, exit code 1

I repaired the file with IlovePDF, but I got the same error.

Code to reproduce the problem

    import pdfplumber

    text_all_pages = ""  # accumulator must exist before the loop
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            text_all_pages += page.extract_text()

PDF file

Redacted file attached. example_redacted.pdf

Environment

jsvine commented 3 months ago

Have you tried using page.dedupe_chars().extract_text()? Some details on the dedupe_chars(...) method here: https://github.com/jsvine/pdfplumber?tab=readme-ov-file#extracting-text
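For reference, a minimal sketch of how that suggestion could be applied to every page of a file. The helper names `extract_deduped_text` and `read_pdf_text` are made up for this illustration; only `pdfplumber.open`, `pdf.pages`, `page.dedupe_chars(tolerance=...)` and `page.extract_text()` come from pdfplumber itself.

```python
def extract_deduped_text(pages, tolerance=1):
    """Join the text of each page after dropping overlapping duplicate chars.

    `pages` is any iterable of objects exposing pdfplumber's
    `dedupe_chars(tolerance=...)` and `extract_text()` methods
    (for example `pdf.pages`). This helper is hypothetical, not part of pdfplumber.
    """
    parts = []
    for page in pages:
        text = page.dedupe_chars(tolerance=tolerance).extract_text()
        parts.append(text or "")  # tolerate pages that yield no text
    return "\n".join(parts)


def read_pdf_text(path, tolerance=1):
    """Open a PDF with pdfplumber and return its de-duplicated text."""
    # Imported lazily so the helper above can be exercised without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return extract_deduped_text(pdf.pages, tolerance=tolerance)
```

With the attached file this would be called as `read_pdf_text("example_redacted.pdf")`. A page-by-page join with a newline is only one possible choice; the original loop concatenated pages without a separator.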

ervinwirth commented 3 months ago

Hmm, it solves the issues. Not sure how: "Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within tolerance x/y) as other characters — removed."

Was there an error in the PDF?

jsvine commented 3 months ago

Great, thanks for checking/confirming. Not exactly an "error" (which I'd associate more with incorrect encodings) and more of a redundancy or quirk. Some PDFs (for reasons sometimes unknown), such as this one, write multiple instances of the same character. This is relatively rare, but common enough that it seemed worth adding the .dedupe_chars(...) method.
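To make the idea concrete, here is a simplified illustration of that kind of de-duplication. It is not pdfplumber's actual implementation: the function name `dedupe_sketch` is invented for this example, and the char dicts are hand-written, assuming only the `text`, `fontname`, `size`, `x0` and `top` keys that pdfplumber exposes on each char.

```python
def dedupe_sketch(chars, tolerance=1.0):
    """Keep the first of any chars that share text, fontname, and size
    and sit at (nearly) the same position. Illustrative only."""
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == k["text"]
            and ch["fontname"] == k["fontname"]
            and ch["size"] == k["size"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


def make_char(text, x0, top=100.0):
    # Hypothetical char record with a fixed font, for the demo below.
    return {"text": text, "fontname": "Arial", "size": 9.0, "x0": x0, "top": top}


# "me" where the PDF draws every glyph twice at almost the same spot,
# followed by a genuine double letter "tt" at two distinct positions.
doubled = [
    make_char("m", 10.0), make_char("m", 10.2),
    make_char("e", 16.0), make_char("e", 16.1),
    make_char("t", 22.0), make_char("t", 27.0),
]
result = "".join(c["text"] for c in dedupe_sketch(doubled))
print(result)  # -> mett
```

Because the comparison includes position, a legitimately doubled letter such as the "tt" in "Nettó" survives, while glyphs stacked on top of each other collapse to one.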

ervinwirth commented 3 months ago

Thank you :), is it possible to ask you about another 'interesting' case?

jsvine commented 3 months ago

Yes! Feel free to open a discussion or feature-request-tagged issue.