py-pdf / pypdf_table_extraction

A Python library to extract tabular data from PDFs
https://pypdf-table-extraction.readthedocs.io
MIT License

How can I read a table that starts on page 1 and extends across multiple pages? #192

Open dejanmarkovic opened 2 days ago

dejanmarkovic commented 2 days ago

pypdf_table_extraction/camelot does not recognize the table on pages after page 1 with the lattice flavor.

With the stream flavor, I get garbled output like this:

   0            1            2                                  3                       4         5
0                                                                      2059001013453712313
1                               289 Transakcije po nalogu građana                    PBO:
2                                                                        MARY MILAN
3  5  12.05.2024.  12.05.2024.     n 9001013454849 III rata   maj                    PBZ:  1.600,00
4                                                                  KNEZ MILET 456 4 11
5                                                   Instant nalog            FT241123YJFB4
6                                                         Belgrade

This is the lattice output from page one, which looks great:

0  REDNI\nBROJ  DATUM\nPRIJEMA  DATUM\nIZVRŠENJA  ...  REFERENCA KLIJENTA\nREFERENCA PARTNERA\nREFERE...  NA TERET  U KORIST
1            1     11.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT201661TXR4            4.200,00
2            2     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20122CK6Y6            5.600,00
3            3     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20134Y5NWL            5.600,00
4            4     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20124QY6JZ            5.600,00

The document is a PDF bank statement. NOTE: I have randomized the numbers in the output for privacy and security purposes.
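To narrow down which pages the lattice flavor is skipping, the parsing reports can be compared against the page count. The sketch below is a minimal illustration, assuming each `table.parsing_report` is a dict with a `"page"` key (as it is in Camelot); `pages_without_tables` and the sample reports are hypothetical, and `total_pages` has to be supplied by the caller.

```python
def pages_without_tables(parsing_reports, total_pages):
    """Return the 1-based page numbers for which no table was reported."""
    # Assumption: every report is a dict with an integer "page" entry.
    pages_with_tables = {report["page"] for report in parsing_reports}
    return [p for p in range(1, total_pages + 1) if p not in pages_with_tables]


# Hypothetical reports, standing in for [t.parsing_report for t in tables].
reports = [
    {"accuracy": 99.1, "whitespace": 12.3, "order": 1, "page": 1},
]

print(pages_without_tables(reports, total_pages=3))
```

If the returned list contains every page after the first, the problem is in detection on those pages rather than in merging.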

bosd commented 1 day ago

Are there 2 separate issues here?

  1. "pypdf_table_extraction/camelot does not recognize the table on pages after page 1 with the lattice flavor." This could be a bug.
  2. Merging tables that span multiple pages is, as far as I know, not a covered use case. Merging the tables can be done in post-processing.
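As a rough illustration of the post-processing idea, the sketch below concatenates per-page tables and keeps the header row only once. It is not part of the library: `merge_page_tables` is a hypothetical helper that works on plain lists of rows, which could be obtained from each extracted table with something like `table.df.values.tolist()`. It assumes a repeated header is an exact copy of the first page's header row.

```python
def merge_page_tables(page_tables):
    """Concatenate per-page row lists, dropping repeated header rows."""
    merged = []
    header = None
    for rows in page_tables:
        if not rows:
            continue  # nothing was extracted from this page
        if header is None:
            # The first non-empty page defines the header.
            header = rows[0]
            merged.append(header)
            merged.extend(rows[1:])
        elif rows[0] == header:
            # Later page repeats the header: skip it.
            merged.extend(rows[1:])
        else:
            # Continuation page without a header: keep every row.
            merged.extend(rows)
    return merged


# Made-up sample rows shaped like the statement above.
page_1 = [["REDNI BROJ", "DATUM PRIJEMA"], ["1", "11.05.2024."]]
page_2 = [["REDNI BROJ", "DATUM PRIJEMA"], ["5", "12.05.2024."]]
page_3 = [["6", "13.05.2024."]]

for row in merge_page_tables([page_1, page_2, page_3]):
    print(row)
```

This only helps once every page yields a table with the same column layout, so it does not address the detection problem on later pages.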

Have you tried the output with the Network parser?

dejanmarkovic commented 15 hours ago

With this code:

    import pypdf_table_extraction

    file_path = r"C:\Projects\temp123\attachments\test\er\er3.pdf"

    flavors = ["hybrid", "lattice", "network", "stream"]

    for flavor in flavors:
        print(f"\nTrying {flavor} flavor:")
        try:
            tables = pypdf_table_extraction.read_pdf(
                file_path,
                pages="all",
                flavor=flavor,  # Use the current flavor
            )

            print(f"Number of tables found: {len(tables)}")

            for i, table in enumerate(tables):
                print(f"\nTable {i} data:")
                print(table.df)

                csv_path = f"{flavor}_table_{i}.csv"
                table.df.to_csv(csv_path, index=False)
                print(f"Table {i} saved to {csv_path}")

            for i, table in enumerate(tables):
                print(f"\nParsing report for {flavor} Table {i}:")
                print(table.parsing_report)

        except Exception as e:
            print(f"An error occurred with {flavor} flavor: {str(e)}")
            continue

    print("\nTable extraction process completed.")

I am getting the following errors:

  1. Trying hybrid flavor: An error occurred with hybrid flavor: Unknown flavor specified. Use either 'lattice' or 'stream'
  2. An error occurred with network flavor: Unknown flavor specified. Use either 'lattice' or 'stream'

NOTE: I uninstalled both Camelot and pypdf_table_extraction, then reinstalled only the pypdf_table_extraction library, so there should be no conflicts or other issues.

Can you please help/advise?

bosd commented 13 hours ago

Based on the following error message:

2. An error occurred with network flavor: Unknown flavor specified. Use either 'lattice' or 'stream'

It looks like you are somehow running an old code base. As of V0.0.2 the error message changed to:


        raise NotImplementedError(
            "Unknown flavor specified."
            " Use either 'lattice', 'stream', 'network' or 'hybrid'"
        )

Maybe uninstall both again, then reinstall pypdf_table_extraction. What is the output of `pip show pypdf_table_extraction` or `camelot --version`?
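The same check can be done from inside the interpreter that runs the script, which rules out a mismatch between the shell's `pip` and the active environment. This is a sketch using only the standard library; `installed_version` and `module_location` are hypothetical helper names, and `camelot-py` is the PyPI distribution name of the original Camelot.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_location(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


for dist in ("pypdf-table-extraction", "camelot-py"):
    print(dist, "->", installed_version(dist))

print("imported from:", module_location("pypdf_table_extraction"))
```

If the reported location points at a leftover Camelot checkout or a different environment's site-packages, that would explain the old error message.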

dejanmarkovic commented 13 hours ago

conda list

packages in environment at C:\Users\user.conda\envs\camelot_env:

    Name                      Version       Build               Channel
    beautifulsoup4            4.12.3        pypi_0              pypi
    bzip2                     1.0.8         hcfcfb64_5          conda-forge
    ca-certificates           2024.8.30     h56e8100_0          conda-forge
    cachetools                5.5.0         pypi_0              pypi
    certifi                   2024.8.30     pyhd8ed1ab_0        conda-forge
    cffi                      1.17.1        pypi_0              pypi
    chardet                   5.2.0         pypi_0              pypi
    charset-normalizer        3.4.0         pypi_0              pypi
    click                     8.1.7         pypi_0              pypi
    colorama                  0.4.6         pypi_0              pypi
    cryptography              43.0.3        pypi_0              pypi
    cssselect                 1.2.0         pypi_0              pypi
    distro                    1.9.0         pypi_0              pypi
    et-xmlfile                1.1.0         pypi_0              pypi
    ghostscript               0.7           pypi_0              pypi
    google                    3.0.0         pypi_0              pypi
    google-api-core           2.19.2        pypi_0              pypi
    google-api-python-client  2.143.0       pypi_0              pypi
    google-auth               2.34.0        pypi_0              pypi
    google-auth-httplib2      0.2.0         pypi_0              pypi
    google-auth-oauthlib      1.2.1         pypi_0              pypi
    googleapis-common-protos  1.65.0        pypi_0              pypi
    httplib2                  0.22.0        pypi_0              pypi
    icu                       75.1          he0c23c2_0          conda-forge
    idna                      3.8           pypi_0              pypi
    libabseil                 20240116.2    cxx17_h63175ca_0    conda-forge
    libexpat                  2.6.2         h63175ca_0          conda-forge
    libffi                    3.4.2         h8ffe710_5          conda-forge
    libprotobuf               4.25.3        h503648d_0          conda-forge
    libsqlite                 3.46.0        h2466b09_0          conda-forge
    libzlib                   1.3.1         h2466b09_1          conda-forge
    lxml                      5.2.2         pypi_0              pypi
    lz4-c                     1.9.4         hcfcfb64_0          conda-forge
    mysql                     9.0.1         h9c18f36_0          conda-forge
    mysql-client              9.0.1         h809f9c2_0          conda-forge
    mysql-common              9.0.1         h2224204_0          conda-forge
    mysql-connector-python    9.0.0         py312h275cf98_0     conda-forge
    mysql-devel               9.0.1         h2224204_0          conda-forge
    mysql-libs                9.0.1         h809f9c2_0          conda-forge
    mysql-server              9.0.1         h63c2bd3_0          conda-forge
    numpy                     2.1.2         pypi_0              pypi
    oauthlib                  3.2.2         pypi_0              pypi
    opencv-python             4.10.0.84     pypi_0              pypi
    openpyxl                  3.1.5         pypi_0              pypi
    openssl                   3.3.2         h2466b09_0          conda-forge
    pandas                    2.2.3         pypi_0              pypi
    pdfminer-six              20240706      pypi_0              pypi
    pdfplumber                0.11.4        pypi_0              pypi
    pdfquery                  0.4.3         pypi_0              pypi
    pillow                    11.0.0        pypi_0              pypi
    pip                       24.0          pyhd8ed1ab_0        conda-forge
    proto-plus                1.24.0        pypi_0              pypi
    protobuf                  5.28.0        pypi_0              pypi
    pyasn1                    0.6.0         pypi_0              pypi
    pyasn1-modules            0.4.0         pypi_0              pypi
    pycparser                 2.22          pypi_0              pypi
    pymupdf                   1.24.7        pypi_0              pypi
    pymupdfb                  1.24.6        pypi_0              pypi
    pypdf                     4.3.1         pypi_0              pypi
    pypdf-table-extraction    0.0.2         pypi_0              pypi
    pypdf2                    2.11.1        pyhd8ed1ab_0        conda-forge
    pypdfium2                 4.30.0        pypi_0              pypi
    pyquery                   2.0.0         pypi_0              pypi
    python                    3.12.4        h889d299_0_cpython  conda-forge
    python-dateutil           2.9.0.post0   pypi_0              pypi
    python_abi                3.12          4_cp312             conda-forge
    pytz                      2024.2        pypi_0              pypi
    pyyaml                    6.0.2         pypi_0              pypi
    requests                  2.32.3        pypi_0              pypi
    requests-oauthlib         2.0.0         pypi_0              pypi
    roman                     4.2           pypi_0              pypi
    rsa                       4.9           pypi_0              pypi
    setuptools                70.1.1        pyhd8ed1ab_0        conda-forge
    six                       1.16.0        pypi_0              pypi
    soupsieve                 2.6           pypi_0              pypi
    tabula-py                 2.9.3         pypi_0              pypi
    tabulate                  0.9.0         pypi_0              pypi
    tk                        8.6.13        h5226925_1          conda-forge
    tzdata                    2024.2        pypi_0              pypi
    ucrt                      10.0.22621.0  h57928b3_0          conda-forge
    uritemplate               4.1.1         pypi_0              pypi
    urllib3                   2.2.2         pypi_0              pypi
    vc                        14.3          h8a93ad2_20         conda-forge
    vc14_runtime              14.40.33810   ha82c5b3_20         conda-forge
    vs2015_runtime            14.40.33810   h3bf8584_20         conda-forge
    wheel                     0.43.0        pyhd8ed1ab_1        conda-forge
    xz                        5.2.6         h8d14728_0          conda-forge
    zstd                      1.5.6         h0ea2cb4_0          conda-forge