jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Extracting table spanning 2 pages #550

Closed SamGoodin closed 2 years ago

SamGoodin commented 2 years ago

Describe the bug

Hey. I'm trying to extract a lot of tables from a cyber insurance document. One of the important tables I need spans two pages. I can iterate through the pages and extract all the other tables using the extract_tables method, since each of them starts and ends on the same page; the only exception is the table that spans 2 pages. I tried iterating through each table instead and printing its items, but the text I need from that specific table comes up as a bunch of (cid:xx). Is there a workaround for extracting a table spanning 2 pages?

The table in question is the Endorsements table, starting on page 2 and ending on page 3.

cyber-insurance-policy-1.pdf

The following is the result of using the extract_tables method on the first 3 pages:

[[['Company Name: National Specialty Insurance Company, a stock company \n1900 L. Don Dodson Drive, Ste. 1109 \nBedford, TX 76021'], ['Producer Name: Trava \n123 Test Street \nTest, VT 12345'], ['Named Insured: \nTrava'], ['Policy Number: \nBLU-TRC-L6J6JZZJV'], ['Mailing Address: \n830 Massachusetts Ave \nIndianapolis, IN 46204'], ['Policy Period: \nFrom: Dec 02, 2021 \nTo: Dec 02, 2022 \n12:01 AM standard time at the Insured’s mailing address shown above'], ['Web Site Address(es):'], ['Type Of Business Organization (Check appropriate box): \n(cid:1)Private Public\nInvestment Fund Not for Profit\nGovernment']], [['(cid:1)'], [''], ['']], [[''], ['']]]

[[['Annual Premium: $1,100'], ['Fees/Assessment: \n$4.40 \nThe Policy Fee disclosed here is included in the Annual Premium above'], ['Policy Aggregate Limit of Insurance: $1,000,000'], ['Policy Deductible Amount: $10,000'], ['Business Income and Extra Expense Time Deductible: 6 Hours'], ['Social Engineering Coverage Limit: $50,000'], ['Social Engineering Deductible: $10,000'], ['Retroactive Date: Dec 02, 2020'], ['Coverage\nAggregate Sublimit(s) of Insurance: Percentage Limit\nDollars\nInsuring Agreement 2. Extortion Threat – Ransom Payments 5% $50,000 \nInsuring Agreement 4. Business Income and Extra Expense 50% $500,000 \nInsuring Agreement 5. Public Relations Expense 5% $50,000 \nInsuring Agreement 7. Ransom Payments $50,000']]]

[]

samkit-jain commented 2 years ago

Hi @SamGoodin, I appreciate your interest in the library. Unfortunately, pdfplumber does not support this out of the box. You will have to extract the tables from both pages and then combine them.
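For anyone landing here later, the "extract from both pages, then combine" approach might look roughly like the sketch below. This is not a pdfplumber feature: `extract_tables_by_page` and `merge_spanning_table` are hypothetical helper names, and the sketch assumes the spanning table is the last table detected on one page and the first table detected on the next, which will not hold for every document. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_tables()` are actual library calls.

```python
from typing import List, Optional

Row = List[Optional[str]]
Table = List[Row]


def extract_tables_by_page(pdf_path: str) -> List[List[Table]]:
    """Return one list of tables per page, as produced by page.extract_tables()."""
    import pdfplumber  # imported lazily so the merge helper has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_tables() for page in pdf.pages]


def merge_spanning_table(
    tables_by_page: List[List[Table]],
    page_index: int,
    drop_repeated_header: bool = True,
) -> Table:
    """Join the last table on page_index with the first table on the next page.

    Assumption (hypothetical heuristic): the table that continues is the last
    one on the first page and the first one on the following page.
    """
    if page_index + 1 >= len(tables_by_page):
        raise ValueError("no following page to merge with")
    first_half = tables_by_page[page_index]
    second_half = tables_by_page[page_index + 1]
    if not first_half or not second_half:
        raise ValueError("no table detected on one of the two pages")

    head, tail = first_half[-1], second_half[0]
    # Some documents repeat the header row on the continuation page.
    if drop_repeated_header and head and tail and tail[0] == head[0]:
        tail = tail[1:]
    return head + tail


if __name__ == "__main__":
    # Made-up rows for illustration only; they are not taken from the PDF above.
    demo = [
        [[["Form", "Title"], ["E-1", "First endorsement"]]],
        [[["Form", "Title"], ["E-2", "Second endorsement"]]],
    ]
    for row in merge_spanning_table(demo, page_index=0):
        print(row)
```

With a real file, the call would be along the lines of `merge_spanning_table(extract_tables_by_page("cyber-insurance-policy-1.pdf"), page_index=1)`, since page indices are zero-based. Note that in the output posted above the third page returned an empty list, so the helper would raise `ValueError` there; the second half of the table would first have to be detected, possibly by passing different `table_settings` to `extract_tables`.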

jsvine commented 2 years ago

Hi @SamGoodin and thanks @samkit-jain. As this isn't a feature we're likely to implement soon, I'm closing this issue.