jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Extract tables not extracting particular format of tables #993

Closed John-Peter-R closed 1 year ago

John-Peter-R commented 1 year ago

I use extract_tables to extract the table, but it extracts only the first row. The other rows are not extracted.

_WebTrust Principles and Criteria for Certification Authorities – SSL Baseline with Network Security – Version 2.7-12-14.pdf

The PDF is attached for reference. I could not find a way anywhere to extract the tables in this PDF.

samkit-jain commented 1 year ago

Hi @John-Peter-R, I appreciate your interest in the library. You're only seeing the header because the header row is the only one completely made up of line objects. To get the complete table, you need to use the explicit strategy for both directions. For the horizontal lines, use curves + edges; for the vertical lines, take the positions from the header row. The code would be:

import pdfplumber

def get_vertical_lines_from_header(page):
    """
    In some PDFs, only the header row is fully enclosed in drawn lines, so the "lines"/"lines"
    strategies detect it as a table. Instead of defining vertical line positions by hand, reuse
    that table's cell coordinates as the explicit vertical lines.
    """
    column_count = 3
    # Find tables.
    tables = page.find_tables(table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"})

    # No table found.
    if len(tables) == 0:
        return []

    # Find the header row: the first row, across all tables, that has exactly
    # "column_count" cells and none of them missing (None).
    for table in tables:
        for row in table.rows:
            if any(cell is None for cell in row.cells):
                continue

            if len(row.cells) != column_count:
                continue

            # Each cell is a bbox (x0, top, x1, bottom): take every cell's left
            # edge, plus the right edge of the last cell to close the table.
            return [cell[0] for cell in row.cells] + [row.cells[-1][2]]

    # No suitable header row found.
    return []

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]

# Table settings.
ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": get_vertical_lines_from_header(page),
    "explicit_horizontal_lines": page.curves + page.edges,
}

# Debug visually.
image = page.to_image(resolution=200)
image.reset().debug_tablefinder(ts)
image.save("image.png", format="PNG")

# Extract table.
tables = page.extract_tables(table_settings=ts)
for table in tables:
    print()
    for row in table:
        print(row)

The result will be:

['', '', '']
['#', 'Criterion', 'Ref4']
['2.5', 'The CA maintains controls to provide reasonable assurance that\nthe extensions, key sizes, and certificate policy identifiers (including\nReserved Certificate Policy Identifiers) of Subscriber certificates\ngenerated conform to the Baseline Requirements.', '7.1.2.3,\n6.1.5, 7.1.6,\n7.1.6.4']
['2.6', 'The CA maintains controls to provide reasonable assurance that with\nexception to the requirements stipulated in the Baseline Requirements\nSections 7.1.2.1, 7.1.2.2, and 7.1.2.3, all other fields and extensions of\ncertificates generated are set in accordance with RFC 5280.', '7.1.2.4']
['2.7', 'The CA maintains controls to provide reasonable assurance that the\nvalidity period of Subscriber certificates issued does not exceed the\nmaximum as specified in the Baseline Requirements.', '6.3.2']
['2.8', 'The CA maintains controls to provide reasonable assurance that it does\nnot issue certificates with extensions that do not apply in the context of\nthe public Internet, unless:\na. Such values fall within an OID arc for which the Applicant\ndemonstrates ownership; or\nb. The Applicant can otherwise demonstrate the right to assert the\ndata in public context.', '7.1.2.4']
['2.9', 'The CA maintains controls to provide reasonable assurance that it does\nnot issue certificates with semantics that, if included, will mislead a\nRelying Party about the certificate information verified by the CA.', '7.1.2.4']
['2.10', 'The CA maintains controls to provide reasonable assurance that it does\nnot issue any new Subscriber or Subordinate CA certificates using the\nSHA-1 hash algorithm.', '7.1.3']
['2.11', 'The CA maintains controls to provide reasonable assurance that the\ncontent of the Certificate Issuer Distinguished Name field matches the\nSubject DN of the Issuing CA to support a valid Certification Path in\naccordance with Section 7.1.4.1 of the SSL Baseline Requirements.', '7.1.4.1']
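The extracted rows above still contain an empty leading row and line breaks inside cells. As a possible follow-up step, the sketch below turns such a list of rows into dictionaries keyed by the header row. It is not part of pdfplumber: `rows_to_records` is a hypothetical helper name, and the assumption that the first non-empty row is the header only holds for tables shaped like the one in this output.

```python
def rows_to_records(table):
    """Convert a list-of-lists table into a list of dicts keyed by the header row.

    Hypothetical helper (not a pdfplumber API). Assumes the first non-empty
    row of the table is the header.
    """
    # Drop rows whose cells are all empty or None, e.g. the leading ['', '', ''].
    rows = [row for row in table if any((cell or "").strip() for cell in row)]
    if not rows:
        return []

    header, *body = rows
    records = []
    for row in body:
        # Collapse the line breaks of wrapped cell text into single spaces.
        cleaned = [" ".join((cell or "").split()) for cell in row]
        records.append(dict(zip(header, cleaned)))
    return records


# Sample rows copied from the output above.
sample = [
    ["", "", ""],
    ["#", "Criterion", "Ref4"],
    [
        "2.5",
        "The CA maintains controls to provide reasonable assurance that\n"
        "the extensions, key sizes, and certificate policy identifiers (including\n"
        "Reserved Certificate Policy Identifiers) of Subscriber certificates\n"
        "generated conform to the Baseline Requirements.",
        "7.1.2.3,\n6.1.5, 7.1.6,\n7.1.6.4",
    ],
    [
        "2.10",
        "The CA maintains controls to provide reasonable assurance that it does\n"
        "not issue any new Subscriber or Subordinate CA certificates using the\n"
        "SHA-1 hash algorithm.",
        "7.1.3",
    ],
]

records = rows_to_records(sample)
for record in records:
    print(record)
```

In practice, each table returned by `page.extract_tables(table_settings=ts)` could be passed to this helper in place of `sample`.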