pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Is there a way to extract a table from a PDF to CSV? #1033

Open martynjlewis opened 2 months ago

martynjlewis commented 2 months ago

Hi all

I have tried using pdfminer.six to extract a table from a PDF to a CSV file for use in Excel, but have been unsuccessful so far. I either get each entry on a separate line, or I get each heading followed by its corresponding cell, running vertically rather than horizontally. I've attached the PDF I created as a test and the resulting output.

Can anyone help please?

test3.pdf test3.csv

devwasabi2 commented 2 months ago

I've come across some nice Paddle models that can extract tables from PDFs and save them to a .csv file. Follow this link: paddle. I hope it helps ;)

Some1Somewhere commented 1 week ago

I haven't used pdfminer for this, but I have done it with pdfplumber. Code attached below! It walks the tables on every page and merges any row that has an empty cell into the row before it, so a cell whose text was split across lines or pages ends up combined.

import csv
import io

import pdfplumber


def pdf_to_csv(pdf_path):
    """Extract tables with pdfplumber, merge continuation rows, and return CSV text."""
    rows = []  # Completed rows, in output order
    with pdfplumber.open(pdf_path) as pdf:
        prev_row = None  # Row still being assembled; carried across tables and pages

        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    # Normalise cells: None becomes "", (cid:x) patterns are replaced
                    row = [prune_text(cell).strip() if cell else "" for cell in row]
                    if prev_row is None:
                        prev_row = row
                        continue

                    # Treat the row as a continuation if any cell is empty
                    if any(cell == "" for cell in row):
                        # Append its non-empty cells to the matching cells of prev_row
                        prev_row = [
                            f"{prev_cell} {cell}".strip() if cell else prev_cell
                            for prev_cell, cell in zip(prev_row, row)
                        ]
                    else:
                        # prev_row is complete: store it and start a new one
                        rows.append(prev_row)
                        prev_row = row
                # No separator row is added between tables, so a table that
                # continues on the next page stays contiguous in the output

        # Write the final row once, after all pages are processed
        if prev_row is not None:
            rows.append(prev_row)

    # Serialise with the csv module so commas and quotes are escaped correctly
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
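If it helps to see the merge rule without a PDF, here is a minimal sketch that applies the same idea to an in-memory table. The names `merge_continuation_rows` and `rows_to_csv` and the sample data are made up for illustration; they are not part of pdfplumber or pdfminer.six.

```python
import csv
import io


def merge_continuation_rows(table):
    """Merge rows that contain an empty cell into the row above them."""
    merged = []
    for row in table:
        # Treat None like an empty cell and trim surrounding whitespace
        cells = [(cell or "").strip() for cell in row]
        is_continuation = any(cell == "" for cell in cells)
        if merged and is_continuation:
            previous = merged[-1]
            merged[-1] = [
                f"{prev} {cell}".strip() if cell else prev
                for prev, cell in zip(previous, cells)
            ]
        else:
            merged.append(cells)
    return merged


def rows_to_csv(rows):
    """Serialise rows to CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical table, shaped like the output of page.extract_tables()[0]
sample_table = [
    ["Name", "Description", "Price"],
    ["Widget", "A small, round", "4.50"],
    [None, "widget", ""],
    ["Gadget", "Plain", "9.00"],
]

merged_rows = merge_continuation_rows(sample_table)
csv_text = rows_to_csv(merged_rows)
print(csv_text)
```

Note that the rule is only a heuristic: a genuine row that happens to have a blank cell would also be folded into the row above it, so it may need adjusting for tables with legitimately empty cells.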

There's also a CID issue that pops up for unrecognised characters, which is what prune_text handles:

import re

# Regular expression to find all (cid:x) patterns
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")

# Specific CID-to-character mappings
CID_MAPPING = {
    0: "- ",  # Example: (cid:0) to a dash used as a bullet point
    # Add more mappings as needed, e.g. 66: "B"
}


def prune_text(text):
    """
    Replace (cid:x) patterns in the text with corresponding characters.

    Args:
        text (str): The input text containing (cid:x) patterns.

    Returns:
        str: The processed text with (cid:x) replaced.
    """

    def replace_cid(match):
        cid_num = int(match.group(1))
        if cid_num in CID_MAPPING:
            return CID_MAPPING[cid_num]
        try:
            # Fallback: treat the CID as a Unicode code point. This is only a
            # guess, since CIDs are font-specific and often do not match Unicode.
            return chr(cid_num)
        except (ValueError, OverflowError):
            # Number is outside the valid code point range: drop the marker
            return ""

    return CID_PATTERN.sub(replace_cid, text)
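As a quick illustration of the substitution, here is a small standalone sketch. It is a simplified variant, not the function above: `replace_cids` and `example_mapping` are hypothetical names, and it drops any unmapped marker instead of guessing a character with chr().

```python
import re

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, mapping):
    """Replace each (cid:N) marker using `mapping`; drop markers that are not mapped."""

    def substitute(match):
        return mapping.get(int(match.group(1)), "")

    return CID_PATTERN.sub(substitute, text)


# Made-up mapping for the example; real values depend on the PDF's fonts
example_mapping = {0: "- ", 66: "B"}
cleaned = replace_cids("(cid:0)Item (cid:66)1 (cid:999)", example_mapping)
print(repr(cleaned))
```

Dropping unknown markers keeps stray control characters out of the CSV, at the cost of losing whatever glyph was there, so which fallback is better probably depends on the document.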