Open martynjlewis opened 2 months ago
I've come across some nice Paddle models that can extract tables from PDFs and save them to a .csv file. Follow this link: paddle. I hope it helps ;)
I haven't used pdfminer, but I have used pdfplumber for this. The code is below. It extracts the tables on every page and merges a row into the previous one when it looks like a continuation of the same cells (that is, when any of its cells is empty), including across page boundaries.
import pdfplumber


def pdf_to_csv(pdf_path):
    """Extract tables using PDFPlumber and combine rows with empty values."""
    csv_content = []  # To store CSV content as a list of rows
    with pdfplumber.open(pdf_path) as pdf:
        prev_row = None  # To store the previous row
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                for row in table:
                    # Empty or None cells become ""; other cells get (cid:x) cleanup
                    row = [prune_text(cell) if cell else "" for cell in row]
                    if prev_row is None:
                        prev_row = row
                        continue
                    # Determine if the row is a continuation (any cell empty)
                    if any(cell.strip() == "" for cell in row):
                        # Merge non-empty cells into prev_row
                        prev_row = [
                            (last_cell + " " + cell.strip()) if cell.strip() else last_cell
                            for last_cell, cell in zip(prev_row, row)
                        ]
                    else:
                        # Write the completed prev_row and update it
                        csv_content.append(prev_row)
                        prev_row = row
                # Optionally, add an empty row between tables
                # (prev_row is still pending here, so the separator precedes
                # that table's last row)
                csv_content.append([])
    # Write the final row after all pages are processed
    if prev_row:
        csv_content.append(prev_row)
    # Note: this is the string form of a Python list, not CSV text
    return str(csv_content)
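Note that the function above returns the string form of a Python list rather than CSV text. If the goal is a file that Excel can open, one option is to pass the list of rows through the standard csv module instead. Here is a minimal sketch, assuming the rows have the same list-of-lists shape as csv_content; the helper name rows_to_csv_text and the sample rows are hypothetical and not part of pdfplumber:

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialise a list of rows (lists of strings) as CSV text."""
    buffer = io.StringIO()
    # csv.writer quotes cells that contain commas, quotes or newlines
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Made-up rows standing in for csv_content
sample_rows = [["Name", "Notes"], ["Widget", "small, blue"]]
csv_text = rows_to_csv_text(sample_rows)
print(csv_text)
```

Each inner list becomes one horizontal line in the output, so headings and their values end up in columns rather than running vertically. The text can then be written to a .csv file opened with newline="" so that line endings are not translated a second time.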
There's also a CID issue that pops up for unrecognised characters: they come out as (cid:x) placeholders in the extracted text, which is what prune_text handles.
import re


def prune_text(text):
    """
    Replace (cid:x) patterns in the text with corresponding characters.

    Args:
        text (str): The input text containing (cid:x) patterns.

    Returns:
        str: The processed text with (cid:x) replaced.
    """
    # Define specific CID to character mappings
    cid_mapping = {
        0: "- ",  # Example: (cid:0) to bullet point
        # Add more mappings as needed
        # e.g., 66: 'B', etc.
    }

    def replace_cid(match):
        cid_num = int(match.group(1))
        if cid_num in cid_mapping:
            return cid_mapping[cid_num]
        try:
            # No explicit mapping: fall back to the character with that code point
            return chr(cid_num)
        except (ValueError, OverflowError):
            # Not a valid code point: drop the pattern
            return ""

    # Regular expression to find all (cid:x) patterns
    cid_pattern = re.compile(r"\(cid:(\d+)\)")
    pruned_text = cid_pattern.sub(replace_cid, text)
    return pruned_text
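To illustrate what the substitution does, here is a small self-contained sketch of the same re.sub technique on a made-up string. The mapping, the sample text and the helper name replace_cids are assumptions for illustration; real CID numbers depend on the font embedded in the PDF and do not necessarily correspond to Unicode code points. Unlike prune_text, this variant drops unmapped CIDs instead of falling back to chr():

```python
import re

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")

# Hypothetical mapping for illustration only
SAMPLE_MAPPING = {0: "- "}


def replace_cids(text, mapping):
    """Replace each (cid:x) with mapping[x], or drop it if x is unmapped."""
    def _replace(match):
        return mapping.get(int(match.group(1)), "")

    return CID_PATTERN.sub(_replace, text)


cleaned = replace_cids("(cid:0)First item(cid:999)", SAMPLE_MAPPING)
print(cleaned)  # - First item
```

Dropping unknown CIDs avoids inserting control characters or unrelated glyphs into the CSV, at the cost of silently losing those characters, so which fallback is preferable depends on the document.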
Hi all
I have tried using pdfminer.six to extract a table from a PDF into a CSV file for use in Excel, but without success so far. I either get each entry on a separate line, or I get each heading followed by its corresponding cell, running vertically rather than horizontally. I've attached the PDF I created to test and the resulting output.
Can anyone help please?
test3.pdf test3.csv