Open vijayproxima opened 4 months ago
assign me this issue
@vijayproxima you can fix this in 3 ways:

1. Using tabula:

import tabula
pdf_file = 'your_pdf_file.pdf'
tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True)
print(f'Total tables extracted: {len(tables)}')
for i, table in enumerate(tables):
    output_csv = f'outputtable{i}.csv'
    table.to_csv(output_csv, index=False)
    print(f'Table {i} saved to {output_csv}')

2. Using camelot:

import camelot
pdf_file = 'your_pdf_file.pdf'
tables = camelot.read_pdf(pdf_file, pages='all')
print(f'Total tables extracted: {len(tables)}')
for i, table in enumerate(tables):
    output_csv = f'outputtable{i}.csv'
    table.to_csv(output_csv)
    print(f'Table {i} saved to {output_csv}')

3. Using pdfplumber:

import pdfplumber
pdf_file = 'your_pdf_file.pdf'
with pdfplumber.open(pdf_file) as pdf:
    for page_number, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        # Check if any tables were found
        if tables:
            for i, table in enumerate(tables):
                # Save each table as a CSV
                output_csv = f'output_table_page_{page_number}_table_{i}.csv'
                # Writing table to CSV
                with open(output_csv, 'w') as f:
                    for row in table:
                        f.write(','.join(str(cell) for cell in row) + '\n')
                print(f'Table {i} from page {page_number} saved to {output_csv}')
        else:
            print(f'No tables found on page {page_number}')
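
One caveat with the pdfplumber loop above: joining cells with ',' by hand produces a broken CSV whenever a cell itself contains a comma (a date such as "Jan 26, 2024", for example), and empty cells, which as far as I know can come back as None, get written as the text "None". Below is a minimal sketch of a safer writer using only the standard csv module. The helper name write_table_csv and the sample rows are made up for illustration; they are not part of pdfplumber.

```python
import csv
import io


def write_table_csv(table, fileobj):
    """Write a list-of-rows table as CSV, quoting cells where needed.

    Hypothetical helper: `table` is assumed to be a list of rows, each a
    list of cell values (str or None), like the output of extract_tables().
    """
    writer = csv.writer(fileobj)
    for row in table:
        # Treat missing cells (None) as empty and flatten in-cell line breaks.
        cleaned = ["" if cell is None else str(cell).replace("\n", " ") for cell in row]
        writer.writerow(cleaned)


# Made-up sample rows, only to show the quoting behaviour.
sample = [
    ["Sr.No", "Date", "Day", "Festival"],
    ["1", "Jan 26, 2024", "Friday", None],
]
buffer = io.StringIO()
write_table_csv(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of the in-memory buffer, open it with open(output_csv, 'w', newline='') so the csv module controls the line endings.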
Please let me know if any of this is helpful for your repository.
Hi, my PDF file has 4 tables (one for each of 4 regions) listing the holidays for a year. Each table has the columns Sr.No, Date, Day and Festival, and the title on each table is "Region Name Holiday List 2024". However, when I execute this code, no CSV file is created and the pdfdocs.jsonl file is not created either. Only the data.jsonl file is created.

def parsing_the_pdfs():
    t0 = time.time()
    # Create a Library

p = parsing_the_pdfs()

This is the output when I execute the code:

Update: parsing time : 0.0057866573333740234
Update: parsing output : {'docs_added': 0, 'blocks_added': 0, 'images_added': 0, 'pages_added': 0, 'tables_added': 0, 'rejected_files': []}
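
One possible reading of this output, and it is only a guess since the function body is not shown: 'docs_added': 0 with an empty 'rejected_files' list and a parsing time of a few milliseconds may mean that no files reached the parser at all, for example because the input folder path is wrong or empty. A quick sanity check before parsing could look like the sketch below. The helper name list_pdfs and the folder path in the usage comment are hypothetical and not part of any library.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name.

    Hypothetical helper for debugging; raises if the folder is missing.
    """
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"Input folder does not exist: {path}")
    # Match the extension case-insensitively so 'Holidays.PDF' is found too.
    return sorted(
        p for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Usage (the path is a placeholder, replace it with your own input folder):
# pdfs = list_pdfs("/path/to/your/input_folder")
# print(f"Found {len(pdfs)} PDF file(s): {[p.name for p in pdfs]}")
```

If this prints 0 files, the problem would be the folder passed to the parser rather than the table extraction itself.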