vijayproxima commented 4 months ago

Hi,

In my PDF file, I have 4 tables (one for each of 4 regions) listing the holidays for a year. The tables have the columns Sr.No, Date, Day and Festival. The title on each table is "Region Name Holiday List 2024". However, when I execute the code below, neither a CSV file nor the pdfdocs.jsonl file is created; only the data.jsonl file is created.

import time

from llmware.configs import LLMWareConfig
from llmware.library import Library
from llmware.retrieval import Query

# input_data and output_data are folder paths defined elsewhere in the script

def parsing_the_pdfs():
    t0 = time.time()

    # Create a Library
    LLMWareConfig().set_active_db("sqlite")
    lib = Library().create_new_library("pdfdocs")

    # Parse and extract all of the contents from these documents
    # Add files to the library
    parsing_output = lib.add_files(input_folder_path=input_data)

    print("Update: parsing time :", time.time() - t0)
    print("Update: parsing output :", parsing_output)

    # Export all of the content of the library into jsonl files with metadata
    output1 = lib.export_library_to_jsonl_file(output_data, "data.jsonl")

    # Export all of the tables
    output2 = Query(lib).export_all_tables(query="Holiday", output_fp=output_data)

    return 0

p = parsing_the_pdfs()

This is the output when I execute the code:

Update: parsing time : 0.0057866573333740234
Update: parsing output : {'docs_added': 0, 'blocks_added': 0, 'images_added': 0, 'pages_added': 0, 'tables_added': 0, 'rejected_files': []}

noman1321 commented 3 days ago

Please assign this issue to me.

noman1321 commented 3 days ago

@vijayproxima You can work around this in 3 ways, each using a different PDF table-extraction library.

1. tabula (installed as tabula-py):

import tabula

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Extract tables from the PDF (all pages)
tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True)

# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')

# Export the tables to CSV files
for i, table in enumerate(tables):
    output_csv = f'outputtable{i}.csv'
    table.to_csv(output_csv, index=False)
    print(f'Table {i} saved to {output_csv}')

noman1321 commented 3 days ago

2. camelot (installed as camelot-py):

import camelot

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Extract tables from all pages of the PDF
tables = camelot.read_pdf(pdf_file, pages='all')

# Check how many tables were extracted
print(f'Total tables extracted: {len(tables)}')

# Export each extracted table to a separate CSV file
for i, table in enumerate(tables):
    output_csv = f'outputtable{i}.csv'
    table.to_csv(output_csv)
    print(f'Table {i} saved to {output_csv}')

noman1321 commented 3 days ago

3. pdfplumber:

import csv

import pdfplumber

# Path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Open the PDF with pdfplumber
with pdfplumber.open(pdf_file) as pdf:
    for page_number, page in enumerate(pdf.pages):
        # Extract tables from the page
        tables = page.extract_tables()

        # Check if any tables were found
        if tables:
            for i, table in enumerate(tables):
                # Save each table as a CSV
                output_csv = f'output_table_page_{page_number}_table_{i}.csv'

                # Write the table with the csv module so that cells containing
                # commas are quoted; None cells are written as empty strings
                with open(output_csv, 'w', newline='') as f:
                    csv.writer(f).writerows(table)

                print(f'Table {i} from page {page_number} saved to {output_csv}')
        else:
            print(f'No tables found on page {page_number}')
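Not every grid detected on a page is necessarily one of the holiday tables, so it may help to keep only tables whose header row matches the columns described in the question (Sr.No, Date, Day, Festival). The following is a minimal sketch, assuming the tables arrive as lists of rows, the way `page.extract_tables()` returns them. The helper names `normalize` and `is_holiday_table` are hypothetical, the sample row is illustrative only, and the exact header spelling in the actual PDF may differ:

```python
# Expected header, compared after lower-casing and removing whitespace.
# Assumption: the PDF uses these four column names, as stated in the question.
EXPECTED_HEADER = ["sr.no", "date", "day", "festival"]


def normalize(cell):
    """Lower-case a cell and remove all whitespace; None becomes ''."""
    return "".join(str(cell or "").split()).lower()


def is_holiday_table(table):
    """Return True if the first row of the table matches the expected header."""
    if not table:
        return False
    return [normalize(cell) for cell in table[0]] == EXPECTED_HEADER


# Illustrative sample, not taken from the PDF in the question.
sample = [
    ["Sr. No", "Date", "Day", "Festival"],
    ["1", "26-Jan-2024", "Friday", "Republic Day"],
]
print(is_holiday_table(sample))
print(is_holiday_table([["Page", "1"]]))
```

A filter like this could be applied inside the loop over `tables` before writing each CSV, so that stray layout grids are skipped.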

Please let me know if any of this is helpful for your repository.
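One more thing worth checking in the original report: the parsing output shows 'docs_added': 0 with an empty 'rejected_files' list and a parsing time of a few milliseconds, which suggests that no files were picked up from the input folder at all, so there would be no tables to export. The following is a minimal sketch for verifying the folder before calling `lib.add_files`, using only the standard library. The helper name `list_pdfs` and the folder path are hypothetical:

```python
from pathlib import Path


def list_pdfs(folder):
    """Return sorted PDF paths directly inside folder, or [] if it is not a directory."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Hypothetical path; replace with the value passed as input_folder_path.
input_data = "path/to/your/pdf/folder"

pdfs = list_pdfs(input_data)
print(f"Found {len(pdfs)} PDF file(s) in {input_data!r}")
for pdf_path in pdfs:
    print(" -", pdf_path.name)
```

If this reports zero files, the path (for example a relative path resolved from a different working directory) is the first thing to look at before investigating the table export itself.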

llmware-ai / llmware

Tables in pdf not getting saved into csv file #824
