llmware-ai / llmware

Unified framework for building enterprise RAG pipelines with small, specialized models
https://llmware-ai.github.io/llmware/
Apache License 2.0
4.63k stars 851 forks

Tables in pdf not getting saved into csv file #824

Open vijayproxima opened 4 months ago

vijayproxima commented 4 months ago

Hi, my PDF file has 4 tables (one per region) listing the holidays for a year. Each table has the columns Sr.No, Date, Day and Festival, and is titled "Region Name Holiday List 2024". However, when I run the code below, neither a CSV file nor the pdfdocs.jsonl file is created; only the data.jsonl file is created.

    import time

    from llmware.configs import LLMWareConfig
    from llmware.library import Library
    from llmware.retrieval import Query

    def parsing_the_pdfs():
        t0 = time.time()

        # Create a library
        LLMWareConfig().set_active_db("sqlite")
        lib = Library().create_new_library("pdfdocs")

        # Parse and extract all of the contents from the documents and add them to the library
        # (input_data and output_data are folder paths defined elsewhere)
        parsing_output = lib.add_files(input_folder_path=input_data)

        print("Update: parsing time :", time.time() - t0)
        print("Update: parsing output :", parsing_output)

        # Export all of the content of the library into jsonl files with metadata
        output1 = lib.export_library_to_jsonl_file(output_data, "data.jsonl")

        # Export all of the tables
        output2 = Query(lib).export_all_tables(query="Holiday", output_fp=output_data)

        return 0

    p = parsing_the_pdfs()

This is the output when I execute the code:

    Update: parsing time : 0.0057866573333740234
    Update: parsing output : {'docs_added': 0, 'blocks_added': 0, 'images_added': 0, 'pages_added': 0, 'tables_added': 0, 'rejected_files': []}
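The parsing output above reports 'docs_added': 0 after only a few milliseconds, which suggests that no files were picked up from the input folder at all, so the table export would have nothing to write. A minimal sketch of a pre-check using only the standard library follows; the helper name list_pdf_files is hypothetical and not part of llmware.

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the sorted PDF files found directly inside `folder`.

    Hypothetical helper for illustration only; not part of llmware.
    Raises FileNotFoundError if the folder does not exist.
    """
    folder_path = Path(folder)
    if not folder_path.is_dir():
        raise FileNotFoundError(f"Input folder not found: {folder_path}")
    # Match the .pdf extension case-insensitively
    return sorted(
        p for p in folder_path.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling list_pdf_files(input_data) before lib.add_files(...) and printing the result would show whether the folder path is wrong or simply contains no PDF files; an empty list would be consistent with the zero counts in the output above.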

noman1321 commented 3 days ago

assign me this issue

noman1321 commented 3 days ago

@vijayproxima you can fix this in 3 ways.

1. Using tabula:

    import tabula

    # Path to your PDF file
    pdf_file = 'your_pdf_file.pdf'

    # Extract tables from the PDF (all pages)
    tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True)

    # Check how many tables were extracted
    print(f'Total tables extracted: {len(tables)}')

    # Export the tables to CSV files
    for i, table in enumerate(tables):
        output_csv = f'output_table_{i}.csv'
        table.to_csv(output_csv, index=False)
        print(f'Table {i} saved to {output_csv}')

noman1321 commented 3 days ago

2. Using camelot:

    import camelot

    # Path to your PDF file
    pdf_file = 'your_pdf_file.pdf'

    # Extract tables from all pages of the PDF
    tables = camelot.read_pdf(pdf_file, pages='all')

    # Check how many tables were extracted
    print(f'Total tables extracted: {len(tables)}')

    # Export each extracted table to a separate CSV file
    for i, table in enumerate(tables):
        output_csv = f'output_table_{i}.csv'
        table.to_csv(output_csv)
        print(f'Table {i} saved to {output_csv}')

noman1321 commented 3 days ago

3. Using pdfplumber:

    import pdfplumber

    # Path to your PDF file
    pdf_file = 'your_pdf_file.pdf'

    # Open the PDF with pdfplumber
    with pdfplumber.open(pdf_file) as pdf:
        for page_number, page in enumerate(pdf.pages):
            # Extract tables from the page
            tables = page.extract_tables()

            # Check if any tables were found
            if tables:
                for i, table in enumerate(tables):
                    # Save each table as a CSV
                    output_csv = f'output_table_page_{page_number}_table_{i}.csv'

                    # Write the table to CSV, one row per line
                    with open(output_csv, 'w') as f:
                        for row in table:
                            f.write(','.join(str(cell) for cell in row) + '\n')

                    print(f'Table {i} from page {page_number} saved to {output_csv}')
            else:
                print(f'No tables found on page {page_number}')
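One caveat with joining cells by hand: a cell that contains a comma (for example a festival name) would split into two columns, and empty cells, which pdfplumber returns as None, would be written as the text "None". A minimal sketch of a safer writer using the standard csv module follows; the helper name write_table_csv is hypothetical and is not part of pdfplumber.

```python
import csv


def write_table_csv(table, output_csv):
    """Write one table (a list of rows, each a list of cells) to a CSV file.

    Hypothetical helper for illustration only. Empty cells (None) are
    written as empty strings, and csv.writer quotes any cell that
    contains a comma, quote character, or newline.
    """
    # newline="" lets the csv module control line endings itself
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in table:
            writer.writerow(["" if cell is None else cell for cell in row])
```

Inside the loop above, the inner "with open(...)" block could then be replaced by a single call such as write_table_csv(table, output_csv).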

Please let me know if any of this is helpful for your repository.