llmware-ai / llmware

Unified framework for building enterprise RAG pipelines with small, specialized models
https://llmware-ai.github.io/llmware/
Apache License 2.0
4.22k stars 812 forks

Issues with Table Extraction Accuracy in Sample #162

Open li-rongzhi opened 7 months ago

li-rongzhi commented 7 months ago

Problem Description

When running the sample provided with the llmware project, I found that table extraction is inaccurate: not all tables are extracted correctly. One example is Annual_Report_2003.pdf, a file used in the sample.

Steps to Reproduce

1. Run the sample extraction process as shown in the examples.
2. Set the query parameter to an empty string.
3. Review the output and compare it with the tables in the source documents.

Only one table was extracted correctly, and only in part: a portion of the table spanning pages 44-46.

Expected Outcome

All tables in the sample documents should be identified and extracted accurately. The tables in this file that should be extracted are shown in the screenshots below.

[8 screenshots of the expected tables, captured 2023-12-07 between 11:26 AM and 11:28 AM]

Actual Outcome

Only one CSV file is produced (see table_0.csv). The contents of all the other tables do appear in the JSON output, but they are labeled as text rather than as tables.
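One way to quantify the mislabeling is to count the parsed blocks by their label and flag text blocks that look numeric enough to be table rows. The sketch below is only a rough diagnostic, not part of llmware: it assumes each block in the JSON output is a dict with `content_type` and `text` keys, and the function names, sample records, and thresholds are all hypothetical.

```python
from collections import Counter


def summarize_blocks(blocks):
    """Count parsed blocks by their content-type label (assumed key: 'content_type')."""
    return Counter(b.get("content_type", "unknown") for b in blocks)


def looks_tabular(text, min_numeric_ratio=0.5, min_tokens=6):
    """Crude heuristic: a block dominated by numeric tokens may be table content."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False
    numeric = 0
    for token in tokens:
        # Drop surrounding currency/percent/bracket characters, then separators.
        core = token.strip("$%(),.-").replace(",", "").replace(".", "")
        if core.isdigit():
            numeric += 1
    return numeric / len(tokens) >= min_numeric_ratio


def suspect_tables(blocks):
    """Return blocks labeled as text whose content looks like table rows."""
    return [
        b for b in blocks
        if b.get("content_type") == "text" and looks_tabular(b.get("text", ""))
    ]


# Hypothetical records standing in for the parser's JSON output.
sample = [
    {"content_type": "table", "text": "Revenue 2003 2002"},
    {"content_type": "text",
     "text": "Net income 1,204 980 1,530 Total assets 15,300 14,100 13,950"},
    {"content_type": "text",
     "text": "The company continued to expand its operations during the year."},
]

print(summarize_blocks(sample))
for block in suspect_tables(sample):
    print("possible table labeled as text:", block["text"])
```

With the real output, `sample` would be replaced by the records loaded from the JSON file. The numeric-ratio threshold is arbitrary and would likely need tuning for financial reports, where table rows often mix long labels with a few figures.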

Potential Impact

This issue may lead to incomplete or inaccurate data capture, which can affect the integrity of data analysis and further processing steps.

Request for Assistance

I would appreciate any guidance on how to resolve this issue or any suggested workarounds. Additionally, if there are any plans to improve the table extraction feature in the near future, information on that would also be helpful.

li-rongzhi commented 7 months ago

I also dug into the repo to investigate further. It seems that the implementation of the method add_pdf_main_llmware is not exposed in the source. Could you share more details about the layout detection model integrated into your algorithm? Any information about the design choices, data structures, or algorithms used would be very helpful.

[Screenshot, captured 2023-12-07 at 1:12 PM]