Open li-rongzhi opened 7 months ago
I also dug into the repo for further investigation. It seems that the implementation of the method add_pdf_main_llmware is hidden. Could I get more details about the layout detection model integrated into your algorithm? Any information about the design choices, data structures, or algorithms used would be very helpful.
Problem Description
When using the sample provided by the llmware project, I've encountered issues with the accuracy of table extraction. Specifically, not all tables are extracted correctly. One example is the file Annual_Report_2003.pdf, which is used in the sample.
Steps to Reproduce
Run the sample extraction process as in the examples. Set the query parameter to an empty string. Review the output and compare it to the expected tables within the documents. Only one table, or more precisely part of the table spanning pages 44-46, was extracted correctly.
Expected Outcome
All tables within the sample documents should be identified and extracted accurately. The tables in this file that should be extracted are shown in the attached screenshots.
Actual Outcome
Only one CSV file is produced; please refer to the output table_0.csv. All other table contents, although extracted into the JSON file, are labeled as text.
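To make the mismatch easier to verify, here is a minimal sketch of how the output could be checked. It is not part of llmware: the helper names are hypothetical, the table_*.csv pattern is inferred from the single file name table_0.csv above, and the label key "content_type" as well as the list-of-dicts layout of the parsed JSON blocks are assumptions that may need adjusting to the actual output.

```python
from collections import Counter
from pathlib import Path


def list_table_csvs(output_dir):
    """Return the sorted names of exported table CSVs in output_dir.

    Assumes exported tables follow the table_<n>.csv naming seen above.
    """
    return sorted(p.name for p in Path(output_dir).glob("table_*.csv"))


def tally_labels(blocks, label_key="content_type"):
    """Count parsed blocks by label.

    `blocks` is assumed to be a list of dicts loaded from the JSON output;
    `label_key` is an assumed field name. Blocks lacking the key are
    counted under 'unknown'.
    """
    return Counter(block.get(label_key, "unknown") for block in blocks)
```

Comparing the number of names returned by list_table_csvs with the number of tables visible in the PDF, and checking how many blocks tally_labels reports as text versus table, would show how many tables ended up labeled as text.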
Potential Impact
This issue may lead to incomplete or inaccurate data capture, which can affect the integrity of data analysis and further processing steps.
Request for Assistance
I would appreciate any guidance on how to resolve this issue or any suggested workarounds. Additionally, if there are any plans to improve the table extraction feature in the near future, information on that would also be helpful.