run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: UnstructuredElementNodeParser can't parse Table Element in html #14301

Closed hrhrng closed 2 days ago

hrhrng commented 2 weeks ago

Bug Description

I’m trying to run docs/docs/examples/query_engine/sec_tables/tesla_10q_table.ipynb, but no tables are being parsed. After debugging the source code, I found that the condition

if "unstructured.documents.html.HTMLTable" in str(type(element)):

in llama_index/core/node_parser/relational/unstructured_element.py:110 is never satisfied. It appears that the unstructured library returns an unstructured.elements.Table object instead of unstructured.documents.html.HTMLTable when it recognizes a table element in an HTML file. I suspect this discrepancy is why the tables are not recognized and parsed. Could this be due to the version of my unstructured library, or is there another cause?
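To illustrate the mismatch described above, here is a minimal sketch of a check that accepts either class name. The helper is_table_element and the set TABLE_CLASS_NAMES are hypothetical names introduced for illustration; they are not part of llama_index or unstructured, and this is not necessarily how the linked PR resolves the issue. Stand-in classes are used so the sketch runs without unstructured installed.

```python
# Hypothetical helper, not part of llama_index or unstructured.
# Assumption: table elements can be recognized by class name alone.
TABLE_CLASS_NAMES = {"HTMLTable", "Table"}


def is_table_element(element: object) -> bool:
    """Return True if the element's class, or any of its base classes,
    has a name listed in TABLE_CLASS_NAMES."""
    return any(cls.__name__ in TABLE_CLASS_NAMES for cls in type(element).__mro__)


# Stand-in classes that mimic the two names mentioned in the report.
class Table:
    pass


class HTMLTable(Table):
    pass


class NarrativeText:
    pass


print(is_table_element(Table()))          # both table variants match
print(is_table_element(HTMLTable()))
print(is_table_element(NarrativeText()))  # non-table elements do not
```

Matching on names across the class hierarchy avoids hard-coding a full module path, which is the part that seems to have changed between versions.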

Version

0.10.47

Steps to Reproduce

pip install unstructured and run docs/docs/examples/query_engine/sec_tables/tesla_10q_table.ipynb

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 2 weeks ago

Hey there, @hrhrng! I'm a bot here to help you with bugs, questions, and contributions. How can I assist you today?

The discrepancy between unstructured.documents.html.HTMLTable and unstructured.elements.Table could indeed be due to the version of your unstructured library. The notebook docs/docs/examples/query_engine/sec_tables/tesla_10q_table.ipynb does not specify the exact versions of the unstructured library being used.

To ensure tables are parsed correctly, verify that your installed version of the unstructured library is compatible with the notebook. You can check the installed version with the following command:

import unstructured
print(unstructured.__version__)

If there is a version mismatch, you may need to update or downgrade your unstructured library to match the version used in the notebook or the version that supports both unstructured.documents.html.HTMLTable and unstructured.elements.Table [1].
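As an alternative way to check the installed version, the sketch below reads the package metadata through the standard library's importlib.metadata, which does not depend on the package exposing a __version__ attribute. The helper name installed_version is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when unstructured is not installed.
print(installed_version("unstructured"))
```

Comparing this value before and after a reinstall is one way to confirm which version the notebook is actually running against.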

Additionally, the LlamaIndex documentation suggests an alternative method for parsing tables if the unstructured library fails to recognize them correctly. This involves using the html_to_df function to convert HTML table elements into DataFrames and then processing them accordingly [2][3].
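The idea of turning an HTML table into tabular data can be sketched with the standard library alone. The code below is not the html_to_df function mentioned above and does not produce a DataFrame; it is a simplified stand-in that returns a list of rows, and the names TableRowParser and html_table_to_rows are hypothetical. It assumes a flat table with no nested tables.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str) -> List[List[str]]:
    """Parse an HTML fragment and return its table rows as lists of cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = "<table><tr><th>Item</th><th>Value</th></tr><tr><td>Revenue</td><td>100</td></tr></table>"
print(html_table_to_rows(sample))
```

The resulting rows could then be passed to a DataFrame constructor, treating the first row as the header if the table has one.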


logan-markewich commented 2 weeks ago

Probably due to unstructured changing how their types work over time. Would need a PR to fix.

tibor-reiss commented 5 days ago

@hrhrng Could you please check if the linked PR fixes this?