run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.47k stars 5.21k forks

[Question]: Implications of making extract_table_summaries in MarkdownElementNodeParser optional #11915

Closed: mlkorra closed this issue 4 months ago

mlkorra commented 7 months ago

Question

Currently, whenever MarkdownElementNodeParser is used, it inherently requires an LLM to extract table summaries. Can we make this optional for cases where the requirement is only to extract the tables, without any summary generation? What would be the implications of that for the performance of the Retrieval Engine?

def get_nodes_from_node(self, node: TextNode) -> List[BaseNode]:
    """Get nodes from node."""
    elements = self.extract_elements(
        node.get_content(),
        table_filters=[self.filter_table],
        node_id=node.id_,
    )
    table_elements = self.get_table_elements(elements)
    # extract summaries over table elements
    self.extract_table_summaries(table_elements)
    # convert into nodes
    # will return a list of Nodes and Index Nodes
    return self.get_nodes_from_elements(elements, node.metadata)

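To make the question concrete, here is a minimal sketch of what an opt-out flag could look like. It is a toy stand-in, not the LlamaIndex class: `SketchTableParser`, `TableElement`, the `extract_summaries` flag, and the `summarize` callable are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TableElement:
    """Hypothetical stand-in for a parsed table element."""
    markdown: str
    summary: Optional[str] = None


class SketchTableParser:
    """Toy parser (not the LlamaIndex class) that only shows the optional flag."""

    def __init__(
        self,
        summarize: Optional[Callable[[str], str]] = None,
        extract_summaries: bool = True,
    ) -> None:
        self.summarize = summarize  # stand-in for the LLM call
        self.extract_summaries = extract_summaries

    def extract_tables(self, text: str) -> List[TableElement]:
        # Group consecutive lines that start with "|" into one table.
        tables: List[TableElement] = []
        current: List[str] = []
        for line in text.splitlines() + [""]:
            if line.strip().startswith("|"):
                current.append(line.strip())
            elif current:
                tables.append(TableElement("\n".join(current)))
                current = []
        return tables

    def parse(self, text: str) -> List[TableElement]:
        tables = self.extract_tables(text)
        if self.extract_summaries:
            if self.summarize is None:
                raise ValueError("extract_summaries=True needs a summarize callable")
            for table in tables:
                table.summary = self.summarize(table.markdown)
        return tables
```

With `extract_summaries=False` the sketch returns tables whose `summary` stays `None` and never calls the summarizer, which is the behavior being asked about.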
dosubot[bot] commented 7 months ago

Hey @mlkorra! Great to see you diving deep into the gears of LlamaIndex again. How's everything going on your end?

Based on the current implementation of the MarkdownElementNodeParser class, making the extract_table_summaries method optional could potentially lead to several performance improvements for the Retrieval Engine. These improvements could include reduced processing time, lower memory usage, simplified data structures, increased focus on the primary task of table extraction, and added flexibility for customization.

The extract_table_summaries method likely involves analyzing the content of tables to generate summaries. If this step is skipped, the Retrieval Engine can save on the computational time required for this analysis, leading to faster processing of markdown documents. Additionally, generating summaries for tables could require additional memory to hold the intermediate data structures needed for analysis and summary generation. By making this step optional and skipping it when not needed, the Retrieval Engine could potentially use less memory.

Without the need to store table summaries, the data structure used to represent extracted tables could be simplified. This could make the handling of these data structures more straightforward and potentially improve the efficiency of operations performed on them. By concentrating solely on extracting tables, the Retrieval Engine can optimize its performance for this primary task. This could lead to more efficient algorithms and methods being developed for table extraction, further improving the performance of the Retrieval Engine.

Finally, making extract_table_summaries optional adds a layer of flexibility to the Retrieval Engine. Users or downstream applications that only need table data without summaries can configure the engine to meet their specific needs, potentially leading to better overall system performance due to the reduced workload.

However, it's important to note that the exact functionality of the extract_table_summaries method and its impact on the overall performance of the Retrieval Engine would depend on its implementation details, which are not provided in the current context. Therefore, while the potential benefits outlined above are plausible, they are speculative and would need to be confirmed through performance testing and analysis.

As for the current requirements for using the MarkdownElementNodeParser class, the extraction of table summaries appears to be a part of the process when parsing markdown elements. However, it's not explicitly stated whether this step is always necessary or if it can be bypassed under certain conditions. The implementation and conditional logic within extract_table_summaries or its usage are not detailed in the provided code snippet.

For more details, you can refer to the source code of the MarkdownElementNodeParser class in the llama-index-core/llama_index/core/node_parser/relational/markdown_element.py file.

logan-markewich commented 7 months ago

The summaries are used directly for retrieval. Without them, there is no "retrieve a table by its summary and then replace it with the actual table" process.
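A rough, self-contained sketch of that "retrieve by summary, then swap in the table" idea follows. It is not the LlamaIndex implementation: `IndexEntry`, `overlap`, and `retrieve_table` are hypothetical names, and the word-overlap score is only a stand-in for embedding similarity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexEntry:
    summary: str   # text that is matched against the query
    table_id: str  # pointer to the full table


def overlap(query: str, text: str) -> int:
    """Crude similarity: number of shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_table(query: str, entries: List[IndexEntry], tables: Dict[str, str]) -> str:
    """Match the query against summaries, then return the table the best match points to."""
    best = max(entries, key=lambda entry: overlap(query, entry.summary))
    return tables[best.table_id]


tables = {
    "t1": "| year | revenue |\n|---|---|\n| 2022 | 10 |",
    "t2": "| name | role |\n|---|---|\n| Ann | CTO |",
}
entries = [
    IndexEntry("yearly revenue figures for the company", "t1"),
    IndexEntry("staff names and their role", "t2"),
]
result = retrieve_table("what was the revenue", entries, tables)
```

In this sketch the query never touches the raw table text. If the summaries are missing, there is nothing for the query to match against, so the pointer to the table is never followed.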

mlkorra commented 7 months ago

@logan-markewich Yes, that makes sense. Meanwhile, I was trying to extract the tables from a sample PDF that contains tables (image attached). I modified the get_nodes_from_node function to print the table elements, but it doesn't return any tables and only prints an empty list. What might be the issue here?

image

logan-markewich commented 7 months ago

What did you use to parse the document? I would expect something like LlamaParse to work fine here.

mlkorra commented 7 months ago

Sorry for the late reply. I was using LlamaParse to extract the table.

logan-markewich commented 7 months ago

Hmm, it looks like a scanned image, so it relies on OCR, and table extraction may fail.

Can you share the actual pdf? Or is it actually just that image?
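One way to narrow this down is to check whether the Markdown text coming out of the PDF parser contains any pipe-table syntax at all before it reaches the node parser. The sketch below assumes tables are expected as Markdown pipe tables with a `|---|---|` separator row. That is an assumption about the expected input, not something confirmed in this thread, and `is_separator_row` and `find_table_rows` are hypothetical helper names.

```python
from typing import List


def is_separator_row(line: str) -> bool:
    """True for rows like |---|:---:| that mark a Markdown pipe table."""
    stripped = line.strip()
    return "|" in stripped and "-" in stripped and set(stripped) <= set("|-: ")


def find_table_rows(markdown: str) -> List[int]:
    """Return 0-based line numbers of separator rows (one per pipe table)."""
    return [
        index
        for index, line in enumerate(markdown.splitlines())
        if is_separator_row(line)
    ]


parsed_with_table = "| a | b |\n|---|---|\n| 1 | 2 |"
parsed_ocr_text = "a b\n1 2"

rows_found = find_table_rows(parsed_with_table)
rows_missing = find_table_rows(parsed_ocr_text)
```

If the list is empty for the parsed document, the table was most likely lost at the PDF parsing or OCR step, before any table element extraction could run.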

mlkorra commented 7 months ago

This is the actual pdf @logan-markewich table.pdf