Closed mlkorra closed 4 months ago
Hey @mlkorra! Great to see you diving deep into the gears of LlamaIndex again. How's everything going on your end?
Based on the current implementation of the MarkdownElementNodeParser class, making the extract_table_summaries method optional could potentially lead to several performance improvements for the Retrieval Engine. These improvements could include reduced processing time, lower memory usage, simplified data structures, increased focus on the primary task of table extraction, and added flexibility for customization.
The extract_table_summaries method likely involves analyzing the content of tables to generate summaries. If this step is skipped, the Retrieval Engine can save on the computational time required for this analysis, leading to faster processing of markdown documents. Additionally, generating summaries for tables could require additional memory to hold the intermediate data structures needed for analysis and summary generation. By making this step optional and skipping it when not needed, the Retrieval Engine could potentially use less memory.
Without the need to store table summaries, the data structure used to represent extracted tables could be simplified. This could make the handling of these data structures more straightforward and potentially improve the efficiency of operations performed on them. By concentrating solely on extracting tables, the Retrieval Engine can optimize its performance for this primary task. This could lead to more efficient algorithms and methods being developed for table extraction, further improving the performance of the Retrieval Engine.
Finally, making extract_table_summaries optional adds a layer of flexibility to the Retrieval Engine. Users or downstream applications that only need table data without summaries can configure the engine to meet their specific needs, potentially leading to better overall system performance due to the reduced workload.
However, it's important to note that the exact functionality of the extract_table_summaries method and its impact on the overall performance of the Retrieval Engine would depend on its implementation details, which are not provided in the current context. Therefore, while the potential benefits outlined above are plausible, they are speculative and would need to be confirmed through performance testing and analysis.
As for the current requirements for using the MarkdownElementNodeParser class, the extraction of table summaries appears to be a part of the process when parsing markdown elements. However, it's not explicitly stated whether this step is always necessary or if it can be bypassed under certain conditions. The implementation and conditional logic within extract_table_summaries or its usage are not detailed in the provided code snippet.
For more details, you can refer to the source code of the MarkdownElementNodeParser class in the llama-index-core/llama_index/core/node_parser/relational/markdown_element.py file.
The summaries are used directly for retrieval. Without them, there is no "retrieve a table by its summary and then replace it with the actual table" process.
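To make the "retrieve by summary, then swap in the table" idea concrete, here is a toy sketch using only the standard library. It is not LlamaIndex's implementation: the class and function names (TableNode, SummaryNode, retrieve_table) are hypothetical, and a simple word-overlap score stands in for embedding similarity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TableNode:
    """Holds the full table text (hypothetical stand-in for a table node)."""
    node_id: str
    markdown: str


@dataclass
class SummaryNode:
    """Holds the searchable summary plus a pointer back to its table."""
    summary: str
    table_id: str


def overlap_score(query: str, text: str) -> int:
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_table(
    query: str,
    summaries: List[SummaryNode],
    tables: Dict[str, TableNode],
) -> Optional[TableNode]:
    """Match the query against summaries, then return the table they point to."""
    if not summaries:
        # No summaries means there is nothing to match the query against.
        return None
    best = max(summaries, key=lambda s: overlap_score(query, s.summary))
    if overlap_score(query, best.summary) == 0:
        return None
    # The "replace" step: hand back the actual table, not the summary.
    return tables[best.table_id]
```

In this sketch, an empty summaries list leaves the retrieval step with nothing to search, which is the situation described above when summary generation is skipped.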
@logan-markewich Yes, that makes sense. Meanwhile, I was trying to extract the tables from a sample PDF with tables (attached image) by modifying the get_nodes_from_node function to print the table elements, but it doesn't return any tables and only prints an empty list. What might be the issue here?
What did you use to parse the document? I would expect something like llamaparse to work fine here
Sorry for the late reply. I was using llamaparse to extract the table.
Hmm, it looks like a scanned image, so it relies on OCR, and table extraction may fail
Can you share the actual pdf? Or is it actually just that image?
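One way to narrow down an empty table list is to check whether the parsed markdown contains any pipe tables at all before it reaches the node parser. The sketch below is a rough stdlib-only check, assuming the document parser emits GitHub-style pipe tables (a header row followed by a delimiter row such as |---|---|). The function names are hypothetical, and this is not how MarkdownElementNodeParser detects tables internally.

```python
import re
from typing import List


def is_delimiter_row(line: str) -> bool:
    """True for rows like '|---|:---:|' whose cells contain only dashes and colons."""
    if "|" not in line:
        return False
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return all(re.fullmatch(r":?-+:?", c) for c in cells)


def find_markdown_tables(text: str) -> List[str]:
    """Return each pipe table found in the text as one multi-line string."""
    lines = text.splitlines()
    tables = []
    i = 0
    while i < len(lines) - 1:
        # A table starts with a header row immediately followed by a delimiter row.
        if "|" in lines[i] and is_delimiter_row(lines[i + 1]):
            start = i
            i += 2
            # Consume body rows until a line without a pipe ends the table.
            while i < len(lines) and "|" in lines[i]:
                i += 1
            tables.append("\n".join(lines[start:i]))
        else:
            i += 1
    return tables
```

If this returns an empty list for the parsed output of a scanned page, the table was probably lost during OCR or parsing rather than in the node parser.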
Question
Currently, I see that whenever MarkdownElementNodeParser is used, it inherently requires an LLM to extract table summaries. Can we make this optional when the requirement is only to extract the tables, without any summary generation? What would be the implications of that for the performance of the Retrieval Engine?