run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: MarkdownElementNodeParser:pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 0 #13942

Open SmileLollipop opened 2 months ago

SmileLollipop commented 2 months ago

Bug Description

Error: _torch_pytree._register_pytree_node(
Traceback (most recent call last):
  File "d:\Baoxin\Program\knowledgeManager\module\b_extract copy.py", line 174, in <module>
    raw_nodes = parser.get_nodes_from_documents(md_doc)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\interface.py", line 129, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\base_element.py", line 120, in _parse_nodes
    nodes = self.get_nodes_from_node(node)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 25, in get_nodes_from_node
    elements = self.extract_elements(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 168, in extract_elements
    should_keep = all(tf(element) for tf in table_filters)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 168, in <genexpr>
    should_keep = all(tf(element) for tf in table_filters)
                      ^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 220, in filter_table
    table_df = md_to_df(table_element.element)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\utils.py", line 26, in md_to_df
    return pd.read_csv(StringIO(md_str))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 574, in pandas._libs.parsers.TextReader.__cinit__
  File "parsers.pyx", line 663, in pandas._libs.parsers.TextReader._get_header
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 0

Version

0.10.18

Steps to Reproduce

from llama_index.core.node_parser import MarkdownElementNodeParser

parser = MarkdownElementNodeParser(
    num_workers=3,
    include_metadata=True,
)

raw_nodes = parser.get_nodes_from_documents(md_doc)

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 2 months ago

To resolve the pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 0 error when using the MarkdownElementNodeParser from the llama_index package, ensure that the input data is correctly formatted and does not contain any unclosed strings or malformed tables. The MarkdownElementNodeParser class includes logic to handle different types of elements such as tables, text, and code blocks. It verifies that tables have the same number of columns in each row and at least two rows, which helps in avoiding malformed table errors.

Here is an example of how to handle such data:

from llama_index.legacy.llms import MockLLM
from llama_index.legacy.node_parser.relational.markdown_element import MarkdownElementNodeParser
from llama_index.legacy.schema import Document

def test_md_table_extraction() -> None:
    test_data = Document(
        text="""
# This is a test

| Year | Benefits |
| ---- | -------- |
| 2020 | 12,000   |
| 2021 | 10,000   |
| 2022 | 130,000  |

# This is another test

## Maybe a subheader

| Year | Benefits | age | customers |
| ---- | -------- | --- | --------- |
| 2020 | 12,000   | 12  | 100       |
| 2021 | 10,000   | 13  | 200       |
| 2022 | 130,000  | 14  | 300       |
        """
    )

    node_parser = MarkdownElementNodeParser(llm=MockLLM())

    nodes = node_parser.get_nodes_from_documents([test_data])
    print(f"Number of nodes: {len(nodes)}")
    for i, node in enumerate(nodes, start=0):
        print(f"Node {i}: {node}, Type: {type(node)}")
    assert len(nodes) == 6

test_md_table_extraction()

Additionally, the md_to_df function (located in node_parser/relational/utils.py, as the traceback shows) converts markdown tables to pandas DataFrames by doubling any double quotes, replacing pipes with quoted commas, and dropping the header separator row before handing the result to pd.read_csv. The code shown below does not itself validate the table structure, so a row that these string replacements leave with unbalanced quotes reaches pd.read_csv unchanged. Here is the relevant part of the md_to_df function:

from io import StringIO

import pandas as pd


def md_to_df(md_str: str) -> pd.DataFrame:
    """Convert Markdown to dataframe."""
    # Replace " by "" in md_str
    md_str = md_str.replace('"', '""')

    # Replace markdown pipe tables with commas
    md_str = md_str.replace("|", '","')

    # Remove the second line (table header separator)
    lines = md_str.split("\n")
    md_str = "\n".join(lines[:1] + lines[2:])

    # Remove the first two and last two chars of each line (the outer pipes, transformed to ",")
    lines = md_str.split("\n")
    md_str = "\n".join([line[2:-2] for line in lines])

    # Check if the table is empty
    if len(md_str) == 0:
        return None

    # Use pandas to read the CSV string into a DataFrame
    return pd.read_csv(StringIO(md_str))
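
One way to see how this function could produce an "EOF inside string" error is to repeat its string transformations without pandas and count the double quotes left on each row. The sketch below is illustrative only: the helper names csv_lines_like_md_to_df and rows_with_unbalanced_quotes are hypothetical, and it assumes the md_to_df body shown above. A row that lacks its closing pipe is one plausible trigger, not a confirmed diagnosis of the document in this report.

```python
def csv_lines_like_md_to_df(md_str: str) -> list[str]:
    """Repeat the string transformations of md_to_df shown above, without pandas."""
    md_str = md_str.replace('"', '""')   # escape existing double quotes
    md_str = md_str.replace("|", '","')  # turn every pipe into a quoted comma
    lines = md_str.split("\n")
    lines = lines[:1] + lines[2:]        # drop the header separator row
    return [line[2:-2] for line in lines]  # trim two chars from each end


def rows_with_unbalanced_quotes(md_str: str) -> list[int]:
    """Return indices of transformed rows holding an odd number of double quotes."""
    return [
        i
        for i, line in enumerate(csv_lines_like_md_to_df(md_str))
        if line.count('"') % 2 == 1
    ]


good = "| Year | Benefits |\n| ---- | -------- |\n| 2020 | 12,000 |"
# Hypothetical malformed input: the header row lacks its closing pipe.
bad = "| Year | Benefits\n| ---- | -------- |\n| 2020 | 12,000 |"

print(rows_with_unbalanced_quotes(good))  # []
print(rows_with_unbalanced_quotes(bad))   # [0]
```

An odd quote count means a quoted CSV field is opened and never closed on that row, which is the kind of input a CSV tokenizer reports as an unterminated string. Running a check like this on each table before parsing may help locate the offending row.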

Ensure that the markdown text in the Document object is correctly formatted and does not contain any unclosed strings or malformed tables. This should help in resolving the ParserError [1][2][3][4].


sheresaidon commented 1 month ago

+1 @logan-markewich is there a way to ignore errors on this?
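
As a stopgap, one option is to parse documents one at a time and skip those that raise. This is a minimal sketch, not a library feature: parse_skipping_failures and _StubParser are hypothetical names, the only API assumed is get_nodes_from_documents as used in the report above, and the stub stands in for MarkdownElementNodeParser so the example runs without llama_index installed.

```python
from typing import Any, Iterable, List, Tuple


def parse_skipping_failures(
    parser: Any, documents: Iterable[Any]
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Parse documents one by one; collect nodes and record the failures."""
    nodes: List[Any] = []
    failures: List[Tuple[Any, Exception]] = []
    for doc in documents:
        try:
            nodes.extend(parser.get_nodes_from_documents([doc]))
        except Exception as exc:  # deliberately broad; narrow it once the error type is known
            failures.append((doc, exc))
    return nodes, failures


class _StubParser:
    """Hypothetical stand-in for MarkdownElementNodeParser, for demonstration only."""

    def get_nodes_from_documents(self, documents: List[str]) -> List[str]:
        for doc in documents:
            if "broken" in doc:
                raise ValueError("Error tokenizing data")
        return [f"node:{doc}" for doc in documents]


nodes, failures = parse_skipping_failures(_StubParser(), ["a", "broken table", "b"])
print(nodes)          # ['node:a', 'node:b']
print(len(failures))  # 1
```

With the real parser, the same function would take the MarkdownElementNodeParser instance and the list of Document objects. The failures list keeps each skipped document together with its exception, so the malformed tables can be inspected afterwards instead of being silently dropped.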