Open SmileLollipop opened 2 months ago
To resolve the pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 0 error when using the MarkdownElementNodeParser from the llama_index package, ensure that the input data is correctly formatted and does not contain any unclosed strings or malformed tables. The MarkdownElementNodeParser class includes logic to handle different types of elements such as tables, text, and code blocks. It verifies that tables have the same number of columns in each row and at least two rows, which helps in avoiding malformed table errors.
Here is an example of how to handle such data:
from llama_index.legacy.llms import MockLLM
from llama_index.legacy.node_parser.relational.markdown_element import MarkdownElementNodeParser
from llama_index.legacy.schema import Document


def test_md_table_extraction() -> None:
    test_data = Document(
        text="""
# This is a test
| Year | Benefits |
| ---- | -------- |
| 2020 | 12,000 |
| 2021 | 10,000 |
| 2022 | 130,000 |
# This is another test
## Maybe a subheader
| Year | Benefits | age | customers |
| ---- | -------- | --- | --------- |
| 2020 | 12,000 | 12 | 100 |
| 2021 | 10,000 | 13 | 200 |
| 2022 | 130,000 | 14 | 300 |
"""
    )
    node_parser = MarkdownElementNodeParser(llm=MockLLM())
    nodes = node_parser.get_nodes_from_documents([test_data])
    print(f"Number of nodes: {len(nodes)}")
    for i, node in enumerate(nodes, start=0):
        print(f"Node {i}: {node}, Type: {type(node)}")
    assert len(nodes) == 6


test_md_table_extraction()
Additionally, the md_to_df function (located in relational\utils.py according to the traceback in the bug description, rather than on the MarkdownElementNodeParser class itself) converts markdown tables to pandas DataFrames. It escapes double quotes, replaces pipes with quoted commas, drops the header separator line, and hands the result to pd.read_csv. The excerpt below does not itself check column or row counts, so a row that breaks the expected pipe layout can reach pd.read_csv with unbalanced quotes. Here is the relevant part of the md_to_df function:
from io import StringIO

import pandas as pd


def md_to_df(md_str: str) -> pd.DataFrame:
    """Convert Markdown to dataframe."""
    # Replace " by "" in md_str
    md_str = md_str.replace('"', '""')
    # Replace markdown pipe tables with commas
    md_str = md_str.replace("|", '","')
    # Remove the second line (table header separator)
    lines = md_str.split("\n")
    md_str = "\n".join(lines[:1] + lines[2:])
    # Remove the first two and last two chars of each line (the outer pipes, transformed to ",")
    lines = md_str.split("\n")
    md_str = "\n".join([line[2:-2] for line in lines])
    # Check if the table is empty
    if len(md_str) == 0:
        return None
    # Use pandas to read the CSV string into a DataFrame
    return pd.read_csv(StringIO(md_str))
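To see how a table row can end up with an unclosed quote, the string rewrites from the excerpt can be traced without pandas. The following is a minimal sketch: md_table_to_csv_text is a hypothetical helper (not part of llama_index) that repeats only the replace, split, and slice steps, and both sample tables are made up for illustration. A row without outer pipes loses real characters to the slice and leaves a quote that opens a field but never closes it, which is the kind of text pd.read_csv may reject with "EOF inside string".

```python
def md_table_to_csv_text(md_str: str) -> str:
    """Repeat the string rewrites from the md_to_df excerpt, without pd.read_csv.

    Hypothetical helper for illustration only; not part of llama_index.
    """
    # Escape existing double quotes, then turn every pipe into a quoted comma.
    md_str = md_str.replace('"', '""')
    md_str = md_str.replace("|", '","')
    # Drop the second line (the header separator).
    lines = md_str.split("\n")
    lines = lines[:1] + lines[2:]
    # Trim two characters from each end of every line.
    return "\n".join(line[2:-2] for line in lines)


# Made-up sample tables: one with outer pipes, one without.
well_formed = "| Year | Benefits |\n| ---- | -------- |\n| 2020 | 12,000 |"
no_outer_pipes = "Year | Benefits\n---- | --------\n2020 | 12,000"

print(md_table_to_csv_text(well_formed))
print(md_table_to_csv_text(no_outer_pipes))
```

In the well-formed case every field is wrapped in a matching pair of quotes. In the second case the slice removes the first and last two characters of the actual cell text, so the quotes no longer pair up around fields.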
Ensure that the markdown text in the Document object is correctly formatted and does not contain any unclosed strings or malformed tables. This should help in resolving the ParserError [1][2][3][4].
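One way to check the input before parsing is to scan the markdown for lines that contain a pipe but are not wrapped in outer pipes. The sketch below is an assumption about what "correctly formatted" means here, not the parser's own validation; find_suspect_table_rows is a hypothetical name, and the heuristic will also flag pipes that appear in ordinary prose or code.

```python
from typing import List, Tuple


def find_suspect_table_rows(md_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for pipe-containing lines lacking outer pipes.

    Hypothetical pre-check; a heuristic, not llama_index's table validation.
    """
    suspects = []
    for lineno, line in enumerate(md_text.splitlines(), start=1):
        stripped = line.strip()
        if "|" not in stripped:
            continue  # not a table row candidate
        if not (stripped.startswith("|") and stripped.endswith("|")):
            suspects.append((lineno, line))
    return suspects


# Made-up sample: the third line is missing its outer pipes.
sample = "| a | b |\n| - | - |\n1 | 2\nplain text"
for lineno, line in find_suspect_table_rows(sample):
    print(f"line {lineno}: {line!r}")
```

Rows reported this way can be fixed or removed from the Document text before it is passed to get_nodes_from_documents.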
+1 @logan-markewich is there a way to ignore errors on this?
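Nothing in this thread shows a built-in switch for ignoring the error, but a caller-side workaround can be sketched: parse one document at a time and skip the ones that raise. The helper below is hypothetical and relies only on the get_nodes_from_documents method seen in the traceback. Catching ValueError by default is an assumption based on pandas.errors.ParserError being a ValueError subclass, and a whole document is skipped when any one of its tables fails.

```python
from typing import Any, Iterable, List, Tuple


def parse_documents_skipping_failures(
    parser: Any,
    documents: Iterable[Any],
    ignore: Tuple[type, ...] = (ValueError,),
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Parse documents one by one, collecting failures instead of raising.

    Hypothetical wrapper; `parser` is any object exposing
    get_nodes_from_documents(list_of_documents).
    """
    nodes: List[Any] = []
    failures: List[Tuple[Any, Exception]] = []
    for doc in documents:
        try:
            nodes.extend(parser.get_nodes_from_documents([doc]))
        except ignore as exc:
            # Keep the failing document and the error for later inspection.
            failures.append((doc, exc))
    return nodes, failures
```

The returned failures list makes it possible to log which documents were skipped and repair their tables later.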
Bug Description
Error: _torch_pytree._register_pytree_node(
Traceback (most recent call last):
  File "d:\Baoxin\Program\knowledgeManager\module\b_extract copy.py", line 174, in <module>
    raw_nodes = parser.get_nodes_from_documents(md_doc)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\interface.py", line 129, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\base_element.py", line 120, in _parse_nodes
    nodes = self.get_nodes_from_node(node)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 25, in get_nodes_from_node
    elements = self.extract_elements(
    ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 168, in extract_elements
    should_keep = all(tf(element) for tf in table_filters)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 168, in <genexpr>
    should_keep = all(tf(element) for tf in table_filters)
    ^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\markdown_element.py", line 220, in filter_table
    table_df = md_to_df(table_element.element)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\llama_index\core\node_parser\relational\utils.py", line 26, in md_to_df
    return pd.read_csv(StringIO(md_str))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\software\anaconda3\envs\llm_env\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 574, in pandas._libs.parsers.TextReader.__cinit__
  File "parsers.pyx", line 663, in pandas._libs.parsers.TextReader._get_header
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 0
Version
0.10.18
Steps to Reproduce
from llama_index.core.node_parser import MarkdownElementNodeParser

parser = MarkdownElementNodeParser(
    num_workers=3,
    include_metadata=True,
)
raw_nodes = parser.get_nodes_from_documents(md_doc)
Relevant Logs/Tracebacks
No response