run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Using LlamaParseJsonNodeParser causes error for attached PDF document #12314

Closed: wb200 closed this issue 6 months ago

wb200 commented 6 months ago

Bug Description

Node creation using LlamaParseJsonNodeParser triggered the following error (the full traceback is included under Relevant Logs/Tracebacks below):

TypeError: can only concatenate str (not "list") to str


Version

llama-index 0.10.24 / llama-parse 0.3.9

Steps to Reproduce

The PDF can be downloaded at this link: https://www.metalsx.com.au/wp-content/uploads/2024/02/FY23-MLX-Appendix-4E-and-Annual-Financial-Report_29-February-2024.pdf

The script below works for a whole list of PDFs; only the PDF linked above triggers the error.

# Note: llama_cloud_api_key, anyscale_api_key, pdf_files and pdf_dir are assumed
# to be defined earlier in the notebook.
import json

from llama_parse import LlamaParse
from llama_index.core import Document
from llama_index.core.node_parser import LlamaParseJsonNodeParser
from llama_index.llms.anyscale import Anyscale

parser = LlamaParse(api_key=llama_cloud_api_key,
                    result_type="markdown",
                    num_workers=8,
                    verbose=True,
                    # parsing_instruction=parsing_instruction
                    )

json_doc_list = parser.get_json_result(pdf_files)

# Save the JSON output from LlamaParse and create LlamaIndex Document objects
doc_list = []
for json_doc in json_doc_list:
    doc_name = json_doc['file_path']
    doc_name = doc_name.split("/")[-1]
    doc_name = doc_name.split(".")[0]
    print(f"Processing document: {doc_name} ...")
    doc = []
    for j, json_page in enumerate(json_doc['pages']):
        print(f" Processing page: {j:04d} ...")
        doc_page = Document(text=json_page.get('text'), metadata=json_page)
        doc_page.metadata['source_document'] = doc_name
        doc.append(doc_page)
        with open(f"{pdf_dir}/{doc_name}_page_{j:04d}.json", "w") as file:
            json.dump(json_page, file, indent=4)
    doc_list.append(doc)

node_parser = LlamaParseJsonNodeParser(llm=Anyscale(api_key=anyscale_api_key, model="mistralai/Mixtral-8x7B-Instruct-v0.1"), num_workers=16, include_metadata=True)

doc = doc_list[5]
print(doc[0].metadata['source_document'])

nodes = node_parser.get_nodes_from_documents(doc, show_progress=True)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

Relevant Logs/Tracebacks

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[40], line 5
      3 doc = doc_list[3]
      4 print(doc[0].metadata['source_document'])
----> 5 nodes = node_parser.get_nodes_from_documents(doc, show_progress=True)
      6 base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

File ~/anaconda3/envs/llm/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py:76, in NodeParser.get_nodes_from_documents(self, documents, show_progress, **kwargs)
     71 doc_id_to_document = {doc.id_: doc for doc in documents}
     73 with self.callback_manager.event(
     74     CBEventType.NODE_PARSING, payload={EventPayload.DOCUMENTS: documents}
     75 ) as event:
---> 76     nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
     78     for i, node in enumerate(nodes):
     79         if (
     80             node.ref_doc_id is not None
     81             and node.ref_doc_id in doc_id_to_document
     82         ):

File ~/anaconda3/envs/llm/lib/python3.12/site-packages/llama_index/core/node_parser/relational/base_element.py:120, in BaseElementNodeParser._parse_nodes(self, nodes, show_progress, **kwargs)
    117 nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")
    119 for node in nodes_with_progress:
--> 120     nodes = self.get_nodes_from_node(node)
    121     all_nodes.extend(nodes)
...
--> 263         merged_elements[-1].element += "\n" + element.element
    264     else:
    265         merged_elements.append(element)

TypeError: can only concatenate str (not "list") to str
dosubot[bot] commented 6 months ago

Hey @wb200, great to see you back! I hope you've been doing well since our last interaction. 😊

The error you're encountering, "TypeError: can only concatenate str (not "list") to str", typically occurs when you attempt to concatenate a string with a list, which is not allowed in Python. Given the context of using the get_nodes_from_documents method on the node_parser object with LlamaParseJsonNodeParser in LlamaIndex v0.10.24 and LlamaParse v0.3.9, and the provided code snippets, there are no direct concatenation operations between a string and a list visible in the provided code.

However, the error might be originating from within the implementation of the get_nodes_from_documents method or any other method it calls, which is not visible in the provided context. If the method or any method it utilizes tries to concatenate the document's text (a string) with a list (possibly a parsed JSON object), this error would occur.

To resolve this issue without seeing the internal implementation of get_nodes_from_documents, ensure that any operation that combines the document text with other data converts lists (or any non-string data) to strings before concatenation. For example, if you have a list data and you want to concatenate it with a string text, you should convert data to a string first:

text = "some text"
data = ["data1", "data2"]
full_text = text + str(data)  # Converts list to string before concatenation

Or, if the list contains elements that should be concatenated as separate strings:

text = "some text"
data = ["data1", "data2"]
full_text = text + " ".join(data)  # Joins list elements into a single string before concatenation

Without access to the internal logic of get_nodes_from_documents, this is general advice on resolving the type error by ensuring proper data type handling when concatenating strings with other data types.

Regarding the get_nodes_and_objects method in the LlamaParseJsonNodeParser class, its purpose is to separate the given list of BaseNode instances into two distinct lists: one containing BaseNode instances that are not IndexNode objects, and another containing IndexNode objects along with their associated object data. This method facilitates the handling and processing of different types of nodes within a document, distinguishing between regular nodes and index nodes that reference other objects.
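
For reference, a minimal usage sketch of get_nodes_and_objects, assuming a default embedding model is configured and reusing node_parser and nodes from the script above:

from llama_index.core import VectorStoreIndex

# Split parsed nodes into plain text nodes and IndexNode objects (e.g. table references)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# Index both lists together so queries can resolve the referenced objects
index = VectorStoreIndex(nodes=base_nodes + objects)
query_engine = index.as_query_engine(similarity_top_k=5)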

The purpose of the get_base_nodes_and_mappings method in the LlamaParseJsonNodeParser class of LlamaIndex v0.10.24 is to process a list of nodes, separating them into base nodes and a mapping of index nodes to their child nodes. This method is crucial for organizing the document's structure, especially when dealing with complex documents that include both text and embedded objects like tables. It ensures that each index node is correctly associated with its child nodes, facilitating the retrieval and manipulation of document content based on its hierarchical structure.
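
As a sketch of how get_base_nodes_and_mappings is typically combined with a recursive retriever (the retriever key "vector" and the similarity_top_k value are arbitrary illustrative choices):

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Separate base nodes from the index-node -> child-node mapping
base_nodes, node_mappings = node_parser.get_base_nodes_and_mappings(nodes)

# Vector index over the base nodes only
vector_index = VectorStoreIndex(base_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=2)

# The recursive retriever resolves IndexNode references via the node mapping
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict=node_mappings,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(recursive_retriever)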

Unfortunately, I couldn't find the exact implementation of the get_nodes_from_documents method in the LlamaParseJsonNodeParser class of LlamaIndex v0.10.24, or the purpose and implementation of the LlamaParse class and its get_json_result method in LlamaParse v0.3.9, within the repository. It would be helpful if you could provide more information or context about these methods.

wb200 commented 6 months ago

Hey @dosubot, the MarkdownElementNodeParser processes the nodes from the PDF in question just fine. So the problem must lie in the implementation of the get_nodes_from_documents and get_nodes_from_node methods, specifically around line 263 of llama_parse_json_element.py (the LlamaParseJsonNodeParser class), where merged_elements[-1].element += "\n" + element.element fails because element.element unexpectedly turns out to be of type list rather than str.

I think the code needs to be adjusted to deal with this eventuality.

dosubot[bot] commented 6 months ago

The error you're encountering with TypeError: can only concatenate str (not "list") to str in the get_nodes_from_documents method of the LlamaParseJsonNodeParser class is indeed due to element.element being unexpectedly a list instead of a string. This typically happens when the PDF document being processed contains data that is interpreted as a list during the parsing process.

To address this issue, you could modify the get_nodes_from_documents method to handle cases where element.element is a list. Here's a suggested approach:

if isinstance(element.element, list):
    merged_elements[-1].element += "\n" + ' '.join(element.element)
else:
    merged_elements[-1].element += "\n" + element.element

This code snippet checks if element.element is a list. If it is, it converts the list to a string using the join method before concatenating it with the existing string. This adjustment should prevent the TypeError from occurring.

Keep in mind, this is a workaround and might not perfectly address all cases, especially if the list contains complex data structures. For a more comprehensive solution, consider revising the parsing logic to better accommodate different types of data structures found in PDF documents.
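
As an illustration of what such a revision could look like, here is a sketch; the helper name element_to_str is made up for this example and is not part of the library:

def element_to_str(content) -> str:
    # Illustrative helper (not part of llama-index): coerce arbitrary parsed
    # element content (str, list, dict, numbers, None) into a single string.
    if isinstance(content, str):
        return content
    if isinstance(content, (list, tuple)):
        return " ".join(element_to_str(item) for item in content)
    if isinstance(content, dict):
        return " ".join(f"{key}: {element_to_str(value)}" for key, value in content.items())
    return str(content)

# The merge step could then read:
# merged_elements[-1].element += "\n" + element_to_str(element.element)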

If you encounter further issues or have additional questions, feel free to reach out.

wb200 commented 6 months ago

Hey @dosubot

This slight modification has worked to parse the problematic PDF without generating any errors:

if isinstance(element.element, list):
    merged_elements[-1].element += "\n" + ' '.join(str(e) for e in element.element)
else:
    merged_elements[-1].element += "\n" + element.element
dosubot[bot] commented 6 months ago

I'm glad to hear that the modification worked for parsing the problematic PDF without generating any errors! Your adjustment to handle cases where element.element is a list by converting each element in the list to a string before joining them is a solid approach. This ensures that even if the list contains non-string types, they are appropriately converted to strings, preventing the TypeError.

This solution not only resolves the immediate issue but also enhances the robustness of the get_nodes_from_documents method in handling various data types that might be encountered in PDF documents. It's a great example of adapting code to deal with unexpected data structures gracefully.
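
To illustrate why the str(e) conversion matters, here is a minimal self-contained example; the mixed-type list is made up to mimic content that the LlamaParse JSON output might contain:

element_content = ["Revenue", 123.4, ["FY23", "FY22"]]  # hypothetical mixed-type content

# " ".join(element_content) would raise a TypeError because not every item is a str
merged = "\n" + " ".join(str(e) for e in element_content)
print(merged)  # prints a blank line, then: Revenue 123.4 ['FY23', 'FY22']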

If you have any more questions or run into further issues, feel free to reach out. Happy coding!

wb200 commented 6 months ago

Hey @dosubot, any chance that the official code base could be amended to include this fail-safe? Clearly there are PDF files out there that, after parsing through LlamaParse and the get_json_result method, throw up an error that this amendment resolves.

Please escalate this to the developer in charge, if possible. Cheers!