run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.78k stars 5.06k forks

multimodal: 3 columns passed, passed data had 5 columns #9130

Closed andysingal closed 10 months ago

andysingal commented 10 months ago

Documentation Issue Description

While working on:

import os
import pickle

if not os.path.exists("2021_nodes.pkl"):
    raw_nodes_2021 = node_parser.get_nodes_from_documents(docs_2021)
    pickle.dump(raw_nodes_2021, open("2021_nodes.pkl", "wb"))
else:
    raw_nodes_2021 = pickle.load(open("2021_nodes.pkl", "rb"))
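For reference, a defensive variant of this caching pattern closes its file handles via context managers and rebuilds when the cache is unreadable. The `load_or_build_nodes` helper below is hypothetical, not part of the notebook:

```python
import os
import pickle

CACHE = "2021_nodes.pkl"

def load_or_build_nodes(build):
    """Return cached nodes, rebuilding via `build()` when the cache is unusable."""
    if os.path.exists(CACHE):
        try:
            with open(CACHE, "rb") as f:  # context manager closes the handle
                return pickle.load(f)
        except (pickle.UnpicklingError, EOFError):
            pass  # corrupt or truncated cache: fall through and rebuild
    nodes = build()
    with open(CACHE, "wb") as f:
        pickle.dump(nodes, f)
    return nodes

# Usage with the notebook's names:
# raw_nodes_2021 = load_or_build_nodes(
#     lambda: node_parser.get_nodes_from_documents(docs_2021)
# )
```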

I get the following error:

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _finalize_columns_and_data(content, columns, dtype)
    968     try:
--> 969         columns = _validate_or_indexify_columns(contents, columns)
    970     except AssertionError as err:

12 frames
AssertionError: 3 columns passed, passed data had 5 columns

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _finalize_columns_and_data(content, columns, dtype)
    970     except AssertionError as err:
    971         # GH#26429 do not raise user-facing AssertionError
--> 972         raise ValueError(err) from err
    973 
    974     if len(contents) and contents[0].dtype == np.object_:

ValueError: 3 columns passed, passed data had 5 columns

Documentation Link

https://gpt-index.readthedocs.io/en/v0.9.3/examples/multi_modal/llava_multi_modal_tesla_10q.html

dosubot[bot] commented 10 months ago

🤖

Hello @andysingal, good to see you again! I see you have another issue to discuss. Let's dive into it!

The error message you're encountering, ValueError: 3 columns passed, passed data had 5 columns, typically occurs when you're trying to create a pandas DataFrame and the number of columns you've specified doesn't match the number of columns in the data you're passing to the DataFrame constructor.

In your case, it seems like the function node_parser.get_nodes_from_documents(docs_2021) is returning a data structure with 5 columns, but somewhere in your code (or in the LlamaIndex codebase), a DataFrame is being created with only 3 columns.
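The pandas side of this can be reproduced standalone, independent of LlamaIndex. This sketch only illustrates the DataFrame constructor's column check, not where LlamaIndex triggers it:

```python
import pandas as pd

# Each row has 5 values, but only 3 column names are supplied,
# so the DataFrame constructor raises the same ValueError.
rows = [[1, 2, 3, 4, 5]]
try:
    pd.DataFrame(rows, columns=["a", "b", "c"])
except ValueError as err:
    print(err)  # -> 3 columns passed, passed data had 5 columns
```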

Unfortunately, without more information about the node_parser.get_nodes_from_documents(docs_2021) function and how it's implemented in the LlamaIndex codebase, it's difficult to provide a more specific solution.

From the context provided, the get_nodes_from_documents function in the HierarchicalNodeParser class is used to parse a sequence of documents into a list of nodes. However, it's not clear how this data is structured and whether it's directly related to the error you're seeing.

If you could provide more details about the node_parser.get_nodes_from_documents(docs_2021) function and the data it's returning, it would be easier to diagnose the issue. Specifically, it would be helpful to know the structure of docs_2021 and the output of node_parser.get_nodes_from_documents(docs_2021).

Also, if you could point to the specific line in the LlamaIndex codebase where the DataFrame is being created, that would also be very helpful. This error is likely occurring because the DataFrame constructor is expecting data with a different structure than what's being provided.

Sources

andysingal commented 10 months ago


Thanks for your response; the code is from your notebook: https://gpt-index.readthedocs.io/en/v0.9.3/examples/multi_modal/llava_multi_modal_tesla_10q.html

hatianzhang commented 10 months ago

@andysingal I cannot reproduce. Do you already have 2021_nodes.pkl locally or you are generating it?
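This question matters because the notebook's cache check only regenerates when the file is absent, so a leftover 2021_nodes.pkl from an earlier run would be reloaded silently. A minimal standalone illustration of that behavior:

```python
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "2021_nodes.pkl")
    with open(path, "wb") as f:
        pickle.dump(["stale"], f)  # simulate a leftover cache file

    # Same check as the notebook: only rebuild when the file is missing.
    if not os.path.exists(path):
        nodes = ["fresh"]  # this regeneration branch never runs here
    else:
        with open(path, "rb") as f:
            nodes = pickle.load(f)  # the stale cache wins

print(nodes)  # -> ['stale']
```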

andysingal commented 10 months ago

> @andysingal I cannot reproduce. Do you already have 2021_nodes.pkl locally or you are generating it?

Just running your Colab notebook; I am generating 2021_nodes.pkl.

hatianzhang commented 10 months ago

I see. I tried to reproduce the error by running

raw_nodes_2021 = node_parser.get_nodes_from_documents(docs_2021)
pickle.dump(raw_nodes_2021, open("2021_nodes.pkl", "wb"))

but got a different issue: https://github.com/run-llama/llama_index/pull/9134. Let us see whether they are correlated. But from your error log, it seems to be a different issue.

igygi commented 10 months ago

I'm having the same issue when running this code in your demo notebook (the tutorial notebook linked in your demo video):

import os
import pickle

if not os.path.exists("2021_nodes.pkl"):
    raw_nodes_2021 = node_parser.get_nodes_from_documents(docs_2021)
    pickle.dump(raw_nodes_2021, open("2021_nodes.pkl", "wb"))
else:
    raw_nodes_2021 = pickle.load(open("2021_nodes.pkl", "rb"))

Have you found a resolution for this?

hatianzhang commented 10 months ago

@igygi could you rebase to the latest main branch and try again? I think those 2 issues should have been fixed.

hexapode commented 10 months ago

I ran into the same issue with the latest version of main. I implemented a patch/PR that should fix it: https://github.com/run-llama/llama_index/pull/9249