run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: Problem to run index = VectorStoreIndex.from_documents(documents) #14492

Open zhouhao27 opened 3 months ago

zhouhao27 commented 3 months ago

Bug Description

I get TypeError: 'NoneType' object is not iterable when I run index = VectorStoreIndex.from_documents(documents).

Version

Latest version

Steps to Reproduce

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

The documents have content when I print them out.

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "/Users/zhouhao/Projects/AI/AI-full-stack/Lecture-Notes/07-llamaindex/run.py", line 11, in <module>
    index = VectorStoreIndex.from_documents(documents)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 145, in from_documents
    return cls(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/vector_store/base.py", line 75, in __init__
    super().__init__(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/base.py", line 94, in __init__
    index_struct = self.build_index_from_nodes(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/vector_store/base.py", line 308, in build_index_from_nodes
    return self._build_index_from_nodes(nodes, **insert_kwargs)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/vector_store/base.py", line 280, in _build_index_from_nodes
    self._add_nodes_to_index(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/vector_store/base.py", line 233, in _add_nodes_to_index
    nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/vector_store/base.py", line 141, in _get_node_with_embedding
    id_to_embed_map = embed_nodes(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/indices/utils.py", line 138, in embed_nodes
    new_embeddings = embed_model.get_text_embedding_batch(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py", line 230, in wrapper
    result = func(*args, **kwargs)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/core/base/embeddings/base.py", line 332, in get_text_embedding_batch
    embeddings = self._get_text_embeddings(cur_batch)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/embeddings/openai/base.py", line 429, in _get_text_embeddings
    return get_embeddings(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/llama_index/embeddings/openai/base.py", line 180, in get_embeddings
    data = client.embeddings.create(input=list_of_text, model=engine, **kwargs).data
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/openai/resources/embeddings.py", line 114, in create
    return self._post(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/openai/_base_client.py", line 1250, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/openai/_base_client.py", line 931, in request
    return self._request(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/openai/_base_client.py", line 1032, in _request
    return self._process_response(
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/openai/_base_client.py", line 1126, in _process_response
    return api_response.parse()
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/openai/_response.py", line 313, in parse
    parsed = self._options.post_parser(parsed)
  File "/opt/miniconda3/envs/llamaindex/lib/python3.10/site-packages/openai/resources/embeddings.py", line 102, in parser
    for embedding in obj.data:
TypeError: 'NoneType' object is not iterable
dosubot[bot] commented 3 months ago

Hey there, @zhouhao27! I'm here to help you out with any bugs or questions you have. I can also guide you through the process of becoming a contributor. Let's tackle this issue together!

To resolve the TypeError: 'NoneType' object is not iterable when creating an index using VectorStoreIndex.from_documents(documents), ensure that the documents variable is not None and is an iterable (like a list) of document objects. Here is an example of how to properly load documents and create the index:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Ensure documents is not None and is a list
if documents is None:
    raise ValueError("No documents were loaded. Please check the data path and loader configuration.")
if not isinstance(documents, list):
    raise TypeError("Documents should be a list of document objects.")

# Create the index
index = VectorStoreIndex.from_documents(documents)

Make sure that documents is correctly loaded and is a list of document objects before passing it to VectorStoreIndex.from_documents() [1][2][3].

To continue talking to Dosu, mention @dosu.

SID262000 commented 3 months ago

@zhouhao27 Can you please share your "./data" content? This will help us debug faster by getting to the root cause. Thanks!

zhouhao27 commented 3 months ago

> To resolve the TypeError: 'NoneType' object is not iterable when creating an index using VectorStoreIndex.from_documents(documents), ensure that the documents variable is not None and is an iterable (like a list) of document objects. [...]

The first thing I checked was the documents. I printed them out along with their type: they have content, and they are a list.

After I added your code, none of the errors you added was raised.

Inside my ./data folder there are a .md file and a .pdf file.

dosubot[bot] commented 3 months ago

If the documents variable is confirmed to be a list and contains content, the issue might be related to how the VectorStoreIndex handles different document formats like .md and .pdf.

There are known issues with the VectorStoreIndex class when handling different types of document formats. Specifically, there is a bug in version 0.6 where, if documents are added to the vector store using the insert method, no document IDs get written to the JSON file when the storage context is persisted. However, using the VectorStoreIndex.from_documents method works fine to save the index [1].

Additionally, ensure that the embed_nodes function is not returning None due to the embedding model failing to generate embeddings for the provided texts. This could happen if the embedding model encounters issues with the content of the documents [2].
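
To rule this out, you can call the embedding model directly, outside of any index (a minimal sketch, assuming the default OpenAIEmbedding model that llama-index falls back to when no embed model is configured):

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()

# A non-empty vector here confirms the embedding call itself works.
embedding = embed_model.get_text_embedding("hello world")
print(len(embedding))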

To further debug, you can try the following steps:

  1. Check Document Content: Ensure that the content of the .md and .pdf files is properly read and processed.
  2. Validate Embeddings: Verify that the embedding model is correctly generating embeddings for the document content.

Here is an updated example that includes additional checks:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Ensure documents is not None and is a list
if documents is None:
    raise ValueError("No documents were loaded. Please check the data path and loader configuration.")
if not isinstance(documents, list):
    raise TypeError("Documents should be a list of document objects.")

# Print document contents for debugging
for doc in documents:
    print(f"Document ID: {doc.id}, Content: {doc.get_content()}")

# Create the index
index = VectorStoreIndex.from_documents(documents)

If the issue persists, consider isolating the problematic document by loading and indexing each document individually to identify if a specific file format or content is causing the error.
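
For example, a sketch reusing the documents list and imports from the snippet above:

for i, doc in enumerate(documents):
    try:
        # Index one document at a time to find the file that triggers the error.
        VectorStoreIndex.from_documents([doc])
        print(f"Document {i} indexed OK")
    except TypeError as e:
        print(f"Document {i} failed: {e}")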

To continue talking to Dosu, mention @dosu.

zhouhao27 commented 3 months ago
for doc in documents:
    print(f"Document ID: {doc.id}, Content: {doc.get_content()}")

I found that doc doesn't have an id attribute; instead it has doc_id. get_content() returns a lot of text, which looks correct. Is doc_id the cause of the issue?

It also has a field id_, which is the same as doc_id.
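
As far as I know, llama-index documents store their identifier in the id_ field and expose doc_id as an alias property; there is no .id attribute, so the attribute name should not be related to the embeddings error. The debug print can be adjusted accordingly (a minimal sketch):

for doc in documents:
    # doc.doc_id and doc.id_ refer to the same identifier.
    print(f"Document ID: {doc.doc_id}, Content length: {len(doc.get_content())}")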

logan-markewich commented 3 months ago

This is happening inside the OpenAI client; I don't think it's really related to llama-index. Did you set an API key? Did you change the base URL or something?
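
One quick way to check both is a minimal sketch that bypasses llama-index and uses the openai Python SDK (v1-style client) directly, assuming the text-embedding-ada-002 model that llama-index defaults to:

import os
from openai import OpenAI

# The SDK reads OPENAI_API_KEY (and OPENAI_BASE_URL, if set) from the environment.
print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))
print("OPENAI_BASE_URL:", os.environ.get("OPENAI_BASE_URL"))

client = OpenAI()
resp = client.embeddings.create(input="hello world", model="text-embedding-ada-002")
print(len(resp.data[0].embedding))  # 1536 dimensions for ada-002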

zhouhao27 commented 3 months ago

> This is happening inside the OpenAI client; I don't think it's really related to llama-index. Did you set an API key? Did you change the base URL or something?

I don't think so. If it were an API key issue, I would get a different error. I'm able to access OpenAI with API calls without any issue.
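
One difference worth checking: llama-index sends a batch (a list of strings) to client.embeddings.create, as the get_embeddings frame in the traceback shows, so replaying the batched call can still be revealing even when single requests succeed (a sketch, same assumptions as above):

from openai import OpenAI

client = OpenAI()

# Replicate the batched request llama-index issues during indexing.
resp = client.embeddings.create(
    input=["first chunk of text", "second chunk of text"],
    model="text-embedding-ada-002",
)
print(len(resp.data))  # expect 2; a malformed server response raises inside create()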