ray-project / langchain-ray

Examples on how to use LangChain and Ray
Apache License 2.0

IndexError: index 0 is out of bounds for axis 0 with size 0 #8

Open skuma307 opened 1 year ago

skuma307 commented 1 year ago

Hi, thanks for the great work in the open-source space. I am facing the below error:

```
index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0
```

The FAISS index is empty. There are no embeddings?

Can you help me debug this? I really appreciate any help you can provide.

kamil-kaczmarek commented 1 year ago

Hi @skuma307 thanks for reaching out.

Let me ask two quick questions:

  1. Can you point me to the code in the example?
  2. Please paste the full stack trace for better context.
skuma307 commented 1 year ago

Thanks for your reply @kamil-kaczmarek! I am using the code base below:

```python
import time

import numpy as np
import ray
from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from embeddings import LocalHuggingFaceEmbeddings

# To download the files locally for processing, here's the command line:
# wget -e robots=off --recursive --no-clobber --page-requisites --html-extension \
#     --convert-links --restrict-file-names=windows \
#     --domains docs.ray.io --no-parent https://docs.ray.io/en/master/

FAISS_INDEX_PATH = "faiss_index_fast"
db_shards = 8

loader = ReadTheDocsLoader("docs.ray.io/en/master/")

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=300,
    chunk_overlap=20,
    length_function=len,
)


@ray.remote(num_gpus=1)
def process_shard(shard):
    print(f"Starting process_shard of {len(shard)} chunks.")
    st = time.time()
    embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
    result = FAISS.from_documents(shard, embeddings)
    et = time.time() - st
    print(f"Shard completed in {et} seconds.")
    return result


# Stage one: read all the docs, split them into chunks.
st = time.time()
print("Loading documents ...")
docs = loader.load()

# Theoretically, we could use Ray to accelerate this, but it's fast enough as is.
chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs], metadatas=[doc.metadata for doc in docs]
)
et = time.time() - st
print(f"Time taken: {et} seconds. {len(chunks)} chunks generated")

# Stage two: embed the docs.
print(f"Loading chunks into vector store ... using {db_shards} shards")
st = time.time()
shards = np.array_split(chunks, db_shards)
futures = [process_shard.remote(shards[i]) for i in range(db_shards)]
results = ray.get(futures)
et = time.time() - st
print(f"Shard processing complete. Time taken: {et} seconds.")

st = time.time()
print("Merging shards ...")
# Straight serial merge of others into results[0]
db = results[0]
for i in range(1, db_shards):
    db.merge_from(results[i])
et = time.time() - st
print(f"Merged in {et} seconds.")

st = time.time()
print("Saving faiss index")
db.save_local(FAISS_INDEX_PATH)
et = time.time() - st
print(f"Saved in: {et} seconds.")
```

I have created a virtual env on Python 3.9 on Windows.
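A side note on the script above (an observation, not something reported in this thread): even when documents do load, `np.array_split` produces empty shards whenever there are fewer chunks than `db_shards`, and an empty shard fails inside `FAISS.from_documents` with this same IndexError. A minimal sketch of the splitting behavior:

```python
import numpy as np

chunks = list(range(3))  # pretend the splitter produced only 3 chunks
db_shards = 8

shards = np.array_split(chunks, db_shards)

# np.array_split pads out the requested number of sections with
# empty arrays, so 5 of the 8 shards here contain nothing to embed.
empty = [s for s in shards if len(s) == 0]
print(len(shards), len(empty))  # 8 5
```

A cheap defense is `db_shards = min(db_shards, len(chunks))` before splitting, so no worker ever receives an empty shard.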

kamil-kaczmarek commented 1 year ago

Hi, you need to make sure that you build the DB first. Have a look at this script: https://github.com/ray-project/langchain-ray/blob/main/open_source_LLM_retrieval_qa/build_vector_store.py

skuma307 commented 1 year ago

Thanks for your reply, but I am already using the same code I pasted above. Am I missing anything? I would appreciate your help. @kamil-kaczmarek

kamil-kaczmarek commented 1 year ago

@skuma307 you need to create the embeddings store first. Please check these instructions for more details.

noperator commented 1 year ago

@kamil-kaczmarek , when I run python build_vector_store.py as part of the step "Building the vector store index," I get the same error described above:

```
Traceback (most recent call last):
  File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 64, in <module>
    results = ray.get(futures)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2521, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::process_shard()
  File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 42, in process_shard
    result = FAISS.from_documents(shard, embeddings)
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/base.py", line 272, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 385, in from_texts
    return cls.__from(
  File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 347, in __from
    index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0
```

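The failing line reads the embedding dimension via `len(embeddings[0])`; when a shard contributes zero documents, the embeddings array has size 0 and indexing it raises exactly this error. A standalone sketch of the failure mode (illustration only, using a plain NumPy array rather than the real langchain internals):

```python
import numpy as np

# An embeddings array for a shard that produced no documents.
embeddings = np.empty((0,))

try:
    # Mirrors the failing line: faiss.IndexFlatL2(len(embeddings[0]))
    dim = len(embeddings[0])
except IndexError as err:
    message = str(err)
    print(message)  # index 0 is out of bounds for axis 0 with size 0
```

So the IndexError is only a symptom; the real question is why the loader or splitter produced nothing to embed.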
noperator commented 1 year ago

Following the guidance in https://github.com/hwchase17/chat-langchain/issues/26#issuecomment-1497117115, I fixed this error by:

```diff
diff --git a/open_source_LLM_retrieval_qa/build_vector_store.py b/open_source_LLM_retrieval_qa/build_vector_store.py
index e530b54..9a519a8 100644
--- a/open_source_LLM_retrieval_qa/build_vector_store.py
+++ b/open_source_LLM_retrieval_qa/build_vector_store.py
@@ -4,7 +4,7 @@ from typing import List
 
 import numpy as np
 import ray
-from langchain.document_loaders import ReadTheDocsLoader
+from langchain.document_loaders import UnstructuredURLLoader
 from langchain.embeddings.base import Embeddings
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import FAISS
@@ -21,7 +21,7 @@ FAISS_INDEX_PATH = "faiss_index_fast"
 db_shards = 8
 ray.init()
 
-loader = ReadTheDocsLoader("docs.ray.io/en/master/")
+loader = UnstructuredURLLoader(urls=["https://docs.ray.io/en/master/"])
 
 text_splitter = RecursiveCharacterTextSplitter(
     # Set a really small chunk size, just to show.
```
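Whichever loader is used, a cheap guard right after `loader.load()` would surface the real problem (an empty document set) instead of the opaque IndexError deep inside FAISS. A sketch with a hypothetical `ensure_documents` helper (not part of the repo):

```python
def ensure_documents(docs, source):
    """Fail fast with a readable error if a loader returned nothing."""
    if not docs:
        raise ValueError(
            f"Loader returned 0 documents from {source!r}; "
            "check the path/URL before building the FAISS index."
        )
    return docs

# Hypothetical usage in the script:
#     docs = ensure_documents(loader.load(), "docs.ray.io/en/master/")
print(len(ensure_documents(["doc one", "doc two"], "example")))  # 2
```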
bharaniabhishek123 commented 1 year ago

I see a lot of users following the tutorial and getting the same error: `IndexError: index 0 is out of bounds for axis 0 with size 0`. The solution does require creating the embeddings store first. Please check these instructions for more details.

It would be better to move the requirements.txt file from /langchain-ray/open_source_LLM_retrieval_qa/requirements.txt one level up and put the setup instructions at the repo root rather than inside retrieval_qa; otherwise a lot of new users will hit this same issue.

Also, please include documentation links on how to spin up a Ray cluster on each cloud platform, whether via cluster.yaml or some other way. Simply writing that it's a hefty setup ("This demo requires a bit of a hefty setup. It requires one machine with a 24GB GPU (e.g. an AWS g5.xlarge), or a machine with 2 GPUs (minimum 16GB each), or a Ray cluster with at least 2 GPUs available.") does not guide a user on how to do it.