Open skuma307 opened 1 year ago
Hi @skuma307, thanks for reaching out.
Let me ask two quick questions:
Thanks for your reply, @kamil-kaczmarek! I am using the code below:

```python
import time

import numpy as np
import ray
from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

from embeddings import LocalHuggingFaceEmbeddings

FAISS_INDEX_PATH = "faiss_index_fast"
db_shards = 8

loader = ReadTheDocsLoader("docs.ray.io/en/master/")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=20,
    length_function=len,
)


@ray.remote(num_gpus=1)
def process_shard(shard):
    print(f"Starting process_shard of {len(shard)} chunks.")
    st = time.time()
    embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
    result = FAISS.from_documents(shard, embeddings)
    et = time.time() - st
    print(f"Shard completed in {et} seconds.")
    return result


st = time.time()
print("Loading documents ...")
docs = loader.load()

chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs],
    metadatas=[doc.metadata for doc in docs],
)
et = time.time() - st
print(f"Time taken: {et} seconds. {len(chunks)} chunks generated")

print(f"Loading chunks into vector store ... using {db_shards} shards")
st = time.time()
shards = np.array_split(chunks, db_shards)
futures = [process_shard.remote(shards[i]) for i in range(db_shards)]
results = ray.get(futures)
et = time.time() - st
print(f"Shard processing complete. Time taken: {et} seconds.")

st = time.time()
print("Merging shards ...")

db = results[0]
for i in range(1, db_shards):
    db.merge_from(results[i])
et = time.time() - st
print(f"Merged in {et} seconds.")

st = time.time()
print("Saving faiss index")
db.save_local(FAISS_INDEX_PATH)
et = time.time() - st
print(f"Saved in: {et} seconds.")
```

I have created a virtual env on Python 3.9 on Windows.
Hi, you need to make sure that you build the DB first. Have a look at this script: https://github.com/ray-project/langchain-ray/blob/main/open_source_LLM_retrieval_qa/build_vector_store.py
Thanks for your reply, but I am already using that same code, as pasted above. Am I missing anything? I would appreciate your help. @kamil-kaczmarek
@skuma307 you need to create the embeddings store first. Please check these instructions for more details.
@kamil-kaczmarek, when I run `python build_vector_store.py` as part of the step "Building the vector store index," I get the same error described above:
Traceback (most recent call last):
File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 64, in <module>
results = ray.get(futures)
File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2521, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::process_shard()
File "/root/langchain-ray/open_source_LLM_retrieval_qa/build_vector_store.py", line 42, in process_shard
result = FAISS.from_documents(shard, embeddings)
File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/base.py", line 272, in from_documents
return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 385, in from_texts
return cls.__from(
File "/usr/local/lib/python3.9/dist-packages/langchain/vectorstores/faiss.py", line 347, in __from
index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0
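
The last frame shows `embeddings[0]` being indexed on an empty list, which is what `FAISS.from_documents` ends up with when a shard contains no documents. Below is a minimal sketch of a guard that could run before the shards are dispatched. The helper name `split_nonempty_shards` is hypothetical; it is not part of the repository, LangChain, or Ray.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_nonempty_shards(chunks: Sequence[T], num_shards: int) -> List[List[T]]:
    """Split chunks into at most num_shards roughly equal, non-empty shards.

    Hypothetical helper: raises early with a readable message instead of
    letting an empty shard reach the vector store.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if len(chunks) == 0:
        raise ValueError(
            "No chunks to index: the document loader returned nothing. "
            "Check the loader's input path or URL."
        )
    # Never create more shards than there are chunks, so no shard is empty.
    num_shards = min(num_shards, len(chunks))
    base, extra = divmod(len(chunks), num_shards)
    shards: List[List[T]] = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional chunk.
        size = base + (1 if i < extra else 0)
        shards.append(list(chunks[start:start + size]))
        start += size
    return shards
```

In the script from this thread, a helper like this would stand in for `np.array_split(chunks, db_shards)`, with the number of remote tasks taken from `len(shards)` rather than from `db_shards`.
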
Following the guidance in https://github.com/hwchase17/chat-langchain/issues/26#issuecomment-1497117115, I fixed this error by:
- switching to `UnstructuredURLLoader`
- installing the `libmagic-dev` package
- adding `https://` to the docs URL

diff --git a/open_source_LLM_retrieval_qa/build_vector_store.py b/open_source_LLM_retrieval_qa/build_vector_store.py
index e530b54..9a519a8 100644
--- a/open_source_LLM_retrieval_qa/build_vector_store.py
+++ b/open_source_LLM_retrieval_qa/build_vector_store.py
@@ -4,7 +4,7 @@ from typing import List
import numpy as np
import ray
-from langchain.document_loaders import ReadTheDocsLoader
+from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings.base import Embeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
@@ -21,7 +21,7 @@ FAISS_INDEX_PATH = "faiss_index_fast"
db_shards = 8
ray.init()
-loader = ReadTheDocsLoader("docs.ray.io/en/master/")
+loader = UnstructuredURLLoader(urls=["https://docs.ray.io/en/master/"])
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
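
The second hunk passes a full URL, scheme included, to `UnstructuredURLLoader`. Below is a small sketch of normalizing such strings before they reach the loader. The helper `ensure_https` is hypothetical and not part of LangChain or the repository.

```python
from urllib.parse import urlparse


def ensure_https(url: str) -> str:
    """Return url unchanged if it already has an http(s) scheme, else prepend https://."""
    if urlparse(url).scheme in ("http", "https"):
        return url
    # Strip leading slashes so inputs like "//host/path" do not end up with four slashes.
    return "https://" + url.lstrip("/")


# The bare host/path string from the original script gains a scheme.
docs_url = ensure_https("docs.ray.io/en/master/")
```
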
I see that a lot of users following the tutorial are getting the same error:
IndexError: index 0 is out of bounds for axis 0 with size 0
The solution does require creating the embeddings store first. Please check these instructions for more details.
It would be better to move the requirements.txt file, currently at /langchain-ray/open_source_LLM_retrieval_qa/requirements.txt, one level up and to put the instructions at the repo level rather than inside retrieval_qa; a lot of new users will face the same issue.
Also, please include documentation links on how to spin up a Ray cluster on all cloud platforms, whether through cluster.yaml or any other way. Stating that the setup is hefty does not tell a user how to do it:
This demo requires a bit of a hefty setup. It requires one machine with a 24GB GPU (eg. an AWS g5.xlarge) or a machine with 2 GPUs (minimum 16GB each) or a Ray cluster with at least 2 GPUs available.
Hi, thanks for the great work in the open-source space. I am facing the below error:
index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: index 0 is out of bounds for axis 0 with size 0
The faiss index is empty. There are no embeddings?
Can you help me debug this? I really appreciate any help you can provide.
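
One way to narrow this down is to confirm that the loader has something to read before any embeddings are computed. The sketch below assumes, as LangChain documents for `ReadTheDocsLoader`, that the argument is a local directory of already downloaded HTML pages rather than a URL. `count_html_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def count_html_files(docs_dir: str) -> int:
    """Count .html files under docs_dir, recursively; return 0 if the directory is missing."""
    root = Path(docs_dir)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob("*.html") if p.is_file())


if __name__ == "__main__":
    # Path taken from the script in this thread; adjust to your own layout.
    n = count_html_files("docs.ray.io/en/master/")
    if n == 0:
        print("No HTML files found; the loader would likely return no documents.")
    else:
        print(f"Found {n} HTML files.")
```

If the count is zero, the chunk list will be empty as well, and every shard passed to `FAISS.from_documents` will have nothing to embed.
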