huggingface / cookbook

Open-source AI cookbook
https://huggingface.co/learn/cookbook
Apache License 2.0

"RAG with unstructured data", uses `documents` instead of `docs` / unused `docs` variable? #183

Open xarical opened 1 month ago

xarical commented 1 month ago

https://github.com/huggingface/cookbook/blob/main/notebooks/en/rag_with_unstructured_data.ipynb

Quote: "Setting up the retriever

This example uses ChromaDB as a vector store and BAAI/bge-base-en-v1.5 embeddings model, feel free to use any other vector store.

from langchain_community.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

from langchain.vectorstores import utils as chromautils

# ChromaDB doesn't support complex metadata, e.g. lists, so we drop it here.
# If you're using a different vector store, you may not need to do this
docs = chromautils.filter_complex_metadata(documents)

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

"

Should `documents` be replaced with `docs` on the second-to-last line, i.e. `vectorstore = Chroma.from_documents(docs, embeddings)`? Or is the current code intentional?

I'm not familiar with Chroma (which is why I was following this tutorial), but while going through it I wondered what the `docs` variable was for, since it is never used after it is assigned. It looks like `docs` is a filtered version of `documents`, in which case I would expect it to be the argument passed to `from_documents` (please correct me if that is not the case).

Either way, whether `docs` is actually used somehow (in which case, my bad) or `documents` is meant to be replaced with `docs` as I suspect, the tutorial doesn't make this clear.
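To illustrate the concern, here is a minimal sketch in plain Python of what I assume the filtering step does. `Doc` and `drop_complex_metadata` are hypothetical stand-ins made up for this example, not LangChain's `Document` class or `filter_complex_metadata`, and the real helper may behave differently (for instance, it might modify the documents in place rather than return copies).

```python
from dataclasses import dataclass, field
from typing import Any


# Hypothetical stand-in for a document object; NOT the LangChain class.
@dataclass
class Doc:
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


# Assumption: "simple" metadata means scalar values only.
SIMPLE_TYPES = (str, bool, int, float)


def drop_complex_metadata(documents: list[Doc]) -> list[Doc]:
    """Return new Doc objects that keep only scalar metadata values."""
    return [
        Doc(
            d.page_content,
            {k: v for k, v in d.metadata.items() if isinstance(v, SIMPLE_TYPES)},
        )
        for d in documents
    ]


# One document whose metadata contains a list, which is the "complex" case.
documents = [Doc("hello", {"source": "a.pdf", "languages": ["eng"], "page": 1})]

# The filtered result lives in `docs`; `documents` is left untouched here.
docs = drop_complex_metadata(documents)
```

In this sketch only `docs` has the list-valued `languages` entry removed, so passing `documents` to the vector store would still hand it the unfiltered metadata. If the real helper works the same way, `docs` is the variable that should go into `Chroma.from_documents`.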