langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

When indexing vectorstore, if no collection, create one - 404 collection not found in Qdrant when Indexing #18068

Closed seanmavley closed 4 months ago

seanmavley commented 8 months ago


Example Code

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain.indexes import SQLRecordManager, index

load_dotenv()

loaders = {
    '.pdf': PyMuPDFLoader,
    '.txt': TextLoader
}

def create_directory_loader(file_type, directory_path):
    '''Define a function to create a DirectoryLoader for a specific file type'''
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type],
        show_progress=True,
        use_multithreading=True
    )

dirpath = os.environ.get('TEMP_DOCS_DIR')

txt_loader = create_directory_loader('.txt', dirpath)

texts = txt_loader.load()

# Concatenate every document, then collapse the non-empty lines into one string.
full_text = ''.join(paper.page_content for paper in texts)

full_text = " ".join(line for line in full_text.splitlines() if line)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2048,
    chunk_overlap=512
)

document_chunks = text_splitter.create_documents(
    [full_text], [{'source': 'education'}])

embeddings = GPT4AllEmbeddings()

collection_name = "testing_v1"

namespace = f"mydata/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)

record_manager.create_schema()

url = 'http://0.0.0.0:6333'
from qdrant_client import QdrantClient

client = QdrantClient(url)

qdrant = Qdrant(
    client=client,
    embeddings=embeddings,
    collection_name=collection_name,
)

index_stats = index(
    document_chunks,
    record_manager,
    qdrant,
    cleanup="full",
    source_id_key="source"
)

print(index_stats)

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/home/khophi/Development/myApp/llm/api/embeddings.py", line 79, in <module>
    index_stats = index(
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/langchain/indexes/_api.py", line 326, in index
    vector_store.add_documents(docs_to_index, ids=uids)
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/langchain_core/vectorstores.py", line 119, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/langchain_community/vectorstores/qdrant.py", line 181, in add_texts
    self.client.upsert(
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/qdrant_client/qdrant_client.py", line 987, in upsert
    return self._client.upsert(
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/qdrant_client/qdrant_remote.py", line 1300, in upsert
    http_result = self.openapi_client.points_api.upsert_points(
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 1439, in upsert_points
    return self._build_for_upsert_points(
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 738, in _build_for_upsert_points
    return self.api_client.request(
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 74, in request
    return self.send(request, type_)
  File "/home/khophi/Development/myApp/llm/venv/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 97, in send
    raise UnexpectedResponse.for_response(response)
qdrant_client.http.exceptions.UnexpectedResponse: Unexpected Response: 404 (Not Found)
Raw response content:
b'{"status":{"error":"Not found: Collection `testing_v1` doesn\'t exist!"},"time":0.0000653}'

Description

I'm following the indexing tutorial below, using Qdrant as the vector store:

https://python.langchain.com/docs/modules/data_connection/indexing

According to the docs:

Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.

If indexing is meant to start from a brand-new collection, then that collection will often not exist yet. In that case I'd expect index() to offer something similar to the force_recreate option of Qdrant.from_documents(...), creating the collection if it doesn't exist before proceeding. That way the record manager database and the collection both start from the same state.

As it stands now, there isn't a way to create an empty collection in Qdrant.

System Info

System Information

OS: Linux
OS Version: #1 SMP Thu Oct 5 21:02:42 UTC 2023
Python Version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

Package Information

langchain_core: 0.1.24
langchain: 0.0.350
langchain_community: 0.0.3
langsmith: 0.1.3
langchain_cli: 0.0.19
langchain_experimental: 0.0.47
langchain_mistralai: 0.0.4
langchain_openai: 0.0.6
langchainhub: 0.1.14
langgraph: 0.0.24
langserve: 0.0.36

skozlovf commented 8 months ago

@seanmavley

As it stands now, there isn't a way to create an empty collection in Qdrant.

Try this:

qdrant = Qdrant.construct_instance(["fake"], embeddings, collection_name="testing_v1", url="http://0.0.0.0:6333")
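
Another option might be to create the collection explicitly through qdrant-client before calling index(). The following is only a minimal sketch: ensure_collection is a hypothetical helper name, not part of LangChain or qdrant-client, and it assumes the client exposes get_collections() and create_collection() as qdrant-client does. The vector size and the cosine distance in the usage comments are assumptions that have to match your embedding model.

```python
def ensure_collection(client, collection_name, vectors_config):
    """Create `collection_name` if the server does not list it yet.

    Hypothetical helper. `client` is expected to behave like
    qdrant_client.QdrantClient. Returns True if the collection was
    created, False if it already existed.
    """
    existing = {c.name for c in client.get_collections().collections}
    if collection_name in existing:
        return False
    client.create_collection(
        collection_name=collection_name,
        vectors_config=vectors_config,
    )
    return True


# Usage sketch (assumes qdrant-client is installed and the server from the
# issue is reachable; `embeddings` is the GPT4AllEmbeddings instance above):
#
#   from qdrant_client import QdrantClient, models
#
#   client = QdrantClient("http://0.0.0.0:6333")
#   size = len(embeddings.embed_query("probe"))  # derive the vector size
#   ensure_collection(
#       client,
#       "testing_v1",
#       models.VectorParams(size=size, distance=models.Distance.COSINE),
#   )
```

After the collection exists, the Qdrant(client=client, ...) wrapper and the index(...) call from the original example should be able to run against it without first embedding a placeholder text.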