langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

DuckDB: distance/similarity property not reported to documents returned by similarity_search #20969

Open jaceksan opened 2 weeks ago

jaceksan commented 2 weeks ago

Example Code

from dotenv import load_dotenv
import duckdb
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import DuckDB
from langchain_core.documents import Document
from time import time

load_dotenv()
TABLE_NAME = "embeddings"
documents = [Document(page_content="Jacek is the best software engineer in the world", metadata={"id": "1"})]

db_conn = duckdb.connect('./test.DUCKDB')

try:
    start_exists = time()
    print("Checking table exists")
    table = db_conn.table(TABLE_NAME)
    table.show()
    vector_store = DuckDB(connection=db_conn, table_name=TABLE_NAME, embedding=OpenAIEmbeddings(), vector_key="embedding")
    print(f"Table exists check took {time() - start_exists} seconds")
except Exception:
    # Looking up the table failed, so build the store from the sample documents.
    start_not_exists = time()
    print("Table does not exist, create it from documents")
    vector_store = DuckDB.from_documents(documents, connection=db_conn, table_name=TABLE_NAME, embedding=OpenAIEmbeddings(), vector_key="embedding")
    print(f"Table does not exist, took {time() - start_not_exists} seconds")

start_search = time()
query = "Who is the best software engineer in the world?"
docs = vector_store.similarity_search(query)
print(f"Search result: {docs}")
print(f"Search took {time() - start_search} seconds")

Error Message and Stack Trace (if applicable)

No response

Description

I use DuckDB as a vector store. When I call similarity_search, I expect the distance (or similarity) to be returned as part of each result document's metadata, but it is not. I discussed this in the DuckDB community and we agreed that this is a bug and that the value should be returned. I am going to fix it.
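
To make the expected behavior concrete, here is a minimal, library-free sketch of a search that attaches the score to each returned document's metadata. It is not the LangChain DuckDB implementation: the `Doc` class, the `similarity_search` helper, the in-memory `rows` list, and the `_similarity_score` key are all hypothetical names chosen for illustration, and cosine similarity is only an assumed metric.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search(query_vec, rows, k=4, similarity_key="_similarity_score"):
    """Return the k most similar documents, each carrying its score in metadata.

    `rows` is a list of (text, metadata, embedding) tuples standing in for
    the rows of an embeddings table.
    """
    scored = []
    for text, metadata, vec in rows:
        score = cosine_similarity(query_vec, vec)
        # Copy the stored metadata and add the score under an assumed key.
        scored.append(Doc(text, {**metadata, similarity_key: score}))
    scored.sort(key=lambda d: d.metadata[similarity_key], reverse=True)
    return scored[:k]


# Toy two-dimensional embeddings, made up for the example.
rows = [
    ("Jacek is the best software engineer in the world", {"id": "1"}, [1.0, 0.0]),
    ("DuckDB is an in-process database", {"id": "2"}, [0.0, 1.0]),
]
results = similarity_search([1.0, 0.0], rows, k=1)
print(results[0].metadata)
```

With the score in the metadata, a caller can filter or rank results without a second query, which is the behavior this issue asks for.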

System Info