langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

Recursive Similarity Search for Unknown `k` Value #1827

Closed joaopcm closed 10 months ago

joaopcm commented 1 year ago

Issue Summary: When performing a similarity search, users often encounter the challenge of determining the appropriate k value for retrieving N similar results. However, in scenarios where the k value is unknown and users want to retrieve all possible results, the current system falls short. This issue aims to address this limitation by introducing a new feature called Recursive Similarity Search.

Issue Details: In certain situations, such as when querying the features of a product described in a comprehensive document, it becomes difficult to determine the k value that will retrieve all features. To overcome this challenge, LangChain has developed the Recursive Similarity Search feature. This feature allows users to perform a similarity search without relying solely on the k value. Instead, the system returns all results that meet a user-defined minimum similarity percentage.

To use Recursive Similarity Search, you need to use a vector store as a retriever. The example provided demonstrates this with the LangChain library.

Example Code: The example code shows Recursive Similarity Search with the LangChain library. It demonstrates creating a vector store, configuring the retrieval parameters, and executing a query to retrieve all the features of a product.

[Image: screenshot of the example code, "ray-so-export (1)"]

Expected Outcome: Upon using Recursive Similarity Search, the system is expected to return all relevant results based on the specified minimum similarity percentage. Users will receive a response that includes the text describing the features and the source documents associated with each feature.

Additional Information:
• The Recursive Similarity Search feature utilizes a dynamic K value, which increases progressively until no more results can be found.
• The maxK value can be specified to prevent exceeding token limits.
• Users should consider the chunk size and token limitations when setting the maxK value.
• The LangChain library provides classes and methods for implementing Recursive Similarity Search, such as MemoryVectorStore, OpenAIEmbeddings, ChatOpenAI, ConversationalRetrievalQAChain, and BufferMemory.
• Refer to the code example provided for implementing Recursive Similarity Search in your application.
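
Since the original code is only available as a screenshot, here is a rough, self-contained sketch of the dynamic-k idea described above. It is not the code from the PR and not LangChain's API: the `SearchWithScore` type, the `searchUntilThreshold` function, and the `kIncrement` option are hypothetical names chosen for illustration. It also assumes the store returns results sorted from most to least similar and that a higher score means a closer match.

```typescript
// Hypothetical types for illustration only; not LangChain's actual API.
type Scored<T> = [T, number]; // [document, similarity score]
type SearchWithScore<T> = (query: string, k: number) => Promise<Scored<T>[]>;

interface ThresholdOptions {
  minSimilarity: number; // keep results scoring at or above this value
  kIncrement: number; // how much k grows on each round (assumed name)
  maxK: number; // hard cap on k, e.g. to stay within token limits
}

async function searchUntilThreshold<T>(
  search: SearchWithScore<T>,
  query: string,
  { minSimilarity, kIncrement, maxK }: ThresholdOptions
): Promise<T[]> {
  if (kIncrement < 1) throw new Error("kIncrement must be at least 1");

  let k = 0;
  let kept: Scored<T>[] = [];
  while (k < maxK) {
    k = Math.min(k + kIncrement, maxK);
    const results = await search(query, k);
    kept = results.filter(([, score]) => score >= minSimilarity);
    // Fewer than k survivors means the store ran out of documents or some
    // result fell below the threshold, so a larger k cannot add matches
    // (assuming results are sorted by descending score).
    if (kept.length < k) break;
  }
  return kept.map(([doc]) => doc);
}
```

With a real vector store, `search` would wrap whatever method returns documents together with their scores; `maxK` bounds the number of chunks that can end up in the prompt.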

joaopcm commented 1 year ago

@jacoblee93, can you take a look at this one, please? I already have an open PR for this :)

joaopcm commented 1 year ago

More context: image

chrisj74 commented 1 year ago

Hi. This looks like a great feature; thanks for suggesting it and putting in the effort to implement it. I see the PR is closed, but the feature is not in main. This would completely solve an issue for me. Any idea if/when it might land?

joaopcm commented 1 year ago

Hey, @chrisj74! Thank you for the interest.

I don't think the LangChain Team will approve and release this feature. As they mentioned here:

Refactored a bit, but am now rethinking whether this is a good idea - do any vector store providers document similarity score?

Am worried that it's bad to guide people to rely on this since providers could theoretically change this anytime, and if you really wanted to do this you could just set a high k and do the filtering yourself.
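
For reference, the alternative mentioned in the quote above (set a high k and filter the results yourself) could look roughly like this. It is a minimal sketch: `ScoredDoc` and `filterByMinScore` are hypothetical names, and it assumes the scored results have already been fetched and that a higher score means a closer match.

```typescript
// Hypothetical helper for illustration; not part of LangChain.
type ScoredDoc = [string, number]; // [document text, similarity score]

function filterByMinScore(results: ScoredDoc[], minScore: number): string[] {
  // Keep only documents whose score meets the threshold, preserving order.
  return results.filter(([, score]) => score >= minScore).map(([doc]) => doc);
}
```

The trade-off is that every query fetches the full high-k result set, even when only a few results pass the threshold.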


I'm using a custom TypeScript class I built for my app for this feature. Basically, I needed to copy the code I wrote for that PR and paste it into a brand new class inside my project to use it externally from LangChain.

I'm thinking of creating a third-party library (like a plugin) for LangChain, which would include this dynamic top_k property.

I have been using this algorithm in production, and it has significantly improved the accuracy of the AI. I would recommend trying to create your own custom class to use externally with LangChain based on the code from the PR.


By the way, do you think having a third-party library to use this algorithm alongside LangChain could be a great thing to have?

jacoblee93 commented 1 year ago

Hey folks, yeah we could merge the original PR if it would be broadly useful - my worry was that it was too specific. Will have a look today!

chrisj74 commented 1 year ago

@joaopcm thanks for the reply. I'm using this with a simple memory vector store in a Chrome extension, so I assume I can just import the class either from my own code base or from an externally installed npm package. Given how fast things move, a core feature would be preferable, then a maintained third-party library, and, if there is no other option, a local class.

In general, working with a relatively diverse and large number of docs means a lot of prep work to do a similarity search with scores and then filter down to an unknown number of "good" results. So I was looking for a more baked-in way to do this. Your PR hits the spot. @jacoblee93 hope this provides some context.

chrisj74 commented 1 year ago

Thanks both for the work on this. Have just given it a bash from 0.0.137.

Having some issues getting it to work. I am getting zero docs returned from getRelevantDocuments. I have tried to look at why, and it seems as though the similarity score is coming back as NaN. Not sure what would cause this. Any ideas much appreciated.

joaopcm commented 1 year ago

@chrisj74

Thanks both for the work on this. Have just given it a bash from 0.0.137.

Having some issues getting it to work. I am getting zero docs returned from getRelevantDocuments. I have tried to look at why, and it seems as though the similarity score is coming back as NaN. Not sure what would cause this. Any ideas much appreciated.

Please ensure that your vector store database returns the similarity scores for the query results. If it doesn't, it is not possible to use this algorithm. I am currently using it with Pinecone in production without any issues.
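
As a side note on why a NaN score leads to zero documents: in JavaScript, every comparison against NaN is false, so a minimum-score filter drops all results. The sketch below shows one way a NaN can arise, namely cosine similarity with a zero-length vector (0 / 0). It is only an illustration, not a diagnosis of the case above; the `cosineSimilarity` helper is written out here for the example, and whether a given store uses cosine similarity is an assumption.

```typescript
// Plain cosine similarity, written out for illustration.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // If either vector is all zeros, this evaluates 0 / 0, which is NaN.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const score = cosineSimilarity([0, 0, 0], [1, 2, 3]);
const isNaNScore = Number.isNaN(score);
// Any threshold comparison with NaN is false, so the result is filtered out.
const passesThreshold = score >= 0.8;
```

Logging the raw scores and checking them with `Number.isNaN` before filtering is a quick way to tell this situation apart from results that are simply below the threshold.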

joaopcm commented 10 months ago

This feature is already merged. Closing this issue.

sattarab commented 9 months ago

@joaopcm Could you give a small example of how you are using it with Pinecone?