langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Why does FAISS similarity search not fetch data with respect to alphanumeric keys like ORA-14300? #13964

Closed. ShahzaibAhmedKhan31 closed this issue 5 months ago.

ShahzaibAhmedKhan31 commented 9 months ago

Issue you'd like to raise.

Summary: The FAISS similarity search in LangChain has difficulty with alphanumeric queries that contain numeric components. The search performs well for queries like "What are resolutions to problems related to SAF," but behaves suboptimally on queries such as "Give me complete details of L2-resolution against ORA-14300," which involve alphanumeric combinations. Note that we have already successfully embedded and indexed the documents in question, which include alphanumeric keys such as "ORA-14300".

Expected Behavior: The FAISS similarity search should accurately and effectively retrieve relevant information for alpha-numeric queries, providing precise results even when numeric integers are included in the query.

Current Behavior: The search is not functioning correctly when processing alpha-numeric queries with numeric integers. It fails to accurately identify and retrieve relevant documents, leading to a suboptimal user experience.

[Screenshots comparing the "correct" and "wrong" search results]

Steps to Reproduce: Index CSV data containing both text and numerical values, and subsequently execute a query that includes an alphanumeric question.
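For concreteness, here is a minimal stand-in for the pipeline above, using plain Python and a toy character-trigram embedding instead of the real model. The CSV contents and helper names are hypothetical, and the toy cannot reproduce the dense-embedding failure itself; it only illustrates the index-and-query shape:

```python
import csv
import io
import math

# Hypothetical CSV of error codes and resolutions (stand-in for real data).
CSV_DATA = """code,resolution
ORA-14300,Partitioning key maps to a partition outside maximum permitted number
SAF-001,Restart the SAF agent and reload the message queue
"""

def trigrams(text):
    # Character trigrams of the lowercased text, as a set (binary features).
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def cosine(a, b):
    # Cosine similarity over trigram sets (binary vectors).
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# "Index" each CSV row: keep the row plus its trigram representation.
rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
index = [(row, trigrams(row["code"] + " " + row["resolution"])) for row in rows]

query = "Give me complete details of L2-resolution against ORA-14300"
q = trigrams(query)
best = max(index, key=lambda item: cosine(q, item[1]))
print(best[0]["code"])  # ORA-14300
```

Notably, a character-level scheme like this matches the literal string "ORA-14300", whereas dense sentence embeddings can wash such identifiers out; that difference is one plausible explanation for the behavior reported above.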

Additional Information: Environment: LangChain version 0.0.284

Impact: This issue affects the accuracy and reliability of the FAISS similarity search, particularly when handling alpha-numeric queries that include numeric integers. Users relying on LangChain for information retrieval may experience challenges when seeking relevant documents related to such queries.

Priority: High

Are FAISS and Redis similarity searches capable of handling such queries accurately over the index? If not, please guide me on where I should turn for better and more accurate results.

Thank you for your attention to this matter. Feel free to request additional information if needed.

Suggestion:

No response

dosubot[bot] commented 9 months ago

🤖

Hello @ShahzaibAhmedKhan31,

Great to see you again! I hope you're doing well.

Based on the information you've provided, it seems that the issue you're experiencing with the FAISS similarity search in LangChain might be due to the way the embeddings are generated and the distance strategy used for similarity search.

In LangChain, the FAISS vector store uses an embedding model to convert the input texts into vector representations (embeddings). These embeddings are then stored in the FAISS index. When a similarity search is performed, the query is converted into an embedding using the same model, and the most similar vectors in the index are retrieved according to the chosen distance strategy.

Alphanumeric queries containing numeric integers might cause problems for the following reasons:

  1. Embedding Model Limitations: A general-purpose embedding model (for example, "sentence-transformers/clip-ViT-B-32") might not accurately capture the semantic meaning of alphanumeric strings containing numeric integers, either because it was not trained on such data or because it was not designed to handle such inputs.

  2. Distance Strategy: The distance strategy used for similarity search might not be suitable for the type of data. For example, if the Euclidean distance is used, it might not accurately capture the similarity between high-dimensional vectors.
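The second point can be made concrete with a toy two-dimensional example (plain Python, not LangChain code): Euclidean distance and cosine similarity can rank the same two candidates in opposite order.

```python
import math

def l2(u, v):
    # Euclidean (L2) distance between two points.
    return math.dist(u, v)

def cos_sim(u, v):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = (1.0, 0.0)
doc_a = (2.0, 0.0)   # same direction as the query, larger magnitude
doc_b = (0.9, 0.5)   # closer in raw distance, different direction

# Euclidean distance prefers doc_b; cosine similarity prefers doc_a.
print(l2(query, doc_a), l2(query, doc_b))            # 1.0 vs ~0.51
print(cos_sim(query, doc_a), cos_sim(query, doc_b))  # 1.0 vs ~0.87
```

When building a FAISS store in LangChain, the metric is chosen at construction time (recent versions expose a distance_strategy argument on the constructors); check the documentation for the release you are on, since the available options have changed between versions.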

To improve the accuracy of the similarity search for alpha-numeric queries with numeric integers, you could consider the following approaches:

  1. Use a Different Embedding Model: If the current embedding model is not able to accurately capture the semantic meaning of alpha-numeric strings with numeric integers, you could consider using a different model that is better suited for this task. For example, a model that has been specifically trained on software logs or error messages might perform better.

  2. Use a Different Distance Strategy: If the current distance strategy is not suitable for the type of data, you could consider using a different strategy. For example, the cosine similarity might be more suitable for high-dimensional vectors.

  3. Preprocess the Data: You could consider preprocessing the data to make it more suitable for the embedding model and the distance strategy. For example, you could split the alpha-numeric strings into separate words or remove the numeric integers.
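As an illustration of the preprocessing idea in point 3, a small normalizer (a hypothetical helper, not part of LangChain) could rewrite codes like "ORA-14300" into a single stable token before both indexing and querying:

```python
import re

# Match codes like "ORA-14300": 2-6 letters, a hyphen, then 2-6 digits.
CODE_RE = re.compile(r"\b([A-Za-z]{2,6})-(\d{2,6})\b")

def normalize(text: str) -> str:
    # Join the code parts with an underscore so tokenizers keep them together,
    # then lowercase the whole string.
    return CODE_RE.sub(r"\1_\2", text).lower()

print(normalize("Give me complete details of L2-resolution against ORA-14300"))
# give me complete details of l2-resolution against ora_14300
```

Applying the same normalize step to documents at indexing time and to queries at search time keeps the two representations aligned; a further step would be to combine vector search with an exact-match filter on the extracted code.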

Please note that these are general suggestions and the best approach might depend on the specific characteristics of your data and the requirements of your application.

I hope this helps! If you have any further questions or need more clarification, please let me know.
