langchain-ai / langchain


Issue: Documents embeddings with many and similar numbers don't return good results #11331

Closed: juan-gomez-stellion closed this issue 5 months ago

juan-gomez-stellion commented 9 months ago

Issue you'd like to raise.

Hi! I am trying to build a question-answering chatbot over PDFs. These documents are a bit unusual: they contain many articles identified by long, similar numbers, for example article 1.2.2.1.3, article 1.2.3.4.5, article 2.3.4.1.3, etc.

When I ask for a specific article, the chatbot can't find the answer and returns "the article x.x.x.x.x is not in the context". I have tried several embedding techniques and vector stores, but none of them work.

Any ideas?

PS: The PDF documents are around 450 pages.

Suggestion:

No response

dosubot[bot] commented 9 months ago

🤖

Hello,

Thank you for reaching out with your issue. I understand that you're having trouble retrieving specific articles by their numbers in large PDF documents using LangChain. Here are some suggestions that might help improve the accuracy of your search:

  1. Use a different embedding model: LangChain supports DeepInfraEmbeddings for generating embeddings, and you can experiment with the different models available through Deep Infra's service by changing the model_id parameter of the DeepInfraEmbeddings class.

  2. Increase the ef parameter: The ef parameter in the HNSWSearchParams method controls the size of the dynamic candidate list used during nearest-neighbor search. Increasing it improves search accuracy at the cost of speed.

  3. Use a more suitable metric: The metric_type parameter in the IndexParams class is currently set to "L2". Depending on the nature of your data, you might want to experiment with other metrics like "Inner Product" or "Hamming".

  4. Adjust the m and efConstruction parameters: These parameters are used in the HNSWParams method when creating the HNSW index. The m parameter controls the number of bi-directional links created for every new element during index construction, and efConstruction controls the size of the dynamic candidate list used while building the index; larger values generally give a more accurate index at the cost of build time. You can experiment with both to see whether they improve your search accuracy.

  5. Use Maximal Marginal Relevance (MMR): The max_marginal_relevance_search_by_vector method can be used to reorder the search results by MMR. This helps diversify the results and might improve accuracy in some cases (see the short sketch after this list).

  6. Use OpenSearch Vector Search: LangChain also supports OpenSearch for vector search. You can adjust the parameters of the OpenSearch client, such as ef_search, ef_construction, and m in the _default_text_mapping method, to improve search accuracy. You can also experiment with different search types such as "approximate_search", "script_scoring", and "painless_scripting" in the similarity_search and similarity_search_with_score methods.

  7. Optimize the bulk ingestion of embeddings: LangChain uses a bulk ingestion method to add embeddings to the OpenSearch index. You can adjust the bulk_size parameter in the from_texts and from_embeddings methods to optimize the ingestion process. A larger bulk_size can speed up the ingestion process but might consume more memory.
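
For illustration, here is a minimal sketch of point 5, using FAISS and OpenAIEmbeddings purely as stand-ins (most LangChain vector stores expose the same MMR methods); the example texts, query, and parameter values are placeholders:

```python
# Minimal MMR sketch: FAISS and OpenAIEmbeddings are illustrative choices only.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

texts = [
    "Article 1.2.2.1.3: Suppliers must submit invoices within 30 days.",
    "Article 1.2.3.4.5: Contracts are renewed annually unless terminated.",
    "Article 2.3.4.1.3: Disputes are resolved by arbitration.",
]
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())

# fetch_k candidates are retrieved by plain similarity first,
# then k of them are re-ranked for diversity with MMR.
docs = vectorstore.max_marginal_relevance_search(
    "What does article 1.2.2.1.3 say?", k=2, fetch_k=10
)

# The same behaviour exposed as a retriever for use inside a chain:
retriever = vectorstore.as_retriever(
    search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10}
)
```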

You can find the relevant code snippets in the LangChain repository. Please note that these are just suggestions and the optimal parameters can vary depending on the specific nature of your data and use case. You might need to experiment with these parameters to find the optimal configuration.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

juan-gomez-stellion commented 9 months ago

@dosu-bot I am using OpenAIEmbeddings and a FAISS database. Is there anything I can do to resolve the problem with these tools?

dosubot[bot] commented 9 months ago

🤖

Hello,

Based on your description, it seems like the issue might be related to how the documents are being processed and indexed in your FAISS database. Here are a few suggestions that might help improve the accuracy of your search:

  1. Document Splitting: As mentioned in the context, you can use the MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter classes to split your documents based on headers and into chunks of text. This lets you retain the structure of your documents and use it for metadata filtering. However, this assumes that your PDF documents have a clear structure with headers and subheaders; if they are not structured this way, you might need to find a different way to split them.

  2. Metadata Filtering: Once your documents are split, you can use the SelfQueryRetriever class to filter based on the metadata you defined. This will allow your chatbot to return results from specific sections of your documents, improving its accuracy.

  3. Embedding and Indexing: Make sure that you are using the right embeddings for your use case. OpenAI embeddings are a good choice for general-purpose language understanding tasks, but depending on the specifics of your documents and queries you might need to experiment with alternatives. Also ensure that your FAISS database is correctly indexing these embeddings.

  4. Query Processing: Check how your queries are being processed. If your chatbot is unable to find the article, it might be because the query is not being understood correctly. You might need to preprocess your queries or use a different method to match queries to articles.

Here is a code snippet that demonstrates how you can implement these suggestions:

# Code snippet from the context
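
As a rough sketch of suggestions 1 and 3 (assuming PyPDFLoader, a RecursiveCharacterTextSplitter, OpenAIEmbeddings, and FAISS; the file name, chunk sizes, and question are placeholders you would adapt):

```python
# Rough sketch: split a large PDF into overlapping chunks, index them in FAISS
# with OpenAI embeddings, and answer questions over the index.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# 1. Load the PDF and split it into chunks small enough that each article
#    (and its number) stays together in a single chunk.
pages = PyPDFLoader("regulations.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(pages)

# 2. Embed and index the chunks.
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3. Retrieve more candidates than usual, since chunks that differ only by
#    an article number look very similar to the embedding model.
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})

# 4. Answer questions over the retrieved chunks.
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=retriever)
print(qa.run("What does article 1.2.2.1.3 say?"))
```

Similarity search alone can still confuse near-identical article numbers, so attaching the article number to each chunk's metadata during splitting and filtering on it (suggestion 2) is often the more reliable fix; note that SelfQueryRetriever only works with vector stores that have a supported query translator, so check whether yours is on that list.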

Please try these suggestions and let me know if they help. If not, could you provide more information about the structure of your PDF documents and how you are processing your queries? This will help me better understand your issue and provide more targeted advice.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

juansebm commented 9 months ago

Hey there! I'm really interested in this issue. I ran into the same problem over a month ago and couldn't solve it at all. Did you find any other way to make the embeddings work better?

dosubot[bot] commented 5 months ago

Hi, @juan-gomez-stellion,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you are building a question-answering chatbot that fails to find answers when asked for specific articles in PDFs with similar numbering. You've tried different embedding techniques and vector stores, but none have worked so far. Suggestions have been provided to improve search accuracy, including using a different embedding model, adjusting index and search parameters, and working with OpenAIEmbeddings and a FAISS database.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation.