langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Embedding very sensitive to punctuation #14346

Closed — yasmina-daoudi closed this 7 months ago

yasmina-daoudi commented 10 months ago

System Info

I have been using OpenAI embeddings, specifically text-embedding-ada-002, and noticed the results are very sensitive to punctuation. I have around 1,000 chunks and each time need to retrieve the 15 chunks most similar to my query. When I test a query without punctuation and then add a dot '.' at the end, the set returned by the retriever changes relative to the unpunctuated query (some chunks are the same, but new ones may appear, or the original order differs).

FYI: I am using PGVector to store my chunk vectors.

Who can help?

No response

Information

Related Components

Reproduction

-

Expected behavior

dosubot[bot] commented 10 months ago

🤖

Hello @yasmina-daoudi,

I'm Dosu, a bot here to assist you with your issues and inquiries about LangChain. While we're waiting for a human maintainer, feel free to ask any questions you have, or let me know if there's something specific you'd like to learn more about. I'm here to help you solve bugs, answer your questions, and even help you become a contributor. Let's get started!

Based on the information provided, it seems that the sensitivity to punctuation you're experiencing is a characteristic of the OpenAI Embedding model, specifically text-embedding-ada-002. This sensitivity can indeed affect the results of your similarity search, as the model treats punctuation as significant input.
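To see why a small change to the input can reshuffle a top-k result, here is a toy sketch with made-up 3-dimensional vectors (not real ada-002 embeddings; the numbers are purely illustrative): when two stored chunks are nearly equidistant from the query, even a tiny perturbation of the query vector, such as one a trailing dot might cause, can swap their ranking.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two stored "chunk" vectors that are nearly equidistant from the query
chunks = {"chunk_a": [1.0, 0.2, 0.0], "chunk_b": [1.0, 0.0, 0.2]}

query     = [1.0, 0.1, 0.09]  # query without the trailing dot
query_dot = [1.0, 0.09, 0.1]  # slightly perturbed, as punctuation might cause

rank     = sorted(chunks, key=lambda k: cosine(query, chunks[k]), reverse=True)
rank_dot = sorted(chunks, key=lambda k: cosine(query_dot, chunks[k]), reverse=True)
print(rank, rank_dot)  # the nearest neighbour flips between the two queries
```

Real embedding perturbations are higher-dimensional, but the mechanism is the same: borderline neighbours near the k-th position are the first to enter, leave, or reorder.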

To make the model less sensitive to punctuation changes in the query, you could preprocess your text before feeding it to the model. This preprocessing could include removing or standardizing punctuation. Here's an example of how you can do this:

import string

def preprocess_text(text):
    # Strip all punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Lowercase the text
    text = text.lower()
    return text

# Use the function on the query before embedding it
preprocessed_query = preprocess_text(query)
embedding = self.embedding_function.embed_query(preprocessed_query)

This code removes all punctuation from your text and converts it to lowercase. This way, "Hello, world!" and "Hello world" both become "hello world" before embedding, making the retrieval insensitive to punctuation changes.

Please note that this is a simple example and might not work perfectly for all cases. Depending on your specific needs, you might need to adjust the preprocessing steps. For example, you might want to replace punctuation with spaces, or handle apostrophes in a special way.
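For instance, here is one such variant (a hypothetical helper, `preprocess_text_spaced`, not part of LangChain) that replaces each punctuation character with a space instead of deleting it, so tokens separated only by punctuation are not glued together:

```python
import re
import string

def preprocess_text_spaced(text):
    # Replace each punctuation character with a space instead of deleting it,
    # so "foo.bar" becomes "foo bar" rather than "foobar"
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Collapse runs of whitespace and lowercase the result
    return " ".join(text.split()).lower()

print(preprocess_text_spaced("Foo.bar, baz!"))  # "foo bar baz"
```

Whether deleting or spacing out punctuation works better depends on your corpus; it is worth comparing retrieval quality with both.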

As for the LangChain framework, it does not explicitly handle text preprocessing before feeding it to the embedding models. The text is directly passed to the embedding model without any preprocessing steps. This is evident from the embed_documents and embed_query methods in the DeepInfraEmbeddings class. However, it's important to note that the actual preprocessing might be happening inside the embedding model itself, which is not shown in the provided context.
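Since LangChain does not preprocess text before embedding, one option is to wrap your embeddings object so the same normalization is applied to both documents and queries; preprocessing only the query while the stored chunks were embedded with punctuation intact would reintroduce the mismatch. The sketch below is a hypothetical wrapper (`NormalizingEmbeddings` is not a LangChain class), and `StubEmbedder` stands in for a real embedding model purely for illustration:

```python
import string

class NormalizingEmbeddings:
    """Hypothetical wrapper that applies the same preprocessing to
    documents and queries before delegating to an underlying embedder."""

    def __init__(self, base):
        # `base` is any object exposing embed_documents / embed_query
        self.base = base

    def _clean(self, text):
        # Same normalization for both sides: strip punctuation, lowercase
        text = text.translate(str.maketrans('', '', string.punctuation))
        return text.lower().strip()

    def embed_documents(self, texts):
        return self.base.embed_documents([self._clean(t) for t in texts])

    def embed_query(self, text):
        return self.base.embed_query(self._clean(text))

class StubEmbedder:
    """Toy stand-in for a real model: 'embeds' text as its length."""
    def embed_documents(self, texts):
        return [[float(len(t))] for t in texts]
    def embed_query(self, text):
        return [float(len(text))]

wrapped = NormalizingEmbeddings(StubEmbedder())
# "Hello world." and "hello world" now produce identical embeddings
print(wrapped.embed_query("Hello world.") == wrapped.embed_query("hello world"))
```

In a real setup you would pass the wrapped object wherever LangChain expects an embeddings instance (e.g. when constructing your PGVector store), so the index and the queries stay consistent.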

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best, Dosu

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.