langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Integration issue between langchain-pinecone and Google Vertex AI textembedding-gecko@003 #20118

Closed (shixiao11 closed this issue 1 month ago)

shixiao11 commented 5 months ago

Example Code

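# Imports were omitted in the original report; these are the usual locations.
import os

from langchain_core.documents import Document
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone

if __name__ == '__main__':
    input_text = 'where is my dog?'  # renamed from `input` to avoid shadowing the builtin

    # create the embedding function using the 'textembedding-gecko@003' model
    vertexai_embedding_003 = VertexAIEmbeddings(model_name='textembedding-gecko@003')

    # init a Pinecone vector store with the Vertex AI embedding
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"), environment='us-central1-gcp')
    vector_store = PineconeVectorStore(index_name='embedding-test', embedding=vertexai_embedding_003)

    # create a test document
    doc = Document(
        page_content=input_text,
        metadata={'category': 'pet'}
    )
    # save it in the index
    vector_store.add_documents([doc])

    # similarity search against the document inserted above
    print(vector_store.similarity_search_with_score(input_text))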

Error Message and Stack Trace (if applicable)

Screenshots of the different vectors produced by embedding the same input ('where is my dog?'):

Embedding result when doing insertion

[Screenshot 2024-04-07 at 16 51 36]

Embedding result when doing query

[Screenshot 2024-04-07 at 16 51 19]

No response

Description

Hello LangChain team,

I found an embedding mismatch between adding documents to Pinecone and calling similarity_search_with_score on Pinecone when using the Google Vertex AI model 'textembedding-gecko@003'. It only happens with 'textembedding-gecko@003'; 'textembedding-gecko@001' works fine.

How to reproduce

1. Add the input string with vector_store.add_documents([doc]). Before the insertion, the code computes the vector with 'textembedding-gecko@003', then stores the vector and metadata in the vector store.
2. Search for exactly the same string with similarity_search_with_score. The expected score is 1, because the query is identical to the stored text. Instead, it returns 0.79, because the query embedding differs from the stored one.
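The expectation of a score of 1 holds if the index uses cosine similarity, which is an assumption here since the report does not state the index metric. A minimal sketch in plain Python, using made-up toy vectors rather than real gecko output, shows why identical vectors score 1 and why any difference between the stored and query vectors pulls the score below 1:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only; these are NOT real model outputs.
stored_vector = [0.2, 0.7, 0.1]  # hypothetical embedding written at insert time
query_vector = [0.6, 0.5, 0.3]   # hypothetical embedding computed at query time

# Same vector on both sides: similarity is 1 (up to floating-point error).
same = cosine_similarity(stored_vector, stored_vector)

# Different vectors for the "same" text: similarity drops below 1.
different = cosine_similarity(stored_vector, query_vector)

print(same, different)
```

So a score well below 1 for an identical string suggests that the insert path and the query path produced different vectors.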

After debugging the code, I found that text is embedded differently at the add-document stage and at the search stage. Here is a screenshot of the issue: we can see that adding documents and querying documents pass different 'embedding_task_type' values, which is why the same input produces different embedding results.

Meanwhile, the 'embedding_task_type' parameter is hardcoded in these two functions, so users cannot customize it.
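One possible workaround, sketched under assumptions rather than taken from the library: wrap the embedding object so that queries go through the same code path as documents. The class name `SameTaskTypeEmbeddings` is hypothetical and relies only on the `embed_documents` and `embed_query` methods of the LangChain embeddings interface. `FakeTaskTypeEmbeddings` is a stand-in that mimics a task-type-aware model so the sketch runs without Vertex AI credentials.

```python
class SameTaskTypeEmbeddings:
    """Hypothetical wrapper: embed queries the same way as documents."""

    def __init__(self, inner):
        self.inner = inner  # any object exposing embed_documents()

    def embed_documents(self, texts):
        return self.inner.embed_documents(texts)

    def embed_query(self, text):
        # Reuse the document path so the same text always maps to the same vector.
        return self.inner.embed_documents([text])[0]


class FakeTaskTypeEmbeddings:
    """Stand-in for a task-type-aware model; NOT the real Vertex AI client."""

    def embed_documents(self, texts):
        # "Document" flavour of the toy embedding.
        return [[float(len(t)), 1.0] for t in texts]

    def embed_query(self, text):
        # "Query" flavour: deliberately different for the same text.
        return [float(len(text)), -1.0]


raw = FakeTaskTypeEmbeddings()
wrapped = SameTaskTypeEmbeddings(raw)
text = "where is my dog?"

# Without the wrapper the two paths disagree; with it they agree.
mismatch = raw.embed_query(text) != raw.embed_documents([text])[0]
match = wrapped.embed_query(text) == wrapped.embed_documents([text])[0]
print(mismatch, match)
```

In the reproduction above, this would be passed as `embedding=SameTaskTypeEmbeddings(vertexai_embedding_003)`. Depending on the `langchain-pinecone` version, the wrapper may need to subclass `langchain_core.embeddings.Embeddings`. The trade-off is that queries no longer use a query-specific task type, which may affect retrieval quality for real, non-identical queries.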

Google's documentation explains this parameter here: https://cloud.google.com/python/docs/reference/aiplatform/latest/vertexai.language_models.TextEmbeddingInput

In conclusion, developers who follow the LangChain documentation to insert and query with 'textembedding-gecko@003' can very easily run into this issue.

System Info

langchain==0.1.14
langchain_google_vertexai==0.1.2
langchain-pinecone==0.0.3

CrazyWr commented 4 months ago

https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#api_changes_to_models_released_on_or_after_august_2023

TaskType is a new parameter, but I don't think this is a good implementation.