Since upgrading from v1.0.4 to v1.0.5 and v1.0.6, I have had issues with `RETRIEVAL_DOCUMENT` embeddings specifically. `RETRIEVAL_QUERY` continues to work fine, but I suspect this is because a query is a single chunk of text, whereas my document texts are embedded as a large batch.
Since this upgrade, embedding 368 chunks of text returns 296 embeddings rather than the expected 368, without raising an error. The embeddings look fine and have the expected 768 dimensions, but they no longer produce reliable similarity results from the vector database (MongoDB Atlas).
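Because the mismatch passes silently, a small guard around the embedding call would at least surface it. The following is a minimal sketch, not part of any library: `embed_checked` and `fake_embed` are hypothetical names, and `fake_embed` only mimics the symptom described above.

```python
from typing import Callable, List

Embedder = Callable[[List[str]], List[List[float]]]


def embed_checked(texts: List[str], embed: Embedder) -> List[List[float]]:
    """Call `embed` and raise if the vector count differs from the text count."""
    vectors = embed(texts)
    if len(vectors) != len(texts):
        raise ValueError(
            f"expected {len(texts)} embeddings, got {len(vectors)}"
        )
    return vectors


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in that silently drops the first 72 texts,
    # mimicking the 368 to 296 symptom; this is NOT the real library code.
    return [[float(len(t))] for t in texts[72:]]
```

In practice, the `embed` argument would presumably be the `embed_documents` method of a `VertexAIEmbeddings` instance; `embed_checked(texts, fake_embed)` with 368 texts raises `ValueError` instead of returning 296 vectors.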
After spending time reviewing releases for `langchain`, `langchain-mongodb` and `langchain-google-vertexai`, I have pinpointed the issue to a single line of code changed between v1.0.4 and v1.0.5 in this library. With this change removed (and nothing else), my embeddings behave as expected. With the line in place, query embeddings no longer return the right results from my document embeddings.
PR #278 introduces the breaking change
The addition of `batches = batches[1:]` on line 268 of `libs/vertexai/langchain_google_vertexai/embeddings.py` is the culprit. I will defer to @lspataroG, who likely understands this addition better than I do, but I can't be the only one with this issue. My use case is nothing special: the `langchain-mongodb` library is simply calling the `VertexAIEmbeddings.embed_documents` method with a basic list of texts longer than the 250 maximum.
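To illustrate why skipping the first batch could produce this symptom, here is a simplified sketch. It is not the library's actual code: `split_by_sizes`, `embed_batches`, and the batch sizes are hypothetical, chosen only because 368 - 296 = 72, which would be consistent with a first batch of 72 texts being dropped. That is an assumption, not something I have confirmed.

```python
from typing import List


def split_by_sizes(texts: List[str], sizes: List[int]) -> List[List[str]]:
    """Split texts into consecutive batches of the given sizes."""
    batches, start = [], 0
    for size in sizes:
        batches.append(texts[start:start + size])
        start += size
    return batches


def embed_batches(batches: List[List[str]]) -> List[str]:
    # Stand-in "embedding": tag each text so alignment stays visible.
    return [f"vec({t})" for batch in batches for t in batch]


texts = [f"chunk-{i}" for i in range(368)]
batches = split_by_sizes(texts, [72, 250, 46])  # hypothetical batch sizes

all_vectors = embed_batches(batches)
after_slice = embed_batches(batches[1:])  # effect of dropping the first batch

print(len(all_vectors), len(after_slice))  # 368 296

# Pairing by position silently matches each text with a later text's vector.
pairs = list(zip(texts, after_slice))
print(pairs[0])  # ('chunk-0', 'vec(chunk-72)')
```

If the vector store pairs texts and vectors by position, a shift like this would make the stored embeddings belong to the wrong chunks, which might explain why the similarity results are unreliable rather than merely incomplete.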
I am happy to help with a fix, as this is currently blocking me, but I don't understand the change well enough to say why it would cause my 368 embeddings to come out as 296 and no longer be usable.
Embedding model used: `text-embedding-004`