langchain-ai / langchain-google


Corrupt/missing text embeddings since `v1.0.5` when submitting texts > 250 batch size #337

Closed · kelfish closed this 3 days ago

kelfish commented 3 days ago

Since upgrading from v1.0.4 to v1.0.5 and v1.0.6 I have had issues with RETRIEVAL_DOCUMENT embeddings specifically. RETRIEVAL_QUERY continues to work fine, but I suspect that is because a query is a single chunk of text, whereas my document embeddings are generated from a large batch.

Since this upgrade, embedding 368 chunks of text returns 296 embeddings rather than the expected 368, with no error raised. The embeddings that do come back look fine and have the expected 768 dimensions, but they no longer produce reliable similarity results from the vector database (MongoDB Atlas).

After spending time reviewing the releases of langchain, langchain-mongodb and langchain-google-vertexai, I have pinpointed the issue to a single line of code changed between v1.0.4 and v1.0.5 of this library. With that change reverted (and nothing else), my embeddings behave as expected; with the line in place, query embeddings no longer return the right results against my document embeddings.

PR #278 introduces the breaking change

The addition of `batches = batches[1:]` on line 268 of libs/vertexai/langchain_google_vertexai/embeddings.py is the culprit. I will defer to @lspataroG, who likely understands this addition better than I do, but I can't be the only one hitting this. My use case is nothing special: the langchain-mongodb library simply calls VertexAIEmbeddings.embed_documents with a plain list of texts larger than the 250-per-request maximum.
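
To make the failure mode concrete, here is a deliberately simplified sketch of how slicing off the first batch before the main embedding loop could silently lose texts. This is hypothetical illustration code, not the actual logic in embeddings.py, and the toy numbers will not match my exact 368/296 counts.

```python
# Hypothetical, simplified illustration -- NOT the library's actual code.
# It only shows how `batches = batches[1:]` can make the returned embedding
# count smaller than the input count if the first batch's results are never kept.

def embed_documents_simplified(texts: list[str], batch_size: int = 250) -> list[list[float]]:
    # Split the input into fixed-size batches, e.g. 368 texts -> sizes [250, 118].
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    embeddings: list[list[float]] = []

    # Suppose the first batch is consumed by some preliminary step...
    _first = batches[0]
    batches = batches[1:]  # ...and then discarded without its embeddings being kept.

    # The main loop only sees the remaining batches, so the caller receives
    # fewer embeddings than texts, with no error raised.
    for batch in batches:
        embeddings.extend([[0.0] * 768 for _ in batch])  # stand-in for the real API call

    return embeddings


chunks = [f"chunk {i}" for i in range(368)]
print(len(chunks), len(embed_documents_simplified(chunks)))  # 368 vs 118 in this toy version
```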

I am happy to help with a fix, as this is currently blocking me, but I don't understand the change well enough to see why it causes my 368 embeddings to come back as 296 and become unusable.

Embedding model used: text-embedding-004
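
Until this is resolved I have added a defensive check on my side so an incomplete batch can't silently make it into MongoDB Atlas. This is just a small helper I wrote for my own pipeline (embed_and_verify is my name, not a library API):

```python
def embed_and_verify(embedder, chunks: list[str], expected_dim: int = 768) -> list[list[float]]:
    """Embed chunks and fail loudly if the result looks incomplete or malformed."""
    embeddings = embedder.embed_documents(chunks)

    # One embedding per input chunk, otherwise something was dropped upstream.
    if len(embeddings) != len(chunks):
        raise RuntimeError(
            f"embed_documents returned {len(embeddings)} embeddings for {len(chunks)} chunks"
        )

    # text-embedding-004 should give 768-dimensional vectors.
    if any(len(vec) != expected_dim for vec in embeddings):
        raise RuntimeError("unexpected embedding dimensionality")

    return embeddings
```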

kelfish commented 3 days ago

I'm trying to create a small reproducible code snippet and will re-open when I have it...
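
In case it helps in the meantime, this is roughly what I expect that snippet to look like (untested as written; it assumes Vertex AI credentials and project configuration are already set up in the environment, and the chunk contents are placeholders):

```python
from langchain_google_vertexai import VertexAIEmbeddings

embedder = VertexAIEmbeddings(model_name="text-embedding-004")

# Anything comfortably above the 250-texts-per-request maximum should trigger it.
chunks = [f"document chunk number {i}" for i in range(368)]

embeddings = embedder.embed_documents(chunks)

print(len(chunks), len(embeddings))
# Expected: 368 and 368. On v1.0.5/v1.0.6 I get fewer embeddings back (296 in my case).
assert len(embeddings) == len(chunks)
```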