Closed: jsconan closed this issue 2 months ago
Can you please re-install langchain-google-vertexai from GitHub and try again? I believe there was a bug that was fixed last week.
P.S. Don't forget you need to uninstall the existing version before installing from GitHub.
Thank you @lkuligin. Indeed, the GitHub version does not have the issue. I am now waiting for its release.
In the meantime, here is a way to make it work with pip:
pip uninstall langchain-google-vertexai
pip install git+ssh://git@github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai
Checked other resources
Description
I'm splitting documents using SemanticChunker with VertexAIEmbeddings. When the number of chunks is high enough (more than ~120), I get:
IndexError: list index out of range
Note that the issue does not occur with the previous implementation,
langchain.embeddings.VertexAIEmbeddings
, although that one triggers a deprecation warning.
The problem seems to come from the batch-size calculation in
langchain_google_vertexai/embeddings.py
, which produces arbitrarily low values for the batch size even though the total number of texts is higher. More precisely, the first batch is fine, but the second is smaller than expected, even though the number of remaining chunks should produce more batches.
Example Code
Error Message and Stack Trace (if applicable)
System Info
langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.11
langchain-experimental==0.0.62
langchain-google-community==1.0.6
langchain-google-vertexai==1.0.6
langchain-text-splitters==0.2.1
Mac M3 Pro (macOS 14.5), Python 3.12