langchain-ai / langchain-google


IndexError: list index out of range when using VertexAIEmbeddings with SemanticChunker #353

Closed jsconan closed 2 months ago

jsconan commented 2 months ago


Description

I'm splitting documents using SemanticChunker with VertexAIEmbeddings. When the number of chunks is high enough (more than ~120), I get IndexError: list index out of range.

Please note that the issue does not occur with the previous implementation from langchain.embeddings.VertexAIEmbeddings, although that one triggers a deprecation warning.

The problem seems to come from the batch-size calculation in langchain_google_vertexai/embeddings.py, which produces arbitrarily low batch sizes even though the total number of texts is higher. More precisely, the first batch is fine, but the second one is smaller than the remaining number of chunks warrants, so some texts are never embedded and embed_documents returns fewer embeddings than it received texts.
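
A quick way to confirm the mismatch, independent of SemanticChunker (a hypothetical check, reusing the embedding_model defined in the example below):

texts = [f"sentence number {i}" for i in range(200)]
embeddings = embedding_model.embed_documents(texts)
# On the affected version this assertion fails: fewer embeddings come back
# than texts were sent in, which is exactly what trips SemanticChunker.
assert len(embeddings) == len(texts), f"{len(embeddings)} != {len(texts)}"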

Example Code

import os
import getpass
import itertools
import lorem
from dotenv import load_dotenv
from google.cloud import aiplatform
# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings # this one fails
from langchain_experimental.text_splitter import SemanticChunker

load_dotenv()
# .env file must look like this:
#
# GOOGLE_APPLICATION_CREDENTIALS=
# PROJECT_ID=
# LOCATION=
#

PROJECT_ID = os.environ.get("PROJECT_ID")
LOCATION = os.environ.get("LOCATION", "europe-west1")

if PROJECT_ID is None:
    PROJECT_ID = getpass.getpass("Project ID")

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = VertexAIEmbeddings("text-embedding-004")

text_splitter = SemanticChunker(embedding_model)

NB_SENTENCES = 200 # works with up to ~120 sentences; fails above that

document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES)))
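
For reference, a hypothetical sanity check to append to the script: on the affected version it is never reached, because split_text raises first; on a fixed install it should print the chunk count.

print(f"{len(document_chunks)} chunks")  # never reached on the affected version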

Error Message and Stack Trace (if applicable)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[1], line 34
     29 text_splitter = SemanticChunker(embedding_model)
     32 NB_SENTENCES = 200 # up to 120 it is ok
---> 34 document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), NB_SENTENCES)))

File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:215, in SemanticChunker.split_text(self, text)
    213 if len(single_sentences_list) == 1:
    214     return single_sentences_list
--> 215 distances, sentences = self._calculate_sentence_distances(single_sentences_list)
    216 if self.number_of_chunks is not None:
    217     breakpoint_distance_threshold = self._threshold_from_clusters(distances)

File ~/Projects/ml/ml-research/vertex-rag/.venv/lib/python3.12/site-packages/langchain_experimental/text_splitter.py:200, in SemanticChunker._calculate_sentence_distances(self, single_sentences_list)
    196 embeddings = self.embeddings.embed_documents(
    197     [x["combined_sentence"] for x in sentences]
    198 )
    199 for i, sentence in enumerate(sentences):
--> 200     sentence["combined_sentence_embedding"] = embeddings[i]
    202 return calculate_cosine_distances(sentences)

IndexError: list index out of range

System Info

langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.11
langchain-experimental==0.0.62
langchain-google-community==1.0.6
langchain-google-vertexai==1.0.6
langchain-text-splitters==0.2.1

Mac M3 Pro (macOS 14.5), Python 3.12

lkuligin commented 2 months ago

Can you please re-install langchain-google-vertexai from GitHub and try again? I believe there was a bug that was fixed last week. P.S. Don't forget to uninstall the existing version before installing from GitHub.

jsconan commented 2 months ago

Thank you @lkuligin! Indeed, the GitHub version does not have the issue. I am now waiting for its release.

In the meantime, here is a way to make it work with pip:

pip uninstall langchain-google-vertexai
pip install git+ssh://git@github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai
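
If SSH access to GitHub is not configured, the HTTPS form of the same install should work as well:

pip install git+https://github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai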