langchain-ai / langchain


SemanticChunker: list index out of range #23250

Open · amitjoy opened 2 weeks ago

amitjoy commented 2 weeks ago

Example Code

In text_splitter.py (SemanticChunker)

    def _calculate_sentence_distances(
        self, single_sentences_list: List[str]
    ) -> Tuple[List[float], List[dict]]:
        """Split text into multiple components."""

        _sentences = [
            {"sentence": x, "index": i} for i, x in enumerate(single_sentences_list)
        ]
        sentences = combine_sentences(_sentences, self.buffer_size)
        embeddings = self.embeddings.embed_documents(
            [x["combined_sentence"] for x in sentences]
        )
        for i, sentence in enumerate(sentences):
            sentence["combined_sentence_embedding"] = embeddings[i] << Failed here since embeddings size is less than i at a later point

        return calculate_cosine_distances(sentences)
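
One way to surface the mismatch earlier is to wrap the embedding model so that a short result fails loudly instead of causing a late IndexError deep inside the chunker. A hypothetical sketch (CountCheckedEmbeddings is not part of LangChain):

from typing import List

from langchain_core.embeddings import Embeddings

class CountCheckedEmbeddings(Embeddings):
    """Delegates to another Embeddings instance and verifies the result count."""

    def __init__(self, inner: Embeddings):
        self.inner = inner

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        vectors = self.inner.embed_documents(texts)
        if len(vectors) != len(texts):
            # fail fast with a descriptive error instead of a later IndexError
            raise ValueError(
                f"embedding model returned {len(vectors)} vectors for {len(texts)} texts"
            )
        return vectors

    def embed_query(self, text: str) -> List[float]:
        return self.inner.embed_query(text)

For example, SemanticChunker(CountCheckedEmbeddings(embedding_model)) would raise at embedding time rather than during the distance calculation.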

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/A72281951/telly/telly-backend/ingestion/main.py", line 132, in start
    store.load_data_to_db(configured_spaces)
  File "/Users/A72281951/telly/telly-backend/ingestion/common/utils.py", line 70, in wrapper
    value = func(*args, **kwargs)
  File "/Users/A72281951/telly/telly-backend/ingestion/agent/store/db.py", line 86, in load_data_to_db
    for docs in self.ingest_data(spaces):
  File "/Users/A72281951/telly/telly-backend/ingestion/agent/store/db.py", line 77, in ingest_data
    documents.extend(self.chunker.split_documents(docs))
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 258, in split_documents
    return self.create_documents(texts, metadatas=metadatas)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 243, in create_documents
    for chunk in self.split_text(text):
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 201, in split_text
    distances, sentences = self._calculate_sentence_distances(single_sentences_list)
  File "/Users/A72281951/telly/venv/ingestion/lib/python3.10/site-packages/langchain_experimental/text_splitter.py", line 186, in _calculate_sentence_distances
    sentence["combined_sentence_embedding"] = embeddings[i]
IndexError: list index out of range

System Info

langchain==0.2.5
langchain-community==0.2.5
langchain-core==0.2.9
langchain-experimental==0.0.61
langchain-google-vertexai==1.0.5
langchain-postgres==0.0.8
langchain-text-splitters==0.2.1

Mac M3, Python 3.10.14

tibor-reiss commented 1 week ago

Hi @amitjoy, could you please share your MVE? I am not able to reproduce this using Cohere (instead of OpenAI) and the sample text referenced in the notebook from Greg Kamradt. I have Python 3.10.11, but the exact same packages. The code below runs without any errors (percentile and 95 are the defaults, so I did not change them).

import os
from langchain.embeddings import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.docstore.document import Document
from langchain_community.embeddings import CohereEmbeddings

if __name__ == '__main__':
    os.environ["OPENAI_API_KEY"] = "<your_key>"
    os.environ["COHERE_API_KEY"] = "<your_key>"

    with open(r'./data/mit.txt') as file:
        essay = file.read()
        doc = Document(page_content=essay)

    # embeddings = OpenAIEmbeddings()
    embeddings = CohereEmbeddings(model="embed-english-light-v3.0")
    chunker = SemanticChunker(embeddings)
    docs = chunker.transform_documents([doc, ])
    print(f"{len(docs)}")

amitjoy commented 1 week ago

I am currently using VertexAI Gemini to ingest data from Confluence:

self.chunker = SemanticChunker(
    embeddings=vector_db.embedding,  # VertexAIEmbeddings
    breakpoint_threshold_type=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.type,  # "percentile"
    breakpoint_threshold_amount=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.amount,  # 95.0
)

    def ingest_data(self, spaces: List[str]):
        for space in spaces:
            click.echo(f"⇢ Loading data from space '{space}'")
            confluence_loader = self.loader(space)

            documents: List[Document] = []
            if self.chunker is not None:
                docs: List[Document] = confluence_loader.load()
                documents.extend(self.chunker.split_documents(docs))
            elif self.splitter is not None:
                documents.extend(confluence_loader.load_and_split(self.splitter))

            """adding space ID to the existing metadata"""
            for doc in documents:
                doc.metadata["space_key"] = space
                # the following metadata is required for ragas
                doc.metadata['filename'] = space
            yield documents

tibor-reiss commented 1 week ago

Hi @amitjoy, this is not an MVE; e.g., the elif branch does not even use the chunker.

My best guess is that you are not getting back all of the embeddings. The stack trace is pretty clear about this, so, e.g., try printing out len(embeddings) just before the for loop.
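
For instance, a standalone sanity check against the embedding model (the model name below is just an example) would confirm whether the counts diverge:

from langchain_google_vertexai import VertexAIEmbeddings

embedding_model = VertexAIEmbeddings(model_name="text-embedding-004")

# embed_documents should return exactly one vector per input text
texts = [f"This is test sentence number {i}." for i in range(200)]
vectors = embedding_model.embed_documents(texts)
print(len(texts), len(vectors))  # a mismatch here confirms the suspicion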

What I would recommend is to strip your code down to a single document on which you call split_documents (or transform_documents, which is just a wrapper). Additionally, try to get rid of the confluence_loader, because it should not affect the end result.

An example of an MVE: use the code I provided (including the mentioned document) but replace CohereEmbeddings with VertexAIEmbeddings. If it fails, it's an issue with VertexAIEmbeddings. If it does not fail, then use one of your documents. If that does not fail either, it's an issue with the confluence_loader; otherwise it's an issue with the document.
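
Concretely, that swap could look like this (a sketch assuming langchain-google-vertexai is installed, Google Cloud credentials are configured, and the same mit.txt document; the model name is just an example):

from langchain_core.documents import Document
from langchain_experimental.text_splitter import SemanticChunker
from langchain_google_vertexai import VertexAIEmbeddings

with open(r'./data/mit.txt') as file:
    doc = Document(page_content=file.read())

embeddings = VertexAIEmbeddings(model_name="text-embedding-004")  # swapped in for CohereEmbeddings
chunker = SemanticChunker(embeddings)
docs = chunker.transform_documents([doc])
print(len(docs))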

jsconan commented 1 week ago

Hi @tibor-reiss, @amitjoy,

I have a similar issue. It can be reproduced with the following snippet:

import itertools
import lorem
from google.cloud import aiplatform
# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings # this one fails
from langchain_experimental.text_splitter import SemanticChunker

PROJECT_ID = "<your-project-id>"  # placeholder
LOCATION = "<your-location>"  # placeholder

aiplatform.init(project=PROJECT_ID, location=LOCATION)

embedding_model = VertexAIEmbeddings("text-embedding-004")

text_splitter = SemanticChunker(embedding_model)

document_chunks = text_splitter.split_text(" ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), 200)))

Please note that the issue does not occur when using from langchain.embeddings import VertexAIEmbeddings, but that import triggers a deprecation warning.

The problem seems to come from the batch-size calculation in langchain_google_vertexai/embeddings.py, which produces arbitrarily low values for the batch size despite the total number of texts being higher.

As a result, in text_splitter.py the length of embeddings differs from that of sentences:

        embeddings = self.embeddings.embed_documents(  # <-- does not return the correct number of embeddings
            [x["combined_sentence"] for x in sentences]
        )
        for i, sentence in enumerate(sentences):
            sentence["combined_sentence_embedding"] = embeddings[i]  # <-- IndexError once i >= len(embeddings)
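
For reference, client-side batching should never change the number of vectors returned; a well-behaved batching loop (a generic sketch, not the actual VertexAI implementation) preserves the count regardless of the batch size:

from typing import Callable, List

def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    # every text lands in exactly one batch, so the output length
    # always equals len(texts), whatever batch_size is chosen
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors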

tibor-reiss commented 6 days ago

Hi @jsconan, thanks for checking this. As suspected, it's an issue (or, let's say, a feature) with the new VertexAIEmbeddings, not with the SemanticChunker. I can see in the source that there were indeed some significant changes. I would recommend changing the title of this issue or, even better, opening a new one on https://github.com/langchain-ai/langchain-google.

jsconan commented 6 days ago

Thank you @tibor-reiss. I've created an issue as you suggested: https://github.com/langchain-ai/langchain-google/issues/353

jsconan commented 5 days ago

@amitjoy FYI, the issue has been fixed in the unreleased version: https://github.com/langchain-ai/langchain-google/issues/353#issuecomment-2210474071