amitjoy opened 2 weeks ago
Hi @amitjoy, could you please share your MVE? I am not able to reproduce it using Cohere (instead of OpenAI) and the sample text referenced in the notebook from Greg Kamradt. I have Python 3.10.11 but the exact same packages. The code below works without any errors (percentile and 95 are the defaults, so I did not change them).
```python
import os

from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import CohereEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

if __name__ == '__main__':
    os.environ["OPENAI_API_KEY"] = "<your_key>"
    os.environ["COHERE_API_KEY"] = "<your_key>"

    with open(r'./data/mit.txt') as file:
        essay = file.read()
    doc = Document(page_content=essay)

    # embeddings = OpenAIEmbeddings()
    embeddings = CohereEmbeddings(model="embed-english-light-v3.0")
    chunker = SemanticChunker(embeddings)
    docs = chunker.transform_documents([doc])
    print(f"{len(docs)}")
```
I am currently using VertexAI Gemini to ingest data from Confluence:
```python
self.chunker = SemanticChunker(
    embeddings=vector_db.embedding,  # VertexAIEmbeddings
    breakpoint_threshold_type=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.type,  # percentile
    breakpoint_threshold_amount=self.settings.db.vector_db.chunking.semantic.breakpoint_threshold.amount,  # 95.0
)

def ingest_data(self, spaces: List[str]):
    for space in spaces:
        click.echo(f"⇢ Loading data from space '{space}'")
        confluence_loader = self.loader(space)
        documents: List[Document] = []
        if self.chunker is not None:
            docs: List[Document] = confluence_loader.load()
            documents.extend(self.chunker.split_documents(docs))
        elif self.splitter is not None:
            documents.extend(confluence_loader.load_and_split(self.splitter))
        # add the space ID to the existing metadata
        for doc in documents:
            doc.metadata["space_key"] = space
            # the following metadata is required for ragas
            doc.metadata['filename'] = space
        yield documents
```
Hi @amitjoy, this is not an MVE; for example, the elif branch does not even use the chunker.
My best guess is that you are not getting back any embeddings. The stack trace is pretty clear about this, so try, e.g., printing len(embeddings) just before the for loop (one way to instrument this is sketched below).
What I would recommend is to strip your code down to a single document on which you call split_documents (or transform_documents, which is just a wrapper). Additionally, try to get rid of the confluence_loader, because it should not affect the end result.
An example of an MVE: use the code I provided (including the mentioned document) but just replace CohereEmbeddings with VertexAIEmbeddings. If it fails, it's an issue with VertexAIEmbeddings. If it does not fail, then use one of your documents. If that does not fail, it's an issue with the confluence_loader; otherwise, it's an issue with the document.
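A minimal sketch of that instrumentation, assuming SemanticChunker only needs embed_documents/embed_query at split time (duck typing; the wrapper class below is illustrative and not part of any langchain API):

```python
from typing import List

class CountingEmbeddings:
    """Wraps an embeddings object and reports input vs. output counts."""

    def __init__(self, inner):
        self.inner = inner

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        vectors = self.inner.embed_documents(texts)
        # If these two numbers differ, the embeddings client is the culprit,
        # not the chunker.
        print(f"embed_documents: {len(texts)} texts in, {len(vectors)} vectors out")
        return vectors

    def embed_query(self, text: str) -> List[float]:
        return self.inner.embed_query(text)

# Usage: chunker = SemanticChunker(CountingEmbeddings(vector_db.embedding))
```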
Hi @tibor-reiss, @amitjoy,
I have a similar issue. It can be reproduced with the following snippet:
```python
import itertools

import lorem
from google.cloud import aiplatform
# from langchain.embeddings import VertexAIEmbeddings  # this one works
from langchain_google_vertexai import VertexAIEmbeddings  # this one fails
from langchain_experimental.text_splitter import SemanticChunker

PROJECT_ID = "<your_project_id>"  # placeholder
LOCATION = "<your_location>"  # placeholder

aiplatform.init(project=PROJECT_ID, location=LOCATION)
embedding_model = VertexAIEmbeddings("text-embedding-004")
text_splitter = SemanticChunker(embedding_model)
document_chunks = text_splitter.split_text(
    " ".join(itertools.islice(lorem.sentence(word_range=(8, 16)), 200))
)
```
Please note that the issue does not occur when using `from langchain.embeddings import VertexAIEmbeddings`, but that import triggers a deprecation warning.
The problem seems to come from the batch-size calculation in langchain_google_vertexai/embeddings.py, which produces arbitrarily low values for the batch size despite the total number of texts being higher. As a result, in text_splitter.py the length of embeddings differs from that of sentences:
```python
embeddings = self.embeddings.embed_documents(  # <<< does not return the correct number of embeddings
    [x["combined_sentence"] for x in sentences]
)
for i, sentence in enumerate(sentences):
    sentence["combined_sentence_embedding"] = embeddings[i]  # <<< fails here because len(embeddings) < len(sentences)
```
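A quick way to confirm the contract violation outside of langchain is to call the embeddings client directly and compare the counts (a standalone check, not part of the library; the text list is arbitrary):

```python
from langchain_google_vertexai import VertexAIEmbeddings

embedding_model = VertexAIEmbeddings("text-embedding-004")
texts = [f"sentence number {i}." for i in range(200)]  # any sufficiently long list
vectors = embedding_model.embed_documents(texts)
# embed_documents is expected to return exactly one vector per input text;
# the buggy batch-size calculation breaks this contract.
assert len(vectors) == len(texts), (
    f"got {len(vectors)} vectors for {len(texts)} texts"
)
```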
Hi @jsconan, thanks for checking this. As suspected, it's an issue (or let's say feature) with the new VertexAIEmbeddings, and not with the SemanticChunker. I can see in the source that there were indeed some significant changes. I would recommend changing the title of this issue, or even better, opening a new one on https://github.com/langchain-ai/langchain-google.
Thank you @tibor-reiss. I've created an issue as you suggested: https://github.com/langchain-ai/langchain-google/issues/353
@amitjoy FYI, the issue has been fixed in the unreleased version: https://github.com/langchain-ai/langchain-google/issues/353#issuecomment-2210474071
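Until that release lands, two possible workarounds: keep using the deprecated `from langchain.embeddings import VertexAIEmbeddings` import, which @jsconan reported as working, or (assuming the fix is merged to the repository's main branch) install the package from source with `pip install "git+https://github.com/langchain-ai/langchain-google.git#subdirectory=libs/vertexai"`.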
System Info:
- langchain==0.2.5
- langchain-community==0.2.5
- langchain-core==0.2.9
- langchain-experimental==0.0.61
- langchain-google-vertexai==1.0.5
- langchain-postgres==0.0.8
- langchain-text-splitters==0.2.1
- Platform: Mac M3
- Python 3.10.14