Cannot create MistralAI embeddings from pdf or urls

PabloVD commented 4 days ago

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_chroma import Chroma
from langchain_mistralai import MistralAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import requests
from pathlib import Path

# Get data from url
url = 'https://camels.readthedocs.io/_/downloads/en/latest/pdf/'
r = requests.get(url, stream=True)
document_path = Path('data.pdf')

document_path.write_bytes(r.content)
# document_path = "camels-readthedocs-io-en-latest.pdf"
loader = PyPDFLoader(document_path)
docs = loader.load()

# Split text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create vector store
embeddings = MistralAIEmbeddings()
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

Error Message and Stack Trace (if applicable)

An error occurred with MistralAI: 'data'
Traceback (most recent call last):
  File "/home/tda/GenAICourse/RAG/ragbot_langchain/embeddings.py", line 24, in <module>
    vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_chroma/vectorstores.py", line 1128, in from_documents
    return cls.from_texts(
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_chroma/vectorstores.py", line 1089, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_chroma/vectorstores.py", line 508, in add_texts
    embeddings = self._embedding_function.embed_documents(texts)
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_mistralai/embeddings.py", line 222, in embed_documents
    return [
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_mistralai/embeddings.py", line 225, in <listcomp>
    for embedding_obj in response.json()["data"]
KeyError: 'data'

Description

I'm trying to create MistralAI embeddings from a PDF document, but I get the error shown above. Are there flags or parameters of RecursiveCharacterTextSplitter or MistralAIEmbeddings that could avoid this issue? Thanks in advance.

System Info

System Information

OS: Linux
OS Version: #134~20.04.1-Ubuntu SMP Tue Oct 1 15:27:33 UTC 2024
Python Version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]

Package Information

langchain_core: 0.3.14
langchain: 0.3.5
langchain_community: 0.3.3
langsmith: 0.1.138
langchain_chroma: 0.1.4
langchain_mistralai: 0.2.0
langchain_openai: 0.2.4
langchain_text_splitters: 0.3.1

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: 4.0.3
chromadb: 0.5.16
dataclasses-json: 0.6.7
fastapi: 0.112.4
httpx: 0.27.2
httpx-sse: 0.4.0
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.53.0
orjson: 3.10.10
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.0
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.36
tenacity: 9.0.0
tiktoken: 0.8.0
tokenizers: 0.20.1
typing-extensions: 4.12.2

jamesev15 commented 4 days ago

The KeyError: 'data' in the issue occurs due to a 429 Too Many Requests response, which prevents some documents from being processed and results in payloads that lack the data key. To reveal the actual error and improve clarity, I propose adding a check with raise_for_status() to handle cases where the status code is not 200. This will ensure that the code only attempts to access the data key when the response is successful.
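
To illustrate the idea, here is a minimal sketch of checking the response status before reading the data key. It is not the actual langchain_mistralai code: extract_embeddings and FakeResponse are hypothetical names used only for this example, and the payload shape (a "data" list of objects with an "embedding" field) is an assumption based on the traceback above.

```python
from typing import Any, Dict, List


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object (illustration only)."""

    def __init__(self, status_code: int, payload: Dict[str, Any]) -> None:
        self.status_code = status_code
        self._payload = payload

    def raise_for_status(self) -> None:
        # Mimics the usual HTTP client behaviour: raise on 4xx/5xx codes.
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}: {self._payload}")

    def json(self) -> Dict[str, Any]:
        return self._payload


def extract_embeddings(response: Any) -> List[List[float]]:
    """Hypothetical helper: surface HTTP errors before touching 'data'."""
    # A 429 now fails here with a clear message instead of a later KeyError.
    response.raise_for_status()
    payload = response.json()
    # Assumed payload shape: {"data": [{"embedding": [...]}, ...]}
    return [list(map(float, item["embedding"])) for item in payload["data"]]


if __name__ == "__main__":
    ok = FakeResponse(200, {"data": [{"embedding": [0.1, 0.2]}]})
    print(extract_embeddings(ok))

    limited = FakeResponse(429, {"message": "rate limit exceeded"})
    try:
        extract_embeddings(limited)
    except RuntimeError as exc:
        print(f"Request failed: {exc}")
```

With a check like this, a rate-limited request would report the HTTP status rather than the misleading KeyError: 'data'.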

keenborder786 commented 4 days ago

@jamesev15 please see my PR. Rather than just raising the error, I have implemented a simple retry mechanism to account for this, and it works.
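
For readers who want a picture of what retrying on a rate limit can look like, here is a rough sketch using exponential backoff. It is not the code from the PR, which is not shown in this thread: RateLimitError, retry_with_backoff, and the default delays are all hypothetical choices made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error representing a 429 Too Many Requests response."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the wait after each 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_embed() -> str:
        # Simulated embedding call that is rate limited twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RateLimitError("429 Too Many Requests")
        return "embeddings"

    print(retry_with_backoff(flaky_embed, base_delay=0.01))
```

The sleep parameter is injectable only so the backoff schedule can be tested without real waiting; in practice the default time.sleep would be used.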

jamesev15 commented 3 days ago

Sure! I like your approach
