langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
95.52k stars · 15.51k forks

Cannot create MistralAI embeddings from pdf or urls #27790

PabloVD commented 4 weeks ago

Example Code

from langchain_chroma import Chroma
from langchain_mistralai import MistralAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import requests
from pathlib import Path

# Get data from url
url = 'https://camels.readthedocs.io/_/downloads/en/latest/pdf/'
r = requests.get(url, stream=True)
document_path = Path('data.pdf')

document_path.write_bytes(r.content)
# document_path = "camels-readthedocs-io-en-latest.pdf"
loader = PyPDFLoader(document_path)
docs = loader.load()

# Split text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create vector store
embeddings = MistralAIEmbeddings()
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

Error Message and Stack Trace (if applicable)

An error occurred with MistralAI: 'data'
Traceback (most recent call last):
  File "/home/tda/GenAICourse/RAG/ragbot_langchain/embeddings.py", line 24, in <module>
    vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_chroma/vectorstores.py", line 1128, in from_documents
    return cls.from_texts(
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_chroma/vectorstores.py", line 1089, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_chroma/vectorstores.py", line 508, in add_texts
    embeddings = self._embedding_function.embed_documents(texts)
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_mistralai/embeddings.py", line 222, in embed_documents
    return [
  File "/home/tda/miniconda3/envs/ragbot2/lib/python3.10/site-packages/langchain_mistralai/embeddings.py", line 225, in <listcomp>
    for embedding_obj in response.json()["data"]
KeyError: 'data'

Description

I'm trying to create MistralAI embeddings from a PDF document, but I get the error above. Are there flags or parameters of RecursiveCharacterTextSplitter or MistralAIEmbeddings that could avoid this issue? Thanks in advance.
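In case the failure turns out to be rate-limit related, one client-side mitigation is to feed the splits to the vector store in smaller batches with a pause between calls, so each embedding request is smaller and spaced out. This is a hypothetical workaround sketch, not a documented MistralAIEmbeddings parameter; the batch size and delay are illustrative guesses, and `add_in_batches` is a helper invented here:

```python
import time

def add_in_batches(vectorstore, splits, batch_size=50, delay_s=1.0):
    """Add documents to a LangChain vector store in small, throttled batches."""
    for i in range(0, len(splits), batch_size):
        # Each call embeds at most `batch_size` documents, keeping individual
        # requests to the embeddings API small.
        vectorstore.add_documents(splits[i : i + batch_size])
        # Pause between batches to stay under the provider's rate limits.
        time.sleep(delay_s)
```

With this, the last line of the example would become an empty `Chroma` collection followed by `add_in_batches(vectorstore, splits)`.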

System Info

System Information

OS: Linux
OS Version: #134~20.04.1-Ubuntu SMP Tue Oct 1 15:27:33 UTC 2024
Python Version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]

Package Information

langchain_core: 0.3.14
langchain: 0.3.5
langchain_community: 0.3.3
langsmith: 0.1.138
langchain_chroma: 0.1.4
langchain_mistralai: 0.2.0
langchain_openai: 0.2.4
langchain_text_splitters: 0.3.1

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: 4.0.3
chromadb: 0.5.16
dataclasses-json: 0.6.7
fastapi: 0.112.4
httpx: 0.27.2
httpx-sse: 0.4.0
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.53.0
orjson: 3.10.10
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.0
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.36
tenacity: 9.0.0
tiktoken: 0.8.0
tokenizers: 0.20.1
typing-extensions: 4.12.2

jamesev15 commented 4 weeks ago

The KeyError: 'data' in the issue occurs due to a 429 Too Many Requests response, which prevents some documents from being processed and results in payloads that lack the data key. To reveal the actual error and improve clarity, I propose adding a check with raise_for_status() to handle cases where the status code is not 200. This will ensure that the code only attempts to access the data key when the response is successful.

keenborder786 commented 4 weeks ago

@jamesev15 please see my PR. Rather than just raising the error, I have implemented a simple retry mechanism to account for this, which works.

jamesev15 commented 4 weeks ago

Sure! I like your approach