mem0ai / mem0

The Memory layer for your AI apps
https://mem0.ai
Apache License 2.0

Bug: ValueError: Batch size 367 exceeds maximum batch size 166 #788

Closed · 2PlatesDev closed this issue 1 year ago

2PlatesDev commented 1 year ago

🐛 Describe the bug

I just installed this via pip in a PyCharm venv, and it throws an error when running the example code from the docs. The issue only appears with larger resources: if I comment out elon_musk_bot.add("https://en.wikipedia.org/wiki/Elon_Musk"), it runs perfectly fine. The same goes for other large resources I've tried to add.

Any help would be appreciated. All relevant information I could think of is below, but let me know if there is anything else I should add that could be helpful.

import os

from embedchain import App

os.environ["OPENAI_API_KEY"] = "api_key"
elon_musk_bot = App()

# Embed Online Resources
elon_musk_bot.add("https://en.wikipedia.org/wiki/Elon_Musk")
elon_musk_bot.add("https://www.forbes.com/profile/elon-musk")

response = elon_musk_bot.query("How many companies does Elon Musk run and name those?")
print(response)
# Answer: 'Elon Musk currently runs several companies. As of my knowledge, he is the CEO and lead designer of SpaceX, the CEO and product architect of Tesla, Inc., the CEO and founder of Neuralink, and the CEO and founder of The Boring Company. However, please note that this information may change over time, so it's always good to verify the latest updates.'
../embedchain_testing/venv/bin/python ../embedchain_testing/main.py
Traceback (most recent call last):
  File "../embedchain_testing/main.py", line 9, in <module>
    elon_musk_bot.add("https://en.wikipedia.org/wiki/Elon_Musk")
  File "../embedchain_testing/venv/lib/python3.9/site-packages/embedchain/embedchain.py", line 201, in add
    documents, metadatas, _ids, new_chunks = self.load_and_embed(
  File "../embedchain_testing/venv/lib/python3.9/site-packages/embedchain/embedchain.py", line 399, in load_and_embed
    self.db.add(
  File "../embedchain_testing/venv/lib/python3.9/site-packages/embedchain/vectordb/chroma.py", line 139, in add
    self.collection.add(documents=documents, metadatas=metadatas, ids=ids)
  File "../embedchain_testing/venv/lib/python3.9/site-packages/chromadb/api/models/Collection.py", line 100, in add
    self._client._add(ids, self.id, embeddings, metadatas, documents)
  File "../embedchain_testing/venv/lib/python3.9/site-packages/chromadb/api/segment.py", line 264, in _add
    validate_batch(
  File "../embedchain_testing/venv/lib/python3.9/site-packages/chromadb/api/types.py", line 377, in validate_batch
    raise ValueError(
ValueError: Batch size 367 exceeds maximum batch size 166
embedchain_testing % pip list
Package                Version
---------------------- ---------
aiofiles               23.2.1
aiohttp                3.8.6
aiosignal              1.3.1
annotated-types        0.6.0
anyio                  3.7.1
async-timeout          4.0.3
attrs                  23.1.0
backoff                2.2.1
bcrypt                 4.0.1
beautifulsoup4         4.12.2
Brotli                 1.1.0
certifi                2023.7.22
charset-normalizer     3.3.0
chroma-hnswlib         0.7.3
chromadb               0.4.14
click                  8.1.7
coloredlogs            15.0.1
dataclasses-json       0.5.14
docx2txt               0.8
duckduckgo-search      3.9.3
embedchain             0.0.67
exceptiongroup         1.1.3
fastapi                0.103.2
filelock               3.12.4
flatbuffers            23.5.26
frozenlist             1.4.0
fsspec                 2023.9.2
grpcio                 1.59.0
h11                    0.14.0
h2                     4.1.0
hpack                  4.0.0
httpcore               0.18.0
httptools              0.6.0
httpx                  0.25.0
huggingface-hub        0.17.3
humanfriendly          10.0
hyperframe             6.0.1
idna                   3.4
importlib-resources    6.1.0
langchain              0.0.279
langsmith              0.0.43
lxml                   4.9.3
marshmallow            3.20.1
monotonic              1.6
mpmath                 1.3.0
multidict              6.0.4
mypy-extensions        1.0.0
numexpr                2.8.7
numpy                  1.26.0
onnxruntime            1.16.0
openai                 0.27.10
overrides              7.4.0
packaging              23.2
pip                    23.2.1
posthog                3.0.2
protobuf               4.24.4
pulsar-client          3.3.0
pydantic               2.4.2
pydantic_core          2.10.1
pypdf                  3.16.3
PyPika                 0.48.9
python-dateutil        2.8.2
python-dotenv          1.0.0
pytube                 15.0.0
PyYAML                 6.0.1
regex                  2023.10.3
requests               2.31.0
setuptools             68.2.2
six                    1.16.0
sniffio                1.3.0
socksio                1.0.0
soupsieve              2.5
SQLAlchemy             2.0.21
starlette              0.27.0
sympy                  1.12
tenacity               8.2.3
tiktoken               0.4.0
tokenizers             0.14.1
tqdm                   4.66.1
typer                  0.9.0
typing_extensions      4.8.0
typing-inspect         0.9.0
urllib3                2.0.6
uvicorn                0.23.2
uvloop                 0.17.0
watchfiles             0.20.0
websockets             11.0.3
wheel                  0.36.2
yarl                   1.9.2
youtube-transcript-api 0.6.1
zipp                   3.17.0
PyCharm 2023.2.2 (Professional Edition)
Build #PY-232.9921.89, built on October 1, 2023
Runtime version: 17.0.8+7-b1000.22 aarch64
VM: OpenJDK 64-Bit Server VM by JetBrains s.r.o.
macOS 14.0
GC: G1 Young Generation, G1 Old Generation
Memory: 2048M
Cores: 8
Metal Rendering is ON
MacBook Pro
13-inch, M1, 2020
Chip: Apple M1
Memory: 8 GB
macOS: 14.0 (23A344)
rupeshbansal commented 1 year ago

Hey @2PlatesDev , I would strongly recommend removing your OpenAI API Key from the description above.

Do you have any preconfigured chromadb settings that might be coming into play here? It does seem like chromadb recently introduced batch-size limits, which you might be hitting: https://discord.com/channels/1073293645303795742/1156956190350245908/1156956190350245908

Nevertheless, looking at the implementation, I do think batching should be introduced in the add/get functions to handle such cases. @deshraj
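A library-side fix along the lines suggested above could split one oversized add() into several calls that each stay under the vector store's limit. The sketch below is a minimal illustration, not embedchain's actual implementation: the helper name add_in_batches is hypothetical, and the limit of 166 is hard-coded here only because that is the value in the traceback above; in practice it should be taken from the Chroma client's reported maximum rather than assumed.

```python
def add_in_batches(collection, documents, metadatas, ids, max_batch_size=166):
    """Hypothetical helper: insert documents into a Chroma collection in
    chunks no larger than max_batch_size, so a single large source (e.g.
    a 367-chunk Wikipedia page) never exceeds the store's batch limit."""
    for start in range(0, len(documents), max_batch_size):
        end = start + max_batch_size
        # Each slice is at most max_batch_size items long, so each
        # underlying add() call passes chromadb's validate_batch check.
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
```

With this in place, the failing call from the traceback (367 chunks) would be issued as three batches of 166, 166, and 35 items.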