langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Toy vectordb embedding example adapted to llama-cpp-python causes failure #12049

Closed · devzzzero closed this issue 6 months ago

devzzzero commented 10 months ago

System Info

I filed an issue with llama.cpp here: https://github.com/ggerganov/llama.cpp/issues/3689

langchain

Version: 0.0.208
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author:
Author-email:
License: MIT
Location: Work\SHARK\shark.venv\Lib\site-packages
Requires: aiohttp, dataclasses-json, langchainplus-sdk, numexpr, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity

llama-cpp-python

Version: 0.2.11
Summary: Python bindings for the llama.cpp library
Home-page:
Author:
Author-email: Andrei Betlen <abetlen@gmail.com>
License: MIT
Location: Work\SHARK\shark.venv\Lib\site-packages
Requires: diskcache, numpy, typing-extensions
Required-by:

Who can help?

No response

Reproduction

The toy code is adapted from https://learn.activeloop.ai/courses/take/langchain/multimedia/46317643-langchain-101-from-zero-to-hero. It is the first toy vector DB embedding example, the one about "Napoleon".

Here is the code to reproduce the error:

import os
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# instantiate the LLM and embeddings models
llm = LlamaCpp(model_path="llama-2-13b-chat.Q5_K_M.gguf",
               temperature=0,
               max_tokens=1000, # this was lowered from the original value of 2000, but did not fix it
               top_p=1,
               verbose=True)
embeddings = LlamaCppEmbeddings(model_path="llama-2-13b-chat.Q5_K_M.gguf")

# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<SOME_ID>"
my_activeloop_dataset_name = "langchain_llama_00"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever())

from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType

tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Useful for answering questions."
    ),
]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

response = agent.run("When was Napoleone born?")
print(response)

At the agent.run(...) call, llama-cpp reports that it is running out of memory:

ggml_allocr_alloc: not enough space in the buffer (needed 442368, largest block available 290848)
GGML_ASSERT: C:\Users\jason\AppData\Local\Temp\pip-install-4x0xr_93\llama-cpp-python_fec9a526add744f5b2436cab2e2c4c28\vendor\llama.cpp\ggml-alloc.c:173: !"not enough space in the buffer"

I don't know enough about how LlamaCppEmbeddings works to tell whether this is an error on my end or a bug in llama-cpp. Any guidance is appreciated. Thank you.
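One way to narrow this down (a minimal isolation sketch, not part of the original report; the model path and prompts are placeholders) is to call the two llama.cpp-backed components directly, outside DeepLake and the agent, and see which call hits the allocator:

# Isolation sketch: run the embedding path and the generation path separately
# to see which one triggers the ggml_allocr_alloc failure.
# Model path and inputs are placeholders.
from langchain.embeddings import LlamaCppEmbeddings
from langchain.llms import LlamaCpp

emb = LlamaCppEmbeddings(model_path="llama-2-13b-chat.Q5_K_M.gguf")
llm = LlamaCpp(model_path="llama-2-13b-chat.Q5_K_M.gguf", max_tokens=256)

# 1. Embedding path: this is what db.add_documents() calls under the hood.
vec = emb.embed_documents(["Napoleon Bonaparte was born in 15 August 1769"])
print("embedding length:", len(vec[0]))

# 2. Generation path: this is what the agent drives via RetrievalQA.
print(llm("When was Napoleon born?"))

If the embedding call succeeds but the bare llm(...) call reproduces the failure, the problem is in the generation path rather than in LlamaCppEmbeddings.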

Expected behavior

I expect it to work like the OpenAI example.

dosubot[bot] commented 10 months ago

🤖

Hello,

Thank you for providing detailed information about your issue. The error message you're seeing from llama-cpp-python suggests that the system is running out of memory. This could be due to several reasons, and I'll try to provide some guidance on how you might be able to resolve this issue.

Firstly, the size of the model you're trying to load into memory could be too large for your system to handle. If this is the case, you might want to consider using a smaller model if possible.

Secondly, the use_mlock parameter in the LlamaCppEmbeddings class could be causing this issue. If use_mlock is set to True, it forces the system to keep the model in RAM, which could potentially lead to running out of memory if the model is large and the available RAM is not sufficient. You might want to try setting use_mlock to False to see if that resolves the issue.

Lastly, the n_batch parameter, which determines the number of tokens to process in parallel, could be a factor. If this number is too high, it could lead to high memory usage. You might want to try reducing the n_batch parameter to see if that helps.

Here's how you can adjust these parameters when initializing the LlamaCppEmbeddings:

embeddings = LlamaCppEmbeddings(model_path="llama-2-13b-chat.Q5_K_M.gguf", use_mlock=False, n_batch=100)

Please note that these are just potential solutions based on the information provided, and the actual solution might be different depending on the specific circumstances. If none of these suggestions resolve the issue, could you please provide more information about your system's specifications (e.g., RAM size) and the exact conditions under which the error occurs?

I hope this helps, and please let me know if you have any further questions.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

devzzzero commented 10 months ago

@dosu-bot

I tried creating the embedding with use_mlock=False and n_batch=100, but it fails the same way. As a matter of fact, it's the exact same error (needed 442368, largest block available 290848).

dosubot[bot] commented 10 months ago

🤖

Hello,

Thank you for reaching out and providing a detailed description of your issue. The error message you're encountering suggests that the system is running out of memory. This could be due to the size of the Llama model you're trying to load into memory exceeding the available RAM.

Here are a few suggestions that might help you resolve this issue:

  1. Reduce the size of the model: If you're training your own model, consider using a smaller model architecture or reducing the size of your training data.

  2. Increase available memory: If possible, run your application on a machine with more RAM.

  3. Use model sharding: If your model is too large to fit into memory all at once, you can split it into smaller parts using the n_parts parameter. This will load only a part of the model into memory at a time.

  4. Use half-precision for key/value cache: You can set the f16_kv parameter to True to use half-precision for the key/value cache, which can reduce memory usage.

  5. Load only the vocabulary: If you only need the vocabulary and not the weights, you can set the vocab_only parameter to True to load only the vocabulary, which can significantly reduce memory usage.

Here's an example of how you can adjust these parameters:

from langchain.embeddings import LlamaCppEmbeddings
llama = LlamaCppEmbeddings(
    model_path="llama-2-13b-chat.Q5_K_M.gguf",
    n_parts=2,  # split the model into 2 parts
    f16_kv=True,  # use half-precision for key/value cache
    vocab_only=True  # only load the vocabulary, no weights
)

Remember to adjust these parameters according to your specific needs and available resources. You can find more details about these parameters in the LlamaCppEmbeddings class.

Regarding the use_mlock parameter, it is a boolean field that, when set to True, forces the system to keep the model in RAM. This can lead to faster access times as the model does not need to be loaded from disk every time it is used. However, it also means that the model will consume memory even when it is not in use, which can lead to higher overall memory usage. If your system has limited RAM or is running multiple memory-intensive tasks, setting use_mlock to True could potentially lead to out-of-memory errors.
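As a rough sketch only (the numeric values below are arbitrary placeholders, not tested against this model, and would need tuning for the machine at hand), these memory-related settings can be lowered on both llama.cpp-backed components at once:

# Illustrative sketch, not a confirmed fix: avoid pinning the model in RAM and
# shrink the context window and batch size for both the embeddings and the LLM.
from langchain.embeddings import LlamaCppEmbeddings
from langchain.llms import LlamaCpp

embeddings = LlamaCppEmbeddings(
    model_path="llama-2-13b-chat.Q5_K_M.gguf",
    use_mlock=False,   # do not lock the model into RAM
    n_ctx=512,         # smaller context window, smaller scratch buffers
    n_batch=8,         # fewer tokens processed per batch
)
llm = LlamaCpp(
    model_path="llama-2-13b-chat.Q5_K_M.gguf",
    use_mlock=False,
    n_ctx=512,
    n_batch=8,
    max_tokens=256,
)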

I hope this helps! Please let me know if you have any other questions or if the issue persists.

Best, Dosu


dosubot[bot] commented 6 months ago

Hi, @devzzzero,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you reported a memory error when attempting to adapt a toy vectordb embedding example to llama-cpp-python. Dosu-bot provided detailed guidance on potential solutions, including adjusting parameters, reducing model size, using model sharding, and using half-precision for the key/value cache to address the memory issue. It seems the suggested changes were attempted, but the error persisted. Bet0x also reacted with a thumbs down to Dosu-bot's suggestions.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!