mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed, P2P inference
https://localai.io
MIT License

Memory pool exceeded #346

Open · obito opened this issue 1 year ago

obito commented 1 year ago

LocalAI version: Latest

Environment, CPU architecture, OS, and Version: Darwin macbook-2.local 22.4.0 Darwin Kernel Version 22.4.0: Mon Mar 6 21:00:41 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T8103 arm64 (M1 Macbook Air)

Describe the bug
Trying to use LlamaIndex to construct a vector index of a document (a PDF in my case) so I can use it as a query engine. I'm getting this error from LocalAI:

ggml_bert_new_tensor_impl: not enough space in the context's memory pool (needed 271388624, available 260703040)

To Reproduce

Expected behavior

Logs
Too long to include in the issue; here is a link to the txt: https://hastebin.com/share/ukuvaranaw.rust

Additional context
I have 16 GB of RAM. Is there a way to avoid this error without having to wait 30 minutes for the embeddings to finish?

This is my Python code:

import os
from pathlib import Path

from langchain.chat_models import ChatOpenAI
from llama_index import (Document, GPTVectorStoreIndex, LLMPredictor,
                         PromptHelper, ServiceContext, SimpleDirectoryReader,
                         StorageContext, download_loader,
                         load_index_from_storage)
from llama_index.langchain_helpers.agents import (IndexToolConfig,
                                                  LlamaIndexTool)

os.environ['OPENAI_API_KEY'] = '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'

# Download the PDFReader loader from LlamaHub
PDFReader = download_loader("PDFReader")

# Configure prompt parameters and initialise helper
max_input_size = 400
num_output = 400
max_chunk_overlap = 30

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", openai_api_base="http://localhost:8080/v1"))

loader = PDFReader()

# Load the data from a single PDF file
documents = loader.load_data(file=Path('./data/convention.pdf'))

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)

# Construct a simple vector index
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

# rebuild storage context
#storage_context = StorageContext.from_defaults(persist_dir='./storage')
#index = load_index_from_storage(storage_context, service_context=service_context)

# Save the index to disk
index.storage_context.persist()

query_engine = index.as_query_engine()

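# English translation of the query below: "Am I taxable on the income of my
# company, which is based in Dubai, while being a French tax resident and
# having only French clients?"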
response = query_engine.query('Est-ce que je suis imposable sur les revenus de ma société qui réside à Dubai en étant résident fiscal français et ayant que des clients français?')

timo-klarshift commented 1 year ago

@obito Did you just post your openai key here?

obito commented 1 year ago

> @obito Did you just post your openai key here?

It was deleted a long time ago.

localai-bot commented 1 year ago

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI (a crazy experiment of @mudler) - please beware that I might hallucinate sometimes!

But... I can also be funny or helpful :smile_cat:, and I can generally provide good tips or point you to places to look in the documentation or in the code, based on what you wrote in the issue.

Don't engage in conversation with me; I don't support replying (yet)!

:warning::warning::warning::warning::warning:

The error you are getting is due to a lack of memory in the context. This can happen if the model is too big or if there is not enough memory available on your machine. To fix this, you can try reducing the batch size or increasing the amount of memory available on your machine. To reduce the batch size, lower the max_input_size parameter in the prompt helper; for example, setting max_input_size=200 will only process 200 tokens at a time instead of the 400 currently configured in your script. To increase the amount of memory available, you can raise the value of CONTEXT_SIZE in the LocalAI configuration or run with more threads (THREADS). However, this may also require upgrading your hardware to handle the increased memory usage.

Sources:
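For illustration, here is a minimal sketch of the bot's first suggestion (sending smaller chunks to the embeddings backend), written against the same llama_index-era API as the script above. The specific values, and the chunk_size_limit keyword (renamed chunk_size in later llama_index releases), are assumptions rather than settings taken from the issue:

```python
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, PromptHelper, ServiceContext

# Point the chat model at the local LocalAI endpoint, as in the original script.
llm_predictor = LLMPredictor(
    llm=ChatOpenAI(temperature=0,
                   model_name="gpt-3.5-turbo",
                   openai_api_base="http://localhost:8080/v1"))

# Smaller prompt/chunk sizes mean each request carries less text, so the
# ggml memory pool is less likely to overflow (at the cost of more requests).
max_input_size = 256    # example value, down from 400 in the original script
num_output = 64
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# chunk_size_limit caps how large the chunks sent for embedding can be.
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
    chunk_size_limit=256,
)
```

The rest of the script (loading the PDF and calling GPTVectorStoreIndex.from_documents with this service_context) stays the same; indexing will then make more, smaller embedding requests.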
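On the LocalAI side, the bot's CONTEXT_SIZE/THREADS suggestion can also be expressed per model. Below is a minimal sketch of a model config for the bert-embeddings backend (the one emitting the ggml_bert error); the file name, model file, and values are placeholders, not taken from the issue, and whether a larger context actually avoids this particular pool overflow depends on the backend:

```yaml
# models/text-embedding-ada-002.yaml (placeholder file name)
name: text-embedding-ada-002
backend: bert-embeddings
embeddings: true
context_size: 1024   # larger context window; may give the memory pool more headroom
threads: 4           # CPU threads used by the backend
parameters:
  model: bert-MiniLM-L6-v2q4_0.bin   # placeholder embedding model file
```

The same knobs are also exposed, in this era of LocalAI, as the CONTEXT_SIZE and THREADS environment variables when starting the server.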