run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Successful but Extremely Slow Response Time from LLM #9225

Closed grabani closed 7 months ago

grabani commented 7 months ago

Question Validation

Question

I am able to successfully receive a response running the following code; however, the response takes 4m 44.9s. When I use the same embeddings database but point it directly at OpenAI, the response is instant. Any ideas why the local LLM is taking so long?

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAI
from IPython.display import Markdown, display
import chromadb

# define embedding function

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Load LLM

from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    #model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=r"C:\Users\Home\.cache\lm-studio\models\TheBloke\Mistral-7B-Instruct-v0.1-GGUF\mistral-7b-instruct-v0.1.Q6_K.gguf",
    temperature=0.1,
    max_new_tokens=4048,
    # Mistral-7B-Instruct v0.1 supports a longer context than llama2's 4096 tokens
    context_window=8000,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    #model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

# load embeddings from disk

db2 = chromadb.PersistentClient(path="./data/jhoward_L4/chroma_db")

chroma_collection = db2.get_or_create_collection("DB_collection")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model,
                                               chunk_size=100,
                                               chunk_overlap=20)

index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
)
# Response synthesis
query_engine = index.as_query_engine()

response = query_engine.query("What is lesson about?")

display(Markdown(f"{response}"))

dosubot[bot] commented 7 months ago

🤖

Hello @grabani,

I'm Dosu, a friendly bot here to assist you with your queries, help solve bugs, and guide you towards becoming an effective contributor to LlamaIndex. While you're waiting for a human maintainer, I'm here to support you. Let's get started, shall we?

Based on the information provided, there are a few potential reasons why your local LLM is taking significantly longer to respond compared to when using the same embeddings database but directed to OpenAI.

Firstly, the LLM model you're using could be a factor. The default model (text-davinci-003) is slower than others like gpt-3.5. You might want to consider using a different LLM model to improve the query time. You can find more information about this in the solved issue: Reaction is very slow, is there any params to reduce reaction time?
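
A minimal sketch of swapping in a faster hosted model, assuming an OpenAI API key is configured (gpt-3.5-turbo here is just an example):

from llama_index import ServiceContext
from llama_index.llms import OpenAI

# pass a faster hosted model into the service context
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
)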

Secondly, the version of LlamaIndex you're using could also be affecting the response time. Newer versions of LlamaIndex have a smaller default chunk size (1024 vs. 3900 in older versions), which makes them faster. You're currently using version 0.9.9, and there might be updates available that could improve performance. You can check the LlamaIndex repository for the latest version.
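
For example, the chunk size can be pinned explicitly when building the service context (values here are illustrative):

from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    chunk_size=1024,   # the newer default; older releases used 3900
    chunk_overlap=20,
)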

Lastly, the local LLM might be taking longer to respond due to the fallback to synchronous methods when asynchronous methods are not implemented. As seen in the _aget_query_embedding and _aget_text_embedding methods, if the aembed_query or aembed_documents methods raise a NotImplementedError, the code falls back to the synchronous _get_query_embedding or _get_text_embedding methods. This fallback can cause a significant slowdown, especially when dealing with large amounts of data, as synchronous methods do not allow for concurrent processing. This means that each query or document must be processed one at a time, rather than processing multiple queries or documents at once. You can find more information about this in the llama_index/embeddings/langchain.py file.
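
A simplified sketch of the fallback pattern described above (illustrative only, not the exact library code; embed_model is a hypothetical embedding object):

from typing import List

async def aget_query_embedding(embed_model, query: str) -> List[float]:
    # try the asynchronous embedding path first
    try:
        return await embed_model.aembed_query(query)
    except NotImplementedError:
        # fall back to the synchronous path, so queries are processed one at a time
        return embed_model.embed_query(query)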

I hope this helps! If you have any other questions or need further clarification, feel free to ask.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

logan-markewich commented 7 months ago

Your LlamaCPP runs on CPU. You can also compile llama-cpp-python to run on GPU, but even then, it will be much slower than OpenAI.

If you want a faster local LLM, you need a good GPU and probably need to use an optimized serving stack like vLLM or HuggingFace text-generation-inference.
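
For reference, GPU offload is the knob already commented out in the original snippet; assuming llama-cpp-python was compiled with GPU support, a sketch (the smaller max_new_tokens and context_window are illustrative choices, not requirements):

llm = LlamaCPP(
    model_path=r"C:\Users\Home\.cache\lm-studio\models\TheBloke\Mistral-7B-Instruct-v0.1-GGUF\mistral-7b-instruct-v0.1.Q6_K.gguf",
    temperature=0.1,
    max_new_tokens=256,   # generating fewer tokens also cuts wall-clock time
    context_window=3900,  # a smaller window keeps prompts shorter
    # offload layers to the GPU; -1 typically offloads all of them in recent builds
    model_kwargs={"n_gpu_layers": -1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)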

rawkintrevo commented 6 months ago

I'm hitting similar issues. Yes, CPU is much slower than a local GPU, which is much slower than an optimized service, but LlamaCPP on my colab will respond sufficiently in ~30s, while llama-index takes 5-35m. Debugging isn't useful either :( I would consider this a bug.
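
One way to separate raw generation time from the rest of the pipeline is to time the bare LLM call and the full query side by side; a rough sketch, reusing the llm and query_engine objects from the code in the original post:

import time

start = time.time()
llm.complete("What is lesson about?")          # raw LlamaCPP generation only
print(f"LLM only: {time.time() - start:.1f}s")

start = time.time()
query_engine.query("What is lesson about?")    # retrieval + prompt building + generation
print(f"Full query: {time.time() - start:.1f}s")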

rawkintrevo commented 6 months ago

"Debugging isn't useful either" is not a useful gripe on its own, so here is a screenshot of the debugging output: it took 25m on the first run and 4m on the second run.

[screenshot: debugging output timings]

rawkintrevo commented 6 months ago

LlamaCPP standalone: 38s

[screenshot: standalone LlamaCPP timing]

rawkintrevo commented 6 months ago

Further, @logan-markewich, the issue seems to be something happening in the background. Once it starts going, it works as expected; it's just that it takes 24m/3m for it to get things together and start printing output.

rawkintrevo commented 6 months ago

Now, this is more for the next person on the search (i.e. me next week) than part of a bug report: when moving from a streaming to a normal query, the debugging output is more useful.

[screenshot: debug output for a non-streaming query]

rawkintrevo commented 6 months ago

Is there any way to turn up the fidelity on the logs and find out what is taking so long in _llm?

logan-markewich commented 6 months ago

Not really tbh

In your example of using LlamaCPP standalone, you are only passing in a very short prompt and not generating very many tokens.

You can check out the source code here to compare or investigate. But LlamaIndex isn't really doing anything special, just calling the LLM with possibly large inputs: https://github.com/run-llama/llama_index/blob/e3591ba9d856168273e9f74432829ae1718f8a5b/llama_index/llms/llama_cpp.py#L96
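
One quick way to see how much text is being stuffed into each LLM call is to run the retriever on its own; a sketch reusing the index from the original post (the top_k value is illustrative):

retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("What is lesson about?")
# total characters of retrieved context that will be packed into the prompt
print(sum(len(n.node.get_content()) for n in nodes), "characters of retrieved context")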

logan-markewich commented 6 months ago

You can also enable logging to see every LLM input and output: https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#simple-llm-inputs-outputs
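
Per that page, the simple handler is a one-liner (a sketch):

import llama_index

# prints every LLM prompt and completion to stdout
llama_index.set_global_handler("simple")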

rawkintrevo commented 6 months ago

OK, I played with it some more when I got home. Thanks so much for the pointers and the reply, @logan-markewich.

I got it to a point where it's at least usable for my use case. Some tips for anyone else who happens this way:

  1. Make super baby chunk sizes via a TextSplitter.
  2. Be prepared to tinker with the response synthesizer; at least grok the ResponseModes.
  3. Keep your top_k as low as you dare.
  4. It will still be a dog compared to OpenAI or some fancy service with GPUs (or GPUs on your own machine), but if you have some time to learn it, and not just copy-paste a recipe, you can make this work in a Cloud Function, and by that I mean getting responses in < 4 min. (A rough sketch of these settings follows this list.)
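
A rough sketch of tips 1-3 in code; module paths and values are illustrative for llama-index 0.9.x, and the llm, embed_model, and vector_store objects are reused from the original post:

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.text_splitter import SentenceSplitter

# tip 1: very small chunks keep the context stuffed into each LLM call short
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    text_splitter=SentenceSplitter(chunk_size=128, chunk_overlap=10),
)

index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)

# tips 2 and 3: a compact response mode and a small top_k mean fewer, shorter LLM calls
query_engine = index.as_query_engine(
    similarity_top_k=1,
    response_mode="compact",
)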

@logan-markewich, if y'all would take a drive-by commit of docs on hypermiling, e.g. running llama-index effectively on CPU only, I'd be happy to write something up; let me know where to drop it. If not, here is a (sloppy/hacked-up) colab of my adventures, designed to help the next person.

Also, in either event, great work and thanks so much for such a great product!

https://colab.research.google.com/drive/1VXcD9YBLCap7CT23V8qF-UCm0h9wIlrB?usp=sharing

grabani commented 6 months ago

Sharing my test results and thoughts:

Notes:

1) I used @rawkintrevo's colab.
2) I used the OpenAIEmbedding embeddings model and the mistral-7b-instruct-v0.1.Q6_K.gguf LLM model.
3) I retrieved my data (index) from an existing ChromaDB collection. Note that when the index was created it included the ServiceContext parameter node_parser=sentence_node_parser.
4) I added token counting handlers:

from llama_index.callbacks import CallbackManager, LlamaDebugHandler, TokenCountingHandler

token_counter = TokenCountingHandler(
    verbose=True,  # set to True to see usage printed to the console
)

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug, token_counter])
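
For the handlers to fire, the callback manager also needs to be attached to the service context; a minimal sketch, following the usual 0.9.x pattern and reusing the llm and embed_model objects:

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    callback_manager=callback_manager,
)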

5) My results: [screenshot: query results]

6) This is the output of the token counting handler:

[screenshot: token counter output]

7) The overall time for the two LLM queries:

[screenshot: overall query timings]

8) Then, as a comparison, I used LM Studio with the same LLM model as in the colab and set up my system prompt (per the details captured in point 6).

9) My results on LM Studio directly were:

[screenshot: LM Studio timings]

I am running Windows 11, an Intel i5 (12th Gen), 16 GB RAM, and an Intel Iris(R) GPU.

CONCLUSION

The time taken to run the query via LlamaIndex was NOT significantly longer than when run using LM Studio. It seems that LlamaIndex itself does not cause unnecessary delay; as can be seen from the CBEvent times, the pre-LLM stages are insignificant.