Closed — grabani closed this issue 7 months ago
🤖
Hello @grabani,
I'm Dosu, a friendly bot here to assist you with your queries, help solve bugs, and guide you towards becoming an effective contributor to LlamaIndex. While you're waiting for a human maintainer, I'm here to support you. Let's get started, shall we?
Based on the information provided, there are a few potential reasons why your local LLM is taking significantly longer to respond compared to when using the same embeddings database but directed to OpenAI.
Firstly, the LLM model you're using could be a factor. The default model (text-davinci-003) is slower than others like gpt-3.5. You might want to consider using a different LLM model to improve the query time. You can find more information about this in the solved issue: Reaction is very slow, is there any params to reduce reaction time?
Secondly, the version of LlamaIndex you're using could also be affecting the response time. Newer versions of LlamaIndex have a smaller default chunk size (1024 vs. 3900 in older versions), which makes them faster. You're currently using version 0.9.9, and there might be updates available that could improve performance. You can check the LlamaIndex repository for the latest version.
Lastly, the local LLM might be taking longer to respond due to the fallback to synchronous methods when asynchronous methods are not implemented. As seen in the `_aget_query_embedding` and `_aget_text_embedding` methods, if the `aembed_query` or `aembed_documents` methods raise a `NotImplementedError`, the code falls back to the synchronous `_get_query_embedding` or `_get_text_embedding` methods. This fallback can cause a significant slowdown, especially when dealing with large amounts of data, as synchronous methods do not allow for concurrent processing: each query or document must be processed one at a time rather than many at once. You can find more information about this in the llama_index/embeddings/langchain.py file.
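The fallback pattern described above can be sketched in plain Python. This is a simplified illustration, not the actual LlamaIndex source; the class and placeholder embedding here are hypothetical:

```python
import asyncio


class SketchEmbedding:
    """Simplified sketch of the sync-fallback pattern described above."""

    def _get_query_embedding(self, query: str) -> list:
        # Synchronous path: blocks the event loop, so concurrent
        # queries end up being processed one at a time.
        return [0.0, 0.0, 0.0]  # placeholder embedding

    async def _aembed_query(self, query: str) -> list:
        # A backend with no real async implementation signals it like this.
        raise NotImplementedError

    async def _aget_query_embedding(self, query: str) -> list:
        try:
            return await self._aembed_query(query)
        except NotImplementedError:
            # Fallback: no concurrency benefit, which is the slowdown
            # described in the comment above.
            return self._get_query_embedding(query)


print(asyncio.run(SketchEmbedding()._aget_query_embedding("hi")))  # → [0.0, 0.0, 0.0]
```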
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Your LlamaCPP runs on CPU. You can also compile llama-cpp-python to run on GPU, but even then, it will be much slower than OpenAI.
If you want a faster local LLM, you need a good GPU, and you probably need an optimized serving stack like vLLM or HuggingFace text-generation-inference.
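For reference, recompiling llama-cpp-python with GPU support is done by passing CMake flags at install time. The exact flag names depend on the llama-cpp-python version and GPU backend (this one is the CUDA/cuBLAS option from around the time of this thread; check the project README for your version):

```shell
# Reinstall llama-cpp-python with the CUDA (cuBLAS) backend enabled.
# Flag names vary by version/backend; consult the llama-cpp-python README.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```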
I'm hitting similar issues. Yes, CPU is much slower than a local GPU, which is much slower than an optimized service, but LlamaCPP standalone on my colab responds in ~30s, while llama-index takes 5-35m. Debugging isn't useful either :( I would consider this a bug.
> Debugging isn't useful either

is not a useful gripe, so here is a screenshot of the debugging output: it took 25m on the first run, 4m on the second.
LlamaCPP standalone: 38s
Further @logan-markewich, the issue seems to be something happening in the background. Once it starts going, it works as expected; it just takes 24m/3m for it to get things together and start printing output.
Now, this is more for the next person on the search (i.e., me next week) than part of a bug report: when moving from streaming to a normal query, the debugging output is more useful.
Is there any way to turn up the fidelity of the logs and find out what is taking so long in `_llm`?
Not really tbh
In your example of using llamacpp standalone, you are only passing in a very short prompt, and not generating very many tokens.
You can check out the source code here to compare or investigate, but LlamaIndex isn't really doing anything special; it's just calling the LLM with possibly large inputs: https://github.com/run-llama/llama_index/blob/e3591ba9d856168273e9f74432829ae1718f8a5b/llama_index/llms/llama_cpp.py#L96
You can also enable logging to see every llm input and output https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#simple-llm-inputs-outputs
OK, I played with it some more when I got home. Thanks so much for the pointers and the reply, @logan-markewich.
I got it to a point where it's at least usable for my usecase. Some tips for anyone else who happens this way:
- `TextSplitter`
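On the text-splitter tip: the main CPU-only lever is chunk size, since it determines how many embedding/LLM calls a document turns into. A rough back-of-the-envelope sketch (the document token count is a made-up number, not a measurement; 1024 and 3900 are the default chunk sizes mentioned earlier in the thread):

```python
import math

# Hypothetical document of ~15,600 tokens: see how chunk size changes
# the number of chunks (and hence embedding / refine-step calls).
doc_tokens = 15_600
for chunk_size in (512, 1024, 3900):
    n_chunks = math.ceil(doc_tokens / chunk_size)
    print(f"chunk_size={chunk_size}: {n_chunks} chunks")
```

Fewer, larger chunks mean fewer round trips through a slow CPU-bound LLM, at the cost of larger prompts per call.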
@logan-markewich if y'all would take a drive-by commit of docs on hypermiling, e.g. running llama-index effectively on CPU only, I'd be happy to write something up; let me know where to drop it. If not, here is a (sloppy/hacked-up) colab of my adventures, designed to help the next person.
Also, in either event, great work and thanks so much for such a great product!
https://colab.research.google.com/drive/1VXcD9YBLCap7CT23V8qF-UCm0h9wIlrB?usp=sharing
Sharing my test results and thoughts:
Note:
1) I used @rawkintrevo's colab.
2) Used the `OpenAIEmbedding` embeddings model and the `mistral-7b-instruct-v0.1.Q6_K.gguf` LLM model.
3) Retrieved my data (index) from an existing ChromaDB collection. Note that when the index was created, it included the `ServiceContext` parameter `node_parser=sentence_node_parser`.
4) Added token counting handling:

```python
from llama_index.callbacks import (
    CallbackManager,
    LlamaDebugHandler,
    TokenCountingHandler,
)

token_counter = TokenCountingHandler(
    verbose=True,  # set to True to see usage printed to the console
)
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug, token_counter])
```
5) My results:
6) This is the output of the counter handling:
7) The overall time for the two LLM queries is:
8) Then, as a comparison, I used LM Studio with the same LLM model as in the colab, and set up my system prompt (as per the details captured in point #6).
9) My results on LM Studio directly were:
I am running Windows 11, Intel i5 (12th Gen), 16 GB RAM, GPU (Intel Iris(R)).
CONCLUSION
The time taken to run the query via LlamaIndex was NOT significantly longer than when run using LM Studio. It seems that LlamaIndex itself does not cause unnecessary delay. As can be seen from the `CBEvent` times, the pre-LLM times are insignificant.
Question Validation
Question
I am able to successfully receive a response running the following code; however, the response takes 4m 44.9s. When I use the same embeddings database but direct to OpenAI, it is instant. Any ideas why the local LLM is taking so long?