Hey @lambda7xx! Great to see you here again. 😊 I'm diving into your issue and will get back to you with a thorough answer shortly! 🚀
Also, `from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler` does not work, even though it follows the doc: https://docs.llamaindex.ai/en/stable/examples/callbacks/LlamaDebugHandler/
The async method might not be faster than the sync method due to the nature of the tasks being performed. Async methods are beneficial when there are IO-bound tasks, such as network requests or file operations, where the program has to wait for the operation to complete. However, if the tasks are CPU-bound, like in most data processing tasks, async methods might not provide a speedup and can sometimes even be slower due to the overhead of context switching. In your case, it seems like the tasks are CPU-bound, which might be why you're not seeing a speedup with the async method.
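As a minimal illustration of this point (nothing here is from your code; `busy_work` is a hypothetical stand-in for a CPU-bound step such as retrieval):

```python
import asyncio
import time

def busy_work(n: int) -> int:
    # CPU-bound stand-in: no IO, so there is nothing to await.
    return sum(i * i for i in range(n))

async def async_busy_work(n: int) -> int:
    # No real await point inside: this blocks the event loop.
    return busy_work(n)

async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(async_busy_work(5_000_000) for _ in range(4)))
    print(f"async (CPU-bound): {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    for _ in range(4):
        busy_work(5_000_000)
    print(f"sync  (CPU-bound): {time.perf_counter() - start:.2f}s")
    # Both timings come out roughly the same, matching your observation.

asyncio.run(main())
```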
Regarding the warnings related to the `do_sample`, `temperature`, and `top_p` settings, these are due to the way the `HuggingFaceLLM` class is being instantiated. To resolve these warnings, you should set `do_sample=True` in the `generate_kwargs` dictionary when instantiating the `HuggingFaceLLM` class.
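For example, a minimal sketch (the model name and sampling values here are placeholders, not taken from your code):

```python
from llama_index.llms.huggingface import HuggingFaceLLM

# Sketch only: model name and sampling values are placeholders.
# Setting do_sample=True alongside temperature/top_p in generate_kwargs
# avoids the transformers warnings about sampling settings being ignored.
llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    device_map="auto",
)
```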
For the import issue with `CallbackManager` and `LlamaDebugHandler`, it might be due to a version mismatch. The classes might not be available in the version of the LlamaIndex library you're using. Check the version of the library and update it if necessary. If the issue persists, it might be a bug in the library, and reporting it to the maintainers would be the next step.
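As a quick check, something along these lines (a sketch using the standard `importlib.metadata` API) prints the installed version before attempting the import:

```python
from importlib.metadata import PackageNotFoundError, version

# The llama_index.core.* import paths require the v0.10+ package layout.
try:
    print("llama-index-core:", version("llama-index-core"))
except PackageNotFoundError:
    print("llama-index-core is not installed")

# If this import still fails on an up-to-date install, filing a bug report
# with the maintainers is the next step.
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
```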
I used nsys to profile llama-index, and it seems retrieval is on the CPU side while the LLM call is on the GPU side. Are there other things that are on the CPU side?
@lambda7xx when running a model locally like you are, there is no advantage to async, since it is all compute-bound. Async only makes sense for:

a) running LLMs over an API, so that responses can be properly awaited
b) running multiple requests at once (async on its own does not speed things up, but allows you to run things concurrently)
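To make (b) concrete, a minimal sketch (the index construction is assumed; `aquery` is the async counterpart of `query` on LlamaIndex query engines):

```python
import asyncio

# Sketch: `index` is assumed to be an already-built index (e.g. a
# VectorStoreIndex). Gathering several aquery() calls runs them
# concurrently, which only pays off when the LLM call is awaitable IO
# (an API-backed LLM), not a local blocking model.
async def run_concurrently(index, questions):
    query_engine = index.as_query_engine()
    return await asyncio.gather(
        *(query_engine.aquery(q) for q in questions)
    )

# responses = asyncio.run(run_concurrently(index, ["q1", "q2", "q3"]))
```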
@logan-markewich hi, if I want to run multiple requests at once to test throughput, how should I modify my code? I want to deploy LlamaIndex as a webserver (https://github.com/run-llama/llama_index/issues/12396) and run multiple requests at once.
Hi @lambda7xx
The similar timings between async and sync methods are expected in CPU-bound tasks like yours, where async doesn't inherently speed up the process but allows for concurrent IO operations. Your profiling is correct: retrieval is CPU-bound and LLM calls are GPU-bound, which explains the observed performance. For CPU-bound tasks, consider parallel processing to enhance efficiency.
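For the webserver part of your question, a sketch along these lines could work (FastAPI is an assumption on my part, not something the thread prescribes; the toy index uses the default embedding/LLM settings, so substitute your local models):

```python
from fastapi import FastAPI
from llama_index.core import Document, VectorStoreIndex

app = FastAPI()

# Sketch: build a toy index once at startup; in a real deployment, load
# your own data and configure your local LLM/embedding models instead.
index = VectorStoreIndex.from_documents([Document(text="hello world")])
query_engine = index.as_query_engine()

@app.get("/query")
async def query(q: str):
    # aquery() yields control while awaiting the LLM, so an ASGI server
    # such as uvicorn can keep many requests in flight at once.
    response = await query_engine.aquery(q)
    return {"answer": str(response)}

# Run with, e.g.:  uvicorn app:app
# Then measure throughput by firing concurrent requests at /query.
```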
I would also consider deploying your models as dedicated servers (TEI, TGI, vLLM, etc.)
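For example, if the model is served behind vLLM's OpenAI-compatible endpoint, a sketch like this (the `llama-index-llms-openai-like` integration, model name, and URL are assumptions on my part) points LlamaIndex at it; the LLM call then becomes awaitable network IO, which is exactly where async helps:

```python
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike

# Sketch: assumes a vLLM server exposing its OpenAI-compatible API at
# localhost:8000; the model name below is a placeholder.
Settings.llm = OpenAILike(
    model="meta-llama/Llama-2-7b-chat-hf",
    api_base="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM does not check the key by default
    is_chat_model=True,
)
```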
Question
According to the async doc, async can give a 2x speedup. My code is below.
My log is below.
The time taken by async and sync is the same.