Closed: suraj-gade closed this issue 1 year ago
If you run
llm('Test string')
directly, does it run on the CPU or the GPU?
Just trying to understand whether this is an issue inside LlamaIndex or a configuration problem with the HuggingFacePipeline.
Hi @Disiok, thanks for the response.
Executing llm('Test string') directly also uses only the CPU.
Sounds like a configuration problem with Hugging Face tbh. Going to close this for now. Feel free to re-open if you can confirm the model is actually running on the GPU before handing it to llama-index.
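For reference, a minimal way to confirm that before handing the pipeline to llama-index (a sketch only; the model id and generation settings are assumptions, using the langchain HuggingFacePipeline wrapper):

```python
# Sketch: verify the HF model is actually on the GPU before llama-index sees it.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline

model_name = "lmsys/fastchat-t5-3b-v1.0"  # assumed checkpoint from the issue
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto")

# If this prints "cpu", the weights never reached the GPU and llama-index
# cannot change that.
print(next(model.parameters()).device)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=pipe)

# Watch nvidia-smi while this runs; GPU utilisation should spike if the
# model is really executing on the GPU.
print(llm("Test string"))
```

If this direct call already leaves the GPU idle, the problem lies in the transformers/pipeline configuration rather than in llama-index.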
Hi,
I am building a chatbot using an LLM such as fastchat-t5-3b-v1.0 and want to reduce my inference time.
I am loading the entire model onto the GPU using the device_map parameter, and I am querying the model through the Hugging Face pipeline, with device=0 (the first GPU) specified for the pipeline as well. I am monitoring GPU and CPU usage throughout the entire execution, and I can see that although the model is on the GPU, querying it makes use of the CPU; the spike in CPU usage shows that query execution is happening on the CPU.
Below is the code I am using to run inference with the FastChat LLM.
Here the “data” folder contains my full input text in PDF format. I am using GPTVectorStoreIndex with the Hugging Face pipeline to build the index over it, fetch the relevant chunks, build the prompt with context, and query the FastChat model, as shown in the code.
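In outline, the setup looks like the following (the model id, generation settings, and index parameters shown here are placeholders rather than the exact original values):

```python
# Outline: load FastChat-T5 on the GPU, wrap it in a HF pipeline,
# and build a GPTVectorStoreIndex over the "data" folder.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, LLMPredictor, ServiceContext

model_name = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",          # place the weights on the GPU
    torch_dtype=torch.float16,  # assumption: fp16 so the 3B model fits comfortably
)

# device_map already placed the model, so no extra device argument is passed here.
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
llm = HuggingFacePipeline(pipeline=pipe)

# Wire the pipeline into llama-index (embedding model left at its default here).
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

documents = SimpleDirectoryReader("data").load_data()  # folder with the PDF input
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("Example question about the document")
print(response)
```

Note that in some transformers versions, passing both device_map at load time and an explicit device=0 to pipeline() can conflict, which is why only device_map is used in this sketch.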
Please have a look and let me know whether this is the expected behaviour. How can I make query execution use the GPU as well, to reduce the inference response time?