turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

GPU Usage Keeps High Even Without Inference Load #253

Open leonxia1018 opened 1 year ago

leonxia1018 commented 1 year ago

Configuration: AMD W7900 + ROCm 5.6 [screenshot]

Running the model on oobabooga/text-generation-webui, GPU usage stays high even after unloading the model. Model: TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True

Running meta-llama/Llama-2-7b-chat-hf without quantization does not have this issue.

Is this expected behavior?

turboderp commented 1 year ago

It's not expected, no. I have no explanation for it. Are you able to generate anything while the GPU is in this state?

leonxia1018 commented 1 year ago


[screenshot]

After I unload the model, 1% of the VRAM is still in use. In this state I am able to load another model and run inference, but the GPU usage stays high. Only killing the WebUI process fixes the issue.
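For reference, a minimal cleanup sketch of the kind an unload step is normally expected to do (not the webui's actual unload code; `model` is just a placeholder for the loaded model object):

```python
import gc
import torch

# Drop the last Python reference to the model, then force a GC pass and
# clear PyTorch's caching allocator so the VRAM is actually returned.
del model
gc.collect()
torch.cuda.empty_cache()

# On ROCm builds the torch.cuda namespace still applies.
print(f"{torch.cuda.memory_allocated() / 1024**2:.1f} MiB still allocated")
```

If the allocated number drops but utilization stays high, whatever is keeping the GPU busy is not the model's weights.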

turboderp commented 1 year ago

It must be a ROCm issue of some sort, because there's nothing running in the background, no threads or anything. There's the asynchronous device queue, but the host code synchronizes at multiple points and shouldn't be able to run at all while there are kernels still running.

Does that management interface provide any sort of additional insight into what might be running on the GPU? I.e. is it a rogue kernel, the runtime stuck in a loop trying to clean up corrupted memory, or... idk. I would strongly suspect it's ROCm specific.
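One hedged way to see that from inside the process (not something either project ships) is to profile a window in which nothing is requested; on ROCm builds of PyTorch the CUDA activity should also capture HIP kernels:

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

# Record everything the process launches during five "idle" seconds.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    time.sleep(5)

# Any elementwise or other kernels showing up here were launched by a live
# thread in this process, not left over on the device.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```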

leonxia1018 commented 1 year ago


[screenshot]

It seems like something keeps launching kernels in the loading process.

turboderp commented 1 year ago

Well, it's a Torch kernel (elementwise_kernel) which unfortunately is called all the time for any sort of element-wise operation, so it's anyone's guess what it's doing.

But it's definitely a Torch operation that keeps firing in the background. And since ExLlama is single-threaded I can't imagine a way it could keep launching kernels like this unless it was stuck in a loop. But if it were stuck in a loop you wouldn't be able to load another model and use it.

So it sounds like this is an issue with TGW. Maybe @oobabooga has some idea what might be going on? Is ExLlama launched in a separate process or a separate thread?
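One quick way to check the thread side of that question (a hedged sketch, not part of either codebase) is to list the live threads in the webui process while it is supposedly idle:

```python
import threading

# threading.enumerate() returns every thread that is still alive in this
# process; a leftover generation worker would show up here.
for t in threading.enumerate():
    print(f"{t.name} (daemon={t.daemon})")
```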

oobabooga commented 1 year ago

In text-generation-webui the generation runs on a separate thread, yes. It is done using the Iteratorize class here: https://github.com/oobabooga/text-generation-webui/blob/main/modules/callbacks.py#L30
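For readers following along, a wrapper of that kind has roughly this shape (a simplified sketch, not the actual callbacks.py code): a background thread runs the callback-based generation call and feeds a queue that the caller iterates over.

```python
import queue
import threading

class CallbackIterator:
    """Turn a function that reports results via a callback into an iterator."""

    def __init__(self, func):
        self.q = queue.Queue()
        self.sentinel = object()

        def run():
            try:
                func(callback=self.q.put)      # e.g. generate(..., callback=...)
            finally:
                self.q.put(self.sentinel)      # signal end of generation

        # The extra thread that keeps the webui responsive while generating.
        self.thread = threading.Thread(target=run, daemon=True)
        self.thread.start()

    def __iter__(self):
        return self

    def __next__(self):
        item = self.q.get()
        if item is self.sentinel:
            raise StopIteration
        return item
```

If that thread, or something it scheduled, never fully winds down, it could keep issuing Torch calls after generation ends.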

But I have never experienced any idle GPU usage.

leonxia1018 commented 1 year ago


By adding logging in model.py, I was able to narrow the issue down to the cuda_ext.exllama_ext.prepare_buffers function. [screenshot]
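A hypothetical version of that logging shim (the real signature and arguments of prepare_buffers are not shown here):

```python
import time

def log_calls(fn, name):
    """Wrap an extension function so every invocation is printed with a timestamp."""
    def wrapper(*args, **kwargs):
        print(f"[{time.strftime('%H:%M:%S')}] {name} called")
        return fn(*args, **kwargs)
    return wrapper

# Hypothetical patch applied in model.py:
# cuda_ext.exllama_ext.prepare_buffers = log_calls(
#     cuda_ext.exllama_ext.prepare_buffers, "prepare_buffers")
```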