turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License
3.18k stars · 233 forks

extremely high CPU usage #508

Closed · sfttfs closed this 1 week ago

sfttfs commented 1 week ago

Hello,

I was playing around with the dynamic generator and found that during inference, all CPU cores are at 100% usage. I had to limit CPU usage with torch.set_num_threads. Is this normal behavior? I thought inference happens on the GPU, so I didn't expect CPU usage to be this high.

What I tried was simply passing a list of 100 prompts to the generate() function. I have PyTorch 2.3.1, Python 3.11 and CUDA 11.8. Hardware is an RTX 8000 48 GB and a 28-core CPU.
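Roughly something like this, for reference (the model path and prompt contents are placeholders, and the loading boilerplate just follows the dynamic generator examples; torch.set_num_threads is the workaround I mentioned):

import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

torch.set_num_threads(1)  # workaround: cap PyTorch's intra-op CPU thread pool

config = ExLlamaV2Config("/path/to/model")  # placeholder path
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model = model, cache = cache, tokenizer = tokenizer)

# Pass a list of 100 prompts to generate() in one call
prompts = [f"Prompt {i}: ..." for i in range(100)]  # placeholder prompts
outputs = generator.generate(prompt = prompts, max_new_tokens = 128)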

Thanks for the amazing work!

turboderp commented 1 week ago

By default it uses a pool of 16 threads for sampling. You can set a different number with max_sampling_threads when creating the generator.
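For example (the value 4 is arbitrary, and this assumes the model, cache and tokenizer are already loaded):

from exllamav2.generator import ExLlamaV2DynamicGenerator

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    max_sampling_threads = 4,  # default is 16
)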

There's something else going on if torch.set_num_threads makes a difference, though. It might be Torch getting a little excited sometimes. I've seen this behavior before, and I think it might have a very low threshold for when to use multiple threads for simple CPU tensor operations. Like it'll launch 16 threads to copy a 16-element tensor, that sort of thing. I'll look into it.

mushinbush commented 1 week ago

Hello! I have encountered a similar issue. I am using the latest version of tabbyAPI (exllamav2 0.1.5). My CPU is a 12900K (8P+8E, 24 threads total), with 2x 3090 GPUs. cache_mode: FP16

I found that when I load Qwen2-72B-Instruct-3.5bpw or Cat-Llama-3-70B-instruct-exl2_3.5bpw, generation causes the CPU to reach 100%. However, when I load Midnight-Miqu-70B-v1.5_exl2_3.5bpw, the CPU does not experience nearly as much load. The generation speed doesn't differ much; both are around 13-15 T/s with ~13K cached tokens.

sfttfs commented 1 week ago

I tried some other code where exllamav2 is not involved and got the same high CPU usage, so I guess it really has something to do with PyTorch. This only appeared after I messed up my conda env and reinstalled everything (with the same PyTorch version, however). Really weird.

mushinbush commented 1 week ago

I did some testing, and it appears to be an issue with torch 2.3.1. I downgraded torch to 2.2.0 in tabbyAPI's environment and ran inference with Qwen2-72B-Instruct-3.5bpw again. The CPU load issue did not occur; it now behaved similarly to Midnight-Miqu-70B-v1.5_exl2_3.5bpw.

sfttfs commented 1 week ago

Downgrading PyTorch didn't work for me. PyTorch still uses all CPU cores at 100% unless torch.set_num_threads() is used...

mushinbush commented 1 week ago

Would you mind trying the command pip show exllamav2 and posting the output?

sfttfs commented 1 week ago

here it is:

Name: exllamav2
Version: 0.1.5
Summary: 
Home-page: https://github.com/turboderp/exllamav2
Author: turboderp
Author-email: 
License: MIT
Location: /home/qyuan/miniconda3/envs/env0/lib/python3.11/site-packages
Requires: fastparquet, ninja, numpy, pandas, pygments, regex, safetensors, sentencepiece, torch, websockets
Required-by: 

btw I'm experiencing this during inference. not a huge problem though...

mushinbush commented 1 week ago

Have you uninstalled exllamav2 and reinstalled it after downgrading PyTorch?

pip uninstall exllamav2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .

sfttfs commented 1 week ago

Yeah, I tried installing PyTorch 2.2.0 in a fresh conda env and building exllamav2 from source. It's really a PyTorch problem. Even a loop multiplying two 10x10 matrices together results in 100% usage on all CPUs...
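For reference, a minimal standalone loop of the kind I mean (the iteration count is arbitrary; no exllamav2 involved):

import torch

# torch.set_num_threads(1)  # uncommenting this keeps CPU usage down

a = torch.rand(10, 10)
b = torch.rand(10, 10)

# Tiny matmuls in a loop; watch CPU usage (e.g. in htop) while this runs
for _ in range(1_000_000):
    c = a @ b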

turboderp commented 1 week ago

So I've been working on this for a bit. It does seem like PyTorch has become more aggressively multi-threaded in recent versions, which no doubt helps its performance on large compute workloads, but the tradeoff is that it will sometimes use way more threads than it needs to for small tasks.

The thresholds are a little unpredictable, but that's likely the reason why Qwen2 (with its larger vocabulary) uses all CPU cores while Miqu doesn't. It also looks like once it "spins up" its multi-threaded CPU BLAS engine, it wants to keep going for a bit before it winds down. Not sure if there's some unintended interaction with the faster interpreter in Python 3.12, too.

The recent commits to the dev branch include a bunch of CPU optimizations, and I've decided on globally setting the number of threads to one for the time being, since it improves performance significantly. The commits from cf864726c4fa4a58339da6d78e37a8d8bca5d32e onwards improve tokens/second by 33% overall for CPU-bound (i.e. very small) models, and set_num_threads(1) alone accounts for about a fifth of that.