turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Improve model load performance #171

Closed abstractdescutcheon closed 7 months ago

abstractdescutcheon commented 9 months ago

Caveat: My system is WSL2, so I don't know whether this improves performance on other setups.

Model load times seemed slower than they should be: even small models (~7GB) were taking minutes to load. Putting the model files on a solid-state drive did not improve times, but I could cat model.safetensors > /dev/null in about 10 seconds, so the loader must be the bottleneck. It turns out the safetensors library is the culprit; I didn't dig into why it's so slow. But the safetensors format is simple, and it was easy to implement a reader in Python. After making that change, model load times dropped to about 20 seconds, which is much closer to the expected time.
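For context, the safetensors layout is just an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and data offsets, and then the raw tensor data. A minimal reader could look roughly like the sketch below (an illustration of the format rather than the actual patch; the dtype table only covers a few common types):

import json
import struct

import torch

# Subset of safetensors dtype strings mapped to torch dtypes (extend as needed).
_DTYPES = {
    "F32": torch.float32,
    "F16": torch.float16,
    "BF16": torch.bfloat16,
    "I32": torch.int32,
    "I16": torch.int16,
    "I8": torch.int8,
    "U8": torch.uint8,
}

def load_safetensors(path):
    """Read a .safetensors file with plain file I/O instead of the safetensors library."""
    tensors = {}
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian length of the JSON header.
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        for name, info in header.items():
            if name == "__metadata__":
                continue
            begin, end = info["data_offsets"]  # offsets are relative to the data section
            f.seek(data_start + begin)
            buf = bytearray(f.read(end - begin))  # bytearray keeps the tensor writable
            t = torch.frombuffer(buf, dtype=_DTYPES[info["dtype"]]).reshape(info["shape"])
            tensors[name] = t
    return tensors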

After

$ time python test_inference.py -m models/TheBloke_Llama-2-13B-chat-GPTQ -p "Our story begins in the Scottish town of Auchtermuchty, where once" -t 10 -gs 4,12
 -- Model: models/TheBloke_Llama-2-13B-chat-GPTQ
 -- Options: ['gpu_split: 4,12', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

Our story begins in the Scottish town of Auchtermuchty, where once a year, the villagers gather for the

 -- Response generated in 0.51 seconds, 10 tokens, 19.55 tokens/second (includes prompt eval.)

real    0m25.107s
user    0m9.687s
sys     0m5.440s

Before

$ time python test_inference.py -m models/TheBloke_Llama-2-13B-chat-GPTQ -p "Our story begins in the Scottish town of Auchtermuchty, where once" -t 10 -gs 4,12
 -- Model: models/TheBloke_Llama-2-13B-chat-GPTQ
 -- Options: ['gpu_split: 4,12', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

Our story begins in the Scottish town of Auchtermuchty, where once a year, the villagers gather for the

 -- Response generated in 0.52 seconds, 10 tokens, 19.30 tokens/second (includes prompt eval.)

real    2m56.878s
user    0m10.909s
sys     0m28.455s

Note: TheBloke_Llama-2-13B-chat-GPTQ is 6.8GB

turboderp commented 9 months ago

This is very strange. I'd like to investigate a bit more and either break up with the safetensors library entirely or figure out why it's acting up and fix it. It's definitely not normal for it to take that long. I've only ever seen it load at roughly the raw read speed of the drive, but I'm thinking this could have to do with seeking?

Is this on an HDD by any chance?

xloem commented 9 months ago

You could try running with something like python3 -m trace --timing --trace to narrow down where the time is being spent.
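For example (an illustrative invocation, not from this thread; the trace output is very verbose, so redirecting it to a file helps):

$ python3 -m trace --timing --trace test_inference.py \
      -m models/TheBloke_Llama-2-13B-chat-GPTQ -p "test" -t 10 > load_trace.log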

I too have occasionally patched around huggingface's libraries to remove data copying and seen incredible performance gains. You can even share device-loaded models across processes and do other freaky stuff.

abstractdescutcheon commented 9 months ago

Ok, good to know that this isn't normal. The model files are on an NVMe drive, so it's not a seeking issue. These numbers are also with a warm cache, so I'd bet a lot of the file is already paged. My guess is that it has something to do with how WSL handles mmap on NTFS. I'll look into it more; this might just be an issue for me. If anyone else notices slow load times, give this a 👍
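One generic way to rule the page cache in or out (a sketch, not something tested here) is to drop the Linux page cache between runs and compare timings. Note that inside WSL this does not clear the Windows-side cache for files on NTFS mounts:

$ sync
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time python test_inference.py -m models/TheBloke_Llama-2-13B-chat-GPTQ -p "test" -t 10 -gs 4,12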

turboderp commented 7 months ago

I implemented something like this now, except probably a lot more needlessly complicated. I'm not sure yet whether it's Windows or WSL compatible.