turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Feature request: Multi-GPU conversion #349

Closed: richardburleigh closed this issue 6 months ago

richardburleigh commented 7 months ago

Firstly, thank you for all your amazing work!

For those of us with multiple smaller-VRAM cards (e.g. 2x 3080s), conversion often runs out of memory (OOM).

Multi-GPU support for conversion would be amazing for solving this.

turboderp commented 7 months ago

It's tough since some of the operations really do just require a large amount of VRAM. I'm not sure where anything could be subdivided any further, and the script already swaps data to system RAM whenever it can.
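
A minimal sketch of that swap-to-RAM pattern, assuming PyTorch; this is illustrative only, not exllamav2's actual conversion code:

```python
# Illustrative offloading pattern (not exllamav2's internals): the large
# calibration state lives in system RAM and is moved to the GPU only
# while the current layer is being processed.
import torch

num_layers = 32                              # hypothetical model depth
calib_state = torch.randn(64, 2048, 4096)    # resides in system RAM

for layer_idx in range(num_layers):
    gpu_state = calib_state.to("cuda")       # swap in for this layer
    # ... process layer `layer_idx` against gpu_state ...
    calib_state = gpu_state.to("cpu")        # swap back out to free VRAM
```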

bani6809 commented 7 months ago

Could the conversion be parallelized, with one layer per GPU?

I have five 3090s, an EPYC 7402P, and 256 GB of CPU RAM.

turboderp commented 7 months ago

Sadly no. The layers have to be processed in order, and a lot of the time is spent just swapping data to system RAM, since the calibration state is so large. I've done some experiments with splitting the process across multiple GPUs, but there's only so much splitting you can do given the nature of the algorithm. Even where a split is possible (e.g. assigning the Q, K and V matrices to separate GPUs, since they are parallel in the model), it's hard to overcome the overhead of moving data between devices, especially with how weak multithreading is in Python.
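
A minimal sketch of the device-transfer problem, assuming PyTorch and at least two visible GPUs; shapes and layout are illustrative, not the library's internals:

```python
# Even if the Q and K projections run on separate devices, the shared input
# and the results still have to cross the bus on every step, which eats into
# whatever the parallel matmuls saved.
import torch

x   = torch.randn(8, 4096, device="cuda:0")      # calibration hidden states
w_q = torch.randn(4096, 4096, device="cuda:0")   # Q projection on GPU 0
w_k = torch.randn(4096, 4096, device="cuda:1")   # K projection on GPU 1

q = x @ w_q                                      # stays on GPU 0
k = x.to("cuda:1") @ w_k                         # copy input to GPU 1 first...
k = k.to("cuda:0")                               # ...then copy the result back
```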

bani6809 commented 7 months ago

How about being able to convert multiple models at once? E.g. tell one instance of `convert.py` to use only GPU 0, and another instance to use only GPU 1.

turboderp commented 7 months ago

You can always run `CUDA_VISIBLE_DEVICES=0 python convert.py ...` in one shell, `CUDA_VISIBLE_DEVICES=1 python convert.py ...` in another, and so on. That works fine.
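
A minimal sketch of that workflow in Python, in case you want to script it; the model paths are placeholders, and the `convert.py` arguments should be checked against `python convert.py -h`:

```python
# Launch one conversion per GPU by restricting each process to a single
# device via CUDA_VISIBLE_DEVICES, then wait for all of them to finish.
# Paths and convert.py arguments below are placeholders.
import os
import subprocess

jobs = [
    ("0", "/models/model-a"),   # (GPU id, model directory)
    ("1", "/models/model-b"),
]

procs = []
for gpu, model_dir in jobs:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    procs.append(subprocess.Popen(
        ["python", "convert.py", "-i", model_dir, "-o", model_dir + "-work"],
        env=env,
    ))

for proc in procs:
    proc.wait()
```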

sophosympatheia commented 7 months ago

> You can always run `CUDA_VISIBLE_DEVICES=0 python convert.py ...` in one shell, `CUDA_VISIBLE_DEVICES=1 python convert.py ...` in another, and so on. That works fine.

This is what I do and it works great. Thanks for all your work on this project, @turboderp!