modular-ml / wrapyfi-examples_llama

Inference code for Facebook LLaMA models with Wrapyfi support
GNU General Public License v3.0

What's the total WPS when I use multiple GPUs for inference? #7

Closed hailiyidishui closed 1 year ago

hailiyidishui commented 1 year ago
| GPU ID | Type | CPU Mem. | Power | GPU Mem. | WPS |
|---|---|---|---|---|---|
| 0 | TITAN Xp 12GB | 2.4 GB | 79 W / 250 W | 5.6 GB | 12 |
| 1 | TITAN Xp 12GB | 1.3 GB | 63 W / 250 W | 5.6 GB | 12 |
| 2 | TITAN X 12GB | 1.3 GB | 89 W / 250 W | 5.5 GB | 12 |
| 3 | TITAN X 12GB | 1.3 GB | 99 W / 250 W | 6.2 GB | 12 |

I've tried using multiple GPUs (4 here) for inference with the 7B model, but I found that GPU utilization is much lower than with a single GPU. It seems Wrapyfi with LLaMA can't fully utilize the GPUs. For the performance mentioned in the README, what total WPS should I count: 12 or 4x12?

fabawi commented 1 year ago

They all exchange tensors in sequence since the transformer blocks are split across the GPUs. The 12 WPS applies to all, so it's 12 in total. Running on a single GPU should of course be more efficient, but you don't really have that option when your VRAM is limited.
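To make that concrete, here is a back-of-the-envelope sketch (not from the repo; the block split and all timings are invented for illustration) of why splitting the transformer blocks over more GPUs neither multiplies WPS nor keeps every GPU busy during single-stream decoding:

```python
# Hypothetical numbers only: a 32-block 7B model split evenly over 4 GPUs.
# Each generated token must pass through every block in sequence, plus one
# tensor exchange per GPU-to-GPU hop, so only one GPU computes at any moment.
num_gpus = 4
blocks_per_gpu = 8          # 32 transformer blocks / 4 GPUs (assumed split)
time_per_block = 0.0025     # seconds of compute per block (made up)
transfer_time = 0.005       # seconds per inter-GPU tensor exchange (made up)

token_latency = num_gpus * blocks_per_gpu * time_per_block \
    + (num_gpus - 1) * transfer_time
wps = 1.0 / token_latency

# Roughly what nvidia-smi would report as utilization for each card:
per_gpu_util = (blocks_per_gpu * time_per_block) / token_latency

print(f"total WPS: {wps:.1f}, per-GPU utilization: {per_gpu_util:.0%}")
# With these made-up numbers: ~10.5 WPS in total (not 4x), ~21% per GPU.
```

Adding more GPUs shortens each stage but adds more hops, so the total stays in the same ballpark; splitting only helps when the model doesn't fit on fewer cards.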

I'm not sure what you mean by fully utilizing the performance of a GPU. Your tensors are exchanged over the network (or loopback if on a single machine). Not only that, but you have to take the encoding/decoding of the tensors into account. In practice, it makes more sense to distribute 7B over 2 machines only. If you can fit it on 1, even better, and you could just use the original implementation of LLaMA.
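To get a feel for what the encoding/decoding alone costs per hop, here is a quick timing sketch (assumptions: PyTorch is available, fp16 activations with the 7B hidden size of 4096 for a single token, and plain pickle as a stand-in for whatever serialization the middleware actually performs; real Wrapyfi costs and network latency will differ):

```python
# Self-contained timing of tensor encode/decode overhead per hop.
# Transport (network or loopback) latency is not included.
import pickle
import time

import torch

activation = torch.randn(1, 1, 4096, dtype=torch.float16)  # one token's hidden state

start = time.perf_counter()
for _ in range(1000):
    payload = pickle.dumps(activation)   # encode before publishing
    restored = pickle.loads(payload)     # decode on the receiving side
elapsed = (time.perf_counter() - start) / 1000

print(f"~{elapsed * 1e3:.3f} ms encode+decode per hop, payload {len(payload):,} bytes")
```

Every extra machine adds at least one such hop (plus transport latency) to every generated token, which is why fewer, larger shards usually win.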

The example shown above demonstrates that you can distribute LLaMA over more than just 2 machines: 4, 8, 16, or however many you like.

hailiyidishui commented 1 year ago

Thanks Fabawi. I originally thought I could improve inference speed by increasing the number of GPUs, but I found no significant difference in speed between 1 card and 4 cards. With 4 cards, GPU usage was only about 20%, so I wondered whether there was a way to increase GPU utilization and thus improve inference speed.