Closed: hailiyidishui closed this issue 1 year ago
The GPUs all exchange tensors in sequence, since the transformer blocks are split across them. The 12 WPS applies to the whole pipeline, so it's 12 in total, not 12 per GPU. Running on a single GPU would of course be more efficient, but you don't really have that option when your VRAM is limited.
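For context, here is a rough sketch of what that sequential split looks like in plain PyTorch. This is not Wrapyfi's actual API; the block count, model size, and device assignment are made up for illustration, and it assumes 4 CUDA devices are available:

```python
import torch
import torch.nn as nn

# Toy pipeline: 8 transformer blocks split evenly over 4 GPUs.
num_blocks, num_gpus = 8, 4
per_gpu = num_blocks // num_gpus
devices = [torch.device(f"cuda:{i}") for i in range(num_gpus)]
blocks = [
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).to(devices[i // per_gpu])
    for i in range(num_blocks)
]

def forward(x):
    # Activations hop from device to device; only one GPU is busy at a time,
    # which is why adding GPUs does not multiply the words-per-second.
    for i, block in enumerate(blocks):
        x = x.to(devices[i // per_gpu])
        x = block(x)
    return x

out = forward(torch.randn(2, 16, 512))  # (batch, seq, d_model)
```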
I'm not sure what you mean by fully utilizing the performance of a GPU. Your tensors are exchanged over the network (or loopback if on a single machine). On top of that, you have to account for the encoding/decoding of the tensors. In practice, it makes more sense to distribute 7B across only 2 machines. If you can fit it on 1, even better, and you could just use the original LLaMA implementation.
The example shown above demonstrates that you could distribute LLaMA over more than just 2 machines: 4, 8, or even 16, however many you like.
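To put a rough number on the encoding/decoding and transfer overhead, here is a back-of-the-envelope check. It assumes pickle-based serialization of a NumPy copy purely for illustration; Wrapyfi's actual wire format and middleware may behave differently:

```python
import pickle
import time

import torch

# One hidden-state tensor per hop: (batch, seq, d_model).
x = torch.randn(1, 512, 4096)

t0 = time.perf_counter()
payload = pickle.dumps(x.cpu().numpy())        # encode before sending
y = torch.from_numpy(pickle.loads(payload))    # decode on the receiving end
t1 = time.perf_counter()

print(f"payload: {len(payload) / 1e6:.1f} MB, encode+decode: {(t1 - t0) * 1e3:.1f} ms")
# This cost is paid on every hop between machines (or processes on loopback),
# so adding more nodes does not speed up a single forward pass.
```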
Thanks Fabawi. I originally thought I could improve inference speed by increasing the number of GPUs, but I found no significant difference in inference speed between 1 card and 4 cards. With 4 cards, GPU utilization was only about 20%, so I wondered whether there was a way to increase GPU utilization and thus improve inference speed.
I've tried to use multi-GPU (here, 4 GPUs) inference with the 7B model, but I found the GPU utilization is much lower than with a single GPU. It seems Wrapyfi with LLaMA can't fully utilize the GPU. For the performance mentioned in the README, what total WPS should I use, 12 or 4x12?