Closed: fengyang95 closed this issue 2 months ago.

I tried running deepseek-v2 on an 8×L40 (46 GB each) configuration, but I hit a GPU out-of-memory (OOM) error. Why would such a large amount of GPU memory still lead to OOM?
Ok, if you want to put all modules on the GPU, then the required VRAM per layer is approximately 5.2 GB, so 9 layers on one GPU (9 × 5.2 GB ≈ 46.8 GB) may cause OOM. You can offload one layer of experts to the CPU.
@Azure-Tang But I only put 8 layers on each GPU; why is it still OOM? Do I need to reserve some space?
The 5.2 GB figure is an approximation, and the actual value may be slightly higher, so 8 layers per GPU could already be at the limit (8 × 5.2 GB ≈ 41.6 GB, before counting activations and cache). You can consider offloading one layer's experts to the CPU for each GPU. This way, the CPU will handle the experts modules of 8 layers in total, with the remaining modules computed on the GPUs.
Do you mean to allocate 7 layers to each GPU and offload 1 layer to the CPU?
Not a whole layer to the CPU; offloading only the `experts` module in one decoder layer for each 8-layer GPU seems enough.
I didn't quite understand your point. Could you please demonstrate it using a configuration?
For example, say you have 8 layers on each GPU, which causes OOM, and those 8 layers' parameter distribution looks like this:
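A rough sketch of that placement, written as an optimize-rule in the style of the example yamls (the layer range, regex, and device ids here are illustrative assumptions, not the exact config from this thread):

```yaml
# Sketch: all 8 layers (0-7) of this GPU's slice live entirely on cuda:0 --
# attention, shared experts, and routed experts all on the GPU.
- match:
    name: "^model\\.layers\\.[0-7]\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
```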
Then you offload that one layer's experts module to the CPU (which we have done in our example yamls), so the layer will look like this:
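Following the pattern of the example yamls, the rule for that one layer might look like this (the layer index and class path are assumptions for illustration):

```yaml
# Sketch: move only layer 7's routed-experts module to the CPU;
# the rest of layer 7 (attention, shared experts) stays on cuda:0.
- match:
    name: "^model\\.layers\\.7\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False  # don't recursively inject submodules of this module
```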
And keep the other 7 layers whole on the GPU, like this:
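For example (again a sketch, with assumed layer indices and device ids):

```yaml
# Sketch: the remaining 7 layers (0-6) stay whole on cuda:0.
- match:
    name: "^model\\.layers\\.[0-6]\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
```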
This will offload some parameters to your RAM. If it still OOMs, you can do the same for another layer.
Hi, I’ll be closing this topic as there hasn’t been any response~