MikeChenfu opened this issue 1 year ago
Hi @MikeChenfu

I think you misunderstand the meaning of the argument gpu_margin_mem_ratio. When using the auto policy in Gemini, we automatically detect your GPU memory usage and try to make full use of your CUDA memory. By default, Gemini keeps as many parameters as possible in CUDA during training, but some users want to place optimizer states in CUDA as well and update part of the parameters on the GPU. gpu_margin_mem_ratio is the ratio of the gap between your maximum CUDA memory usage and your full CUDA capacity that is used to store optimizer states.
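As a rough illustration, here is where gpu_margin_mem_ratio is usually passed when wrapping a model and optimizer with Gemini. This is only a sketch: the wrapper and import names (ColoInitContext, zero_model_wrapper, zero_optim_wrapper, HybridAdam, get_current_device) follow the ColossalAI examples of this period but may differ between versions, and build_opt_model is a hypothetical helper standing in for however you construct the OPT model.

```python
import torch
from colossalai.nn.optimizer import HybridAdam
from colossalai.utils import get_current_device
from colossalai.zero import ColoInitContext, zero_model_wrapper, zero_optim_wrapper

# Build the model under Gemini's init context so its parameters are managed by Gemini.
with ColoInitContext(device=torch.device('cpu')):
    model = build_opt_model()  # hypothetical helper that constructs the OPT model

# 'auto' lets Gemini decide at each step how many parameters stay in CUDA,
# based on the detected GPU memory usage.
gemini_config = dict(device=get_current_device(), placement_policy='auto', pin_memory=True)
model = zero_model_wrapper(model, zero_stage=3, gemini_config=gemini_config)

optimizer = HybridAdam(model.parameters(), lr=1e-4)

# gpu_margin_mem_ratio is the fraction of the spare CUDA memory (full capacity
# minus peak usage) that may hold optimizer states; 0.0 keeps them all on the CPU.
optim_config = dict(gpu_margin_mem_ratio=0.0)
optimizer = zero_optim_wrapper(model, optimizer, optim_config=optim_config)
```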
As for the problem on multiple nodes, we may fix this bug soon.
Thanks @1SAA for the update. Previously I had to adjust gpu_margin_mem_ratio for better performance. It is good to hear that GPU memory usage is detected automatically. Does it mean I can just use the auto policy without passing gpu_margin_mem_ratio as an input parameter?
If you want to store more optimizer states in CUDA and update part of the parameters on the GPU, you can increase gpu_margin_mem_ratio.
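For example, under the same hypothetical setup as the sketch above, keeping more optimizer states on the GPU just means raising the ratio (0.8 is an illustrative value, not a recommendation):

```python
# Allow up to 80% of the spare CUDA memory (capacity minus peak usage)
# to hold optimizer states, so more parameter updates happen on the GPU.
optim_config = dict(gpu_margin_mem_ratio=0.8)
optimizer = zero_optim_wrapper(model, optimizer, optim_config=optim_config)
```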
🐛 Describe the bug
Hello, I am training an OPT model on A100 GPUs. I found it used 76 GB of GPU memory when I use the auto mode and set gpu_margin_mem_ratio to 0. If I use the cpu mode, it only takes about 15 GB. In my understanding, both methods should use the same amount of GPU memory.

Also, I got different connection errors when I use the auto mode and set gpu_margin_mem_ratio to a non-zero value like 0.2 across two nodes. It works well on a single node, but it seems the gpu_margin_mem_ratio value does not control GPU memory usage.

Environment