If disabling custom allreduce increases the cudagraph memory, then I suppose the extra memory comes from NCCL. It is machine-topology dependent.
I have a script to test this; you can give it a try. It reports how much memory NCCL costs.
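The script itself isn't attached here, but the idea can be sketched as follows (my own minimal version, not the referenced script; run with `torchrun --nproc_per_node=2 nccl_mem.py`):

```python
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")

    free_before, _ = torch.cuda.mem_get_info()
    # The first allreduce forces NCCL to set up its transport and
    # allocate its internal buffers, which is the memory we want to see.
    t = torch.ones(1024, device="cuda")
    dist.all_reduce(t)
    torch.cuda.synchronize()
    free_after, _ = torch.cuda.mem_get_info()

    print(f"rank {rank}: NCCL cost ~{(free_before - free_after) / 2**20:.1f} MiB")

if __name__ == "__main__":
    main()
```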
Is the increased memory usage only happening on A10G when cudagraphs are used? What's the memory usage without cuda graphs?
It would also be helpful to run with the env var `NCCL_DEBUG=INFO` and check the resulting logs.
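A minimal sketch of how one might apply both suggestions with vLLM's offline `LLM` API (the model name is just a placeholder):

```python
import os

# Verbose NCCL logging; must be set before any NCCL communicator is created.
os.environ["NCCL_DEBUG"] = "INFO"

from vllm import LLM

# enforce_eager=True skips cudagraph capture, so comparing memory between
# this run and one with enforce_eager=False isolates the cudagraph cost.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    tensor_parallel_size=2,
    enforce_eager=True,
)
```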
@trevor-m Hi Trevor, nice to see you here. This issue turns out to be related to the cuda driver version.
Update: This discrepancy is due to cuda driver version. Using cuda 12.4 or later significantly reduces the cudagraph memory usage.
Cudagraph memory usage by CUDA toolkit/driver version:

| gpu type | cuda_12.2.2_535.104.05 | cuda_12.3.2_545.23.08 | cuda_12.4.1_550.54.15 |
|---|---|---|---|
| a10g | 3.8086 | 3.8086 | 1.1367 |
| L4 | 3.7891 | 3.7891 | 1.1367 |
@ymwangg do you mean cuda runtime version or cuda driver version?
@youkaichao cuda driver version. In the experiment, I reinstalled the cuda driver but kept the cuda toolkit unchanged (12.1).
iiuc, the cuda driver version is something like `555.42.02`. The `CUDA Version: 12.5` shown alongside it (e.g., in `nvidia-smi` output) is just the highest cuda runtime version that driver can support. See the documentation for details.
Can you report the cuda driver version instead?
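A quick way to print both versions, as a sketch (assumes the `pynvml` package, i.e. `nvidia-ml-py`, is installed):

```python
import torch
import pynvml

pynvml.nvmlInit()
# Driver version, e.g. "550.54.15" (older pynvml releases return bytes).
print("driver version :", pynvml.nvmlSystemGetDriverVersion())
# Runtime/toolkit version PyTorch was built against, e.g. "12.1".
print("runtime version:", torch.version.cuda)
pynvml.nvmlShutdown()
```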
Right. I've updated the table.
Your current environment
🐛 Describe the bug
I noticed the memory consumption of cudagraphs with tensor parallelism on G5/G6 instances (A10G/L4 GPUs) is significantly higher than on P4d instances (A100 GPUs). I'm not sure whether this is expected due to the lack of NVLink support. It would be great if it could be mitigated, since A10G/L4 GPUs have smaller memory capacity.
Below is the memory consumption of cudagraph on different GPUs.
cc @youkaichao any suggestions?
Update: This discrepancy is due to the cuda driver version. Using cuda 12.4 or later significantly reduces the cudagraph memory usage.