Qubitium opened 11 months ago
I guess there are quite a few limitations that will prevent you from using more than 16 GPUs efficiently on a single node within a single PyTorch process (CUDA_VISIBLE_DEVICES should still work to run multiple PyTorch instances side by side).
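As a rough sketch of that multi-process workaround (the script name and GPU split below are hypothetical), each worker can be given its own slice of the node by setting CUDA_VISIBLE_DEVICES before the child process initializes CUDA:

```python
import os
import subprocess

# Hypothetical layout: split 32 physical GPUs across 4 workers of 8 GPUs each,
# so every PyTorch process stays below the per-process device limit.
NUM_GPUS = 32
GPUS_PER_WORKER = 8

procs = []
for start in range(0, NUM_GPUS, GPUS_PER_WORKER):
    visible = ",".join(str(i) for i in range(start, start + GPUS_PER_WORKER))
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible)
    # "worker.py" is a placeholder for your per-process entry point.
    procs.append(subprocess.Popen(["python", "worker.py"], env=env))

for p in procs:
    p.wait()
```

Inside each worker, torch.cuda.device_count() then only reports the GPUs listed in that worker's CUDA_VISIBLE_DEVICES.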
With regard to MIG devices: a single CUDA process can use at most one MIG device at a time (see here), so the limit should not be a problem.
Nonetheless, I can prepare a PR to increase the limit.
Depending on usage (inference or training), I would say a large percentage of users will never hit a PCIe bandwidth bottleneck from the extra PCIe bridging/switching.
> I guess there are quite a few limitations that will prevent you from using more than 16 GPUs efficiently on a single node
@albanD - do you know why we have a hard coded limit? Is it something with a limited number of bits in the C++ torch.device representation?
Might be because we preallocate some CUDAStreams and CuDNNStates.

My PR linked above increases the limit to 64, which seems to compile and pass all tests without errors. However, I don't have a way to actually test it on a machine with 16+ GPUs (and I guess neither does the test pipeline).
It would also be prudent to set it beyond 64. Current NVIDIA-sanctioned 8x A100/H100 systems can MIG each GPU into 7 slices, resulting in 56 devices. Some of the 8x A100/H100 chassis have 11 PCIe slots, for a potential maximum of 11 A100s x 7 = 77 slices in 2023, without any PLX switching. With PLX switch expansion, that number can be pushed far higher. Unless there is a physical CUDA bit limit that NVIDIA needs to fix, this should be set to something very large, such as 1024/2048, for future proofing.
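A quick back-of-the-envelope check of those slice counts (just the arithmetic assumed in the comment above, with 7 MIG slices per GPU):

```python
# 7 MIG slices per A100/H100, as assumed above.
SLICES_PER_GPU = 7

hgx_node = 8 * SLICES_PER_GPU    # standard 8-GPU node  -> 56 MIG devices
pcie_node = 11 * SLICES_PER_GPU  # 11 PCIe slots filled -> 77 MIG devices

print(hgx_node, pcie_node)  # 56 77
```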
Again, a CUDA process can only use a single MIG device right now:
ubuntu@hostname:~$ nvidia-smi
Sat Dec 9 11:56:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:06:00.0 Off | On |
| N/A 33C P0 48W / 400W | 24MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 3 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 4 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 5 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 6 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
>>> import torch
>>> torch.cuda.device_count()
1
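If a process does need to target a particular MIG slice, one common approach (a sketch; the MIG UUID below is a placeholder you would read from `nvidia-smi -L`) is to pin it via CUDA_VISIBLE_DEVICES before CUDA is initialized:

```python
import os

# Placeholder UUID -- list the real MIG device UUIDs with `nvidia-smi -L`.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# The variable must be set before the first CUDA call; doing it before
# importing torch is the safest ordering.
import torch

print(torch.cuda.device_count())  # expected: 1, since one process sees one MIG slice
```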
If the overhead of the statically compiled value is negligible, we can surely increase it further.
On the PyTorch side this hard-coded limit is not really required beyond making global state handling simpler (just use static arrays instead of having to handle growing them dynamically). Caffe2 seems to be doing more complex logic with this (in particular to register NCCL ops), so that might have a bigger impact there.
I think that bumping it to 64 should be pretty safe and we can do that. I'm a bit wary of going to very large values right now without knowing the details on the Caffe2 side.
Re-opening since the PR has been reverted
Hi @tringwald @albanD
I am running into this issue again with the new cap of 64 GPUs. The TensorWave architecture currently supports 80 GPUs per node and plans to support 128 and 256 GPUs per node later this year.
Can we increase this value further to support this architecture and provide some future proofing?
128 would be possible, I guess. However, our DeviceIndex is currently an int8_t, so everything above 255 would need some larger changes.
The change of device index to int16_t might be a bit of work. But sounds fair.
If that change will take a significant amount of work, can we max out the int8_t value and merge that first, and then work on the int16_t change after?
Thanks!
I'm almost done with the int16_t changes. There's just one legacy JIT unit test that keeps failing on some ancient Python version ...
@tringwald Possible to just use the new int16_t maximum of 32767 and forget about this problem for the next couple of years?
We have static objects based on this size. So I don't think we want to bump it too much. I don't think it's too much of a problem to keep increasing it as available hardware gets updated.
Hi @tringwald,
Given the new issues with the Meta internal build, can we take the intermediate step of increasing the compile-time max GPUs to 255 with int8? We can then sort out the upgrade to int16 when we have feedback from Meta.
That would solve our short-term issue and buy more time for the long-term int16 solution.
Thanks!
@deke997 The problem is that int8 only goes up to 127.
https://github.com/pytorch/pytorch/issues/115331#issuecomment-1937002056
> 128 would be possible, I guess. However, our DeviceIndex is currently an int8_t, so everything above 255 would need some larger changes.
https://github.com/pytorch/pytorch/pull/119639#user-content-fn-1-b07af97368055281f2f3a106da5d0181
This field was unsigned, so I guess this has also been undefined behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the ArgumentInfo struct. When I switched the DeviceIndex to int16_t, it actually stayed 255 after unpacking from ArgumentInfo again, as the DeviceIndex was now wide enough that it didn't wrap back to -1.
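A small Python sketch of the wraparound described above, using ctypes fixed-width integers to stand in for the C++ field widths (the actual ArgumentInfo layout is not reproduced here):

```python
import ctypes

# The default device index -1, written into an unsigned 8-bit field, wraps to 255.
stored = ctypes.c_uint8(-1).value
print(stored)  # 255

# Read back into a signed 8-bit DeviceIndex, 255 wraps back to -1 ...
print(ctypes.c_int8(stored).value)  # -1

# ... but read back into a signed 16-bit DeviceIndex, 255 stays 255,
# so the -1 sentinel no longer round-trips after widening the type.
print(ctypes.c_int16(stored).value)  # 255
```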
128 should be pretty straightforward, but DeviceIndex is a signed int8 right now, so using the full 255 indices + 1 default device would require some rework.
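For reference, the index ranges being juggled here (a quick check; the names mirror the C++ fixed-width types):

```python
# Maximum representable values for the integer widths discussed in this thread.
int8_max = 2**7 - 1    # 127   -- current signed DeviceIndex (int8_t)
uint8_max = 2**8 - 1   # 255   -- what an unsigned 8-bit index could address
int16_max = 2**15 - 1  # 32767 -- a signed 16-bit DeviceIndex (int16_t)

print(int8_max, uint8_max, int16_max)  # 127 255 32767
```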
Looking at the code, I assume that all numbers that are exposed need to be signed integers, since Python itself has no unsigned int. In C/C++ you can cast between int/uint all you want, but considering that input/output to/from Python is signed, it should stay signed, especially since the code already uses a default value of -1.
With that said, using -1 in an unsigned context is just asking for trouble, as the second comment alluded to.
Right now, we actually do not allow negative device indices from/to the Python side. Device index -1 is only a convention on the C++ side. But I agree, we should probably keep it signed nonetheless, as usage of -1 is very widespread and relying on integer underflow is just asking for trouble.
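To illustrate that Python-side convention (a sketch; the exact exception type and message for a negative index may vary across PyTorch versions):

```python
import torch

# An unspecified device index surfaces as None on the Python side, not -1.
print(torch.device("cuda").index)     # None
print(torch.device("cuda", 3).index)  # 3

# Negative indices are rejected at the Python boundary.
try:
    torch.device("cuda", -1)
except (RuntimeError, ValueError) as exc:
    print("rejected:", exc)
```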
Reopening until the int16_t DeviceIndex is merged.
🚀 The feature, motivation and pitch
https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDAMacros.h
The current macro is set to 16. We have 19 GPUs on a single machine. Please increase this value to 32 or, better, 2048 for future proofing. As virtualization of GPU slices becomes prevalent, 16 is not enough. For example, even a single 8x A100 node can be MIG-partitioned into more than 16 slices.
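As a way to see the mismatch on such a machine, a small sketch (assuming nvidia-smi is on the PATH; `nvidia-smi -L` prints one line per GPU and per MIG device) that compares what the driver reports with what torch can enumerate:

```python
import subprocess

import torch

# `nvidia-smi -L` lists physical GPUs and, when MIG is enabled, their MIG devices.
listing = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
driver_lines = [line for line in listing.splitlines() if line.strip()]

print(f"Devices listed by the driver: {len(driver_lines)}")
print(f"Devices visible to torch:     {torch.cuda.device_count()}")
```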
cc @ptrblck