pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Please increase "Number of CUDA devices" recognized by pytorch. #115331

Open Qubitium opened 11 months ago

Qubitium commented 11 months ago

🚀 The feature, motivation and pitch

RuntimeError: Number of CUDA devices on the machine is larger than the compiled max number of gpus expected (16). Increase that and recompile.

https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDAMacros.h

The current macro is set to 16. We have 19 GPUs on a single machine. Please increase this value to 32, or better, 2048 for future-proofing. As virtualization of GPU slices becomes prevalent, 16 is not enough. For example, even a single 8x A100 node can be MIG-sliced into more than 16 instances.
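
For reference, the limit is a single compile-time constant in that header; the relevant definition is:

// c10/cuda/CUDAMacros.h (excerpt)
#define C10_COMPILE_TIME_MAX_GPUS 16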

Alternatives

No response

Additional context

No response

cc @ptrblck

tringwald commented 11 months ago

I guess there are quite a few limitations that will prevent you from using more than 16 GPUs efficiently on a single node within a single PyTorch process (CUDA_VISIBLE_DEVICES should still work for running multiple PyTorch instances). With regard to MIG devices: a single CUDA process can only use at most one MIG device at a time (see here), so the limit should not be a problem.
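
For example, a minimal standalone sketch of pinning one process to a subset of devices (plain CUDA runtime API rather than PyTorch code; assumes a multi-GPU POSIX machine):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Must be set before the first CUDA call in this process; afterwards
    // the runtime only enumerates the two listed physical devices.
    setenv("CUDA_VISIBLE_DEVICES", "0,1", /*overwrite=*/1);

    int count = 0;
    cudaGetDeviceCount(&count);
    std::printf("devices visible to this process: %d\n", count);  // prints 2
    return 0;
}

Launching several such processes, each with a different CUDA_VISIBLE_DEVICES subset, sidesteps the per-process limit entirely.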

Nonetheless, I can prepare a PR to increase the limit.

Qubitium commented 11 months ago

I guess there are quite a few limitations that will prevent you from using more than 16 GPUs efficiently on a single node

Depending on the usage (inference or training), I would say a great percentage of users will never hit a PCIe bandwidth bottleneck due to the extra PCIe bridging/switching.

colesbury commented 11 months ago

@albanD - do you know why we have a hard-coded limit? Is it due to a limited number of bits in the C++ torch.device representation?

tringwald commented 11 months ago

@albanD - do you know why we have a hard-coded limit? Is it due to a limited number of bits in the C++ torch.device representation?

Might be because we preallocate some CUDAStreams and CuDNNStates. My PR linked above increases the limit to 64, which seems to compile and pass all tests without errors. However, I don't have a way to actually test it on a machine with 16+ GPUs (and I guess neither does the test pipeline).
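
The preallocation in question looks roughly like this (a simplified sketch with illustrative names like kStreamsPerPool and StreamState, not the actual c10 code):

#include <array>

#define C10_COMPILE_TIME_MAX_GPUS 64

constexpr int kStreamsPerPool = 32;        // illustrative pool size
struct StreamState { /* stream handle, flags, ... */ };

// One stream pool per possible device, sized at compile time. Raising
// the macro grows these static arrays in every build, even on
// single-GPU machines, which is the overhead being discussed.
static std::array<std::array<StreamState, kStreamsPerPool>,
                  C10_COMPILE_TIME_MAX_GPUS>
    stream_pools;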

Qubitium commented 11 months ago

It would also be prudent to set it beyond 64. Current NVIDIA-sanctioned 8x A100/H100 systems can MIG each GPU into 7 slices, resulting in 56. Some of the 8x A100/H100 chassis have 11 PCIe slots, for a potential maximum of 11 A100s x 7 = 77 slices in 2023, without any PLX switching. With PLX switch expansion, that number can be far exceeded. Unless there is a physical CUDA bit limit that NVIDIA needs to fix, this should be set to something very large, such as 1024/2048, for future-proofing.

tringwald commented 11 months ago

It would also be prudent to set it beyond 64. Current NVIDIA-sanctioned 8x A100/H100 systems can MIG each GPU into 7 slices, resulting in 56. Some of the 8x A100/H100 chassis have 11 PCIe slots, for a potential maximum of 11 A100s x 7 = 77 slices in 2023, without any PLX switching. With PLX switch expansion, that number can be far exceeded. Unless there is a physical CUDA bit limit that NVIDIA needs to fix, this should be set to something very large, such as 1024/2048, for future-proofing.

Again, a CUDA process can only use a single MIG device right now:

ubuntu@hostname:~$ nvidia-smi 
Sat Dec  9 11:56:27 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                   On |
| N/A   33C    P0    48W / 400W |     24MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   4  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   6  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
>>> import torch
>>> torch.cuda.device_count()
1

If the overhead of the statically compiled value is negligible, we can surely increase it further.

albanD commented 11 months ago

On the PyTorch side, this hard-coded limit is not really required beyond making global state handling simpler (we can just use static arrays instead of having to grow them dynamically). Caffe2 seems to do more complex logic with this (in particular, to register nccl ops), so the limit might have a bigger impact there.

I think that bumping it to 64 should be pretty safe, and we can do that. I'm a bit wary of going to very large values right now without knowing the details on the Caffe2 side.

albanD commented 11 months ago

Re-opening since the PR has been reverted

deke997 commented 9 months ago

Hi @tringwald @albanD

I am running into this issue again with the new cap of 64 GPUs. The TensorWave architecture currently supports 80 GPUs per node and plans to support 128 and 256 GPUs per node later this year.

Can we increase this value further to support this architecture and provide some future proofing?

tringwald commented 9 months ago

128 would be possible, I guess. However, our DeviceIndex is currently an int8_t, so everything above 255 would need some larger changes.
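
For context, the type in question is essentially the following (paraphrasing c10/core/Device.h):

#include <cstdint>

namespace c10 {
// An index identifying a specific device; -1 conventionally means
// "the current device". As a signed 8-bit type, the usable range is
// 0..127, so 128 real devices already overflows it; widening it to
// int16_t lifts the cap to 32767, but touches every packed struct
// that assumed a one-byte index.
using DeviceIndex = std::int8_t;
}  // namespace c10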

albanD commented 9 months ago

The change of device index to int16_t might be a bit of work. But sounds fair.

deke997 commented 9 months ago

If that change will take a significant amount of work, can we max out the int8_t value and merge that first and then work on the int16_t change after?

Thanks!

tringwald commented 9 months ago

I'm almost done with the int16_t changes. There's just one legacy JIT unit test that keeps failing on some ancient Python version ...

Qubitium commented 9 months ago

@tringwald Would it be possible to just use the new int16_t maximum of 32767 and forget about this problem for the next couple of years?

albanD commented 9 months ago

We have static objects sized by this value, so I don't think we want to bump it too much. It's not much of a problem to keep increasing it as available hardware gets updated.

deke997 commented 9 months ago

Hi @tringwald,

Given the new issues with the Meta-internal build, can we take the intermediate step of increasing the compile-time max GPUs to 255 with int8? We can then sort out the upgrade to int16 once we have feedback from Meta.

That would solve our short-term issue and buy more time for the long-term int16 solution.

Thanks!

Qubitium commented 9 months ago

@deke997 The problem is that a signed int8 only goes up to 127.

deke997 commented 9 months ago

https://github.com/pytorch/pytorch/issues/115331#issuecomment-1937002056

128 would be possible, I guess. However, our DeviceIndex is currently an int8_t, so everything above 255 would need some larger changes.

https://github.com/pytorch/pytorch/pull/119639#user-content-fn-1-b07af97368055281f2f3a106da5d0181

This field was unsigned, so I guess this has also been undefined behavior the whole time? Our default device index is -1, so it always wrapped around to 255 when written to the ArgumentInfo struct. When I switched DeviceIndex to int16_t, it actually stayed 255 after unpacking from ArgumentInfo again, as the DeviceIndex was now wide enough that it didn't wrap back to -1.
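
A minimal standalone illustration of the wraparound described in that footnote (PackedInfo here is a hypothetical stand-in for the real ArgumentInfo struct):

#include <cstdint>
#include <cstdio>

struct PackedInfo {
    unsigned int device : 8;  // unsigned 8-bit field, as in the footnote
};

int main() {
    std::int8_t default_index = -1;  // c10's "current device" sentinel
    PackedInfo info{};
    info.device = default_index;     // -1 wraps to 255 in the unsigned field

    // Unpacking into the old 8-bit DeviceIndex wrapped 255 back to -1 ...
    std::int8_t old_index = static_cast<std::int8_t>(info.device);    // -1
    // ... but a widened 16-bit DeviceIndex keeps 255, which no longer
    // compares equal to the -1 sentinel.
    std::int16_t new_index = static_cast<std::int16_t>(info.device);  // 255
    std::printf("old: %d, new: %d\n", old_index, new_index);
    return 0;
}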

tringwald commented 9 months ago

128 should be pretty straightforward, but DeviceIndex is a signed int8 right now, so using the full 255 indices + 1 default device would require some rework.

Qubitium commented 9 months ago

Looking at the code, I assume all numbers exposed to Python need to be signed integers, since Python itself has no unsigned int. In C/C++ you can cast between int and uint all you want, but since input/output to and from Python is signed, it should stay signed, especially given that the code already uses a default value of -1.

That said, using -1 in an unsigned context is just asking for trouble, as the footnote quoted above alludes to.

tringwald commented 9 months ago

Right now, we actually do not allow negative device indices from/to the Python side. Device index -1 is only a convention on the C++ side. But I agree, we should probably keep it signed nonetheless, as usage of -1 is very widespread and relying on integer underflow is just asking for trouble.

tringwald commented 8 months ago

Reopening until the int16_t DeviceIndex is merged.