pentschev opened this issue 2 years ago
So, suppose that there are 4 GPUs in the system, but CUDA_VISIBLE_DEVICES=1,3. Then:
from cuda import cuda
cuda.cuInit(0)
cuda.cuDeviceCanAccessPeer(0, 1)  # can (physical) device 1 talk to (physical) device 3?
=> (<CUresult.CUDA_SUCCESS: 0>, 1)  # second element 1 means peer access is possible
The following minimal example appears to work:
from cuda import cudart, cuda
import numpy as np
cuda.cuInit(0)
=> (<CUresult.CUDA_SUCCESS: 0>,)
cudart.cudaSetDevice(0)
=> (<cudaError_t.cudaSuccess: 0>,)
_, ptr0 = cudart.cudaMalloc(1024)  # allocation on device 0
srcbuf = np.ones(1024, dtype=np.uint8)
cudart.cudaMemcpy(ptr0, srcbuf.ctypes.data, 1024, cudart.cudaMemcpyKind.cudaMemcpyDefault)  # host -> device 0
cudart.cudaDeviceEnablePeerAccess(1, 0)  # let device 0 access device 1
cudart.cudaSetDevice(1)
_, ptr1 = cudart.cudaMalloc(1024)  # allocation on device 1
cudart.cudaDeviceEnablePeerAccess(0, 0)  # let device 1 access device 0
cudart.cudaMemcpy(ptr1, ptr0, 1024, cudart.cudaMemcpyKind.cudaMemcpyDefault)  # device 0 -> device 1
dstbuf = np.empty_like(srcbuf)
cudart.cudaMemcpy(dstbuf.ctypes.data, ptr1, 1024, cudart.cudaMemcpyKind.cudaMemcpyDefault)  # device 1 -> host
dstbuf
=> array([1, 1, 1, ..., 1, 1, 1], dtype=uint8)
But I don't know whether I should expect this to work without the peer access.
In the case above you're only using cudaMemcpy, which will work either way, even if it has to go through the host via PCIe. The original problem was with CUDA IPC only, which requires mapping IPC buffers between processes, and that is achieved with the cudaIpc* API.
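For reference, here's a rough sketch of that cudaIpc* flow with cuda-python (hedged illustration only: the two halves must run in different processes, and sending the 64-byte handle between them, e.g. over a pipe or socket, is left out):

# Process A: allocate on its device and export an IPC handle
from cuda import cudart
cudart.cudaSetDevice(0)
err, ptr = cudart.cudaMalloc(1024)
err, handle = cudart.cudaIpcGetMemHandle(ptr)  # opaque 64-byte handle
# ... send `handle` to process B over a pipe/socket ...

# Process B: import the handle and use the buffer as a regular device pointer
from cuda import cudart
cudart.cudaSetDevice(1)
err, peer_ptr = cudart.cudaIpcOpenMemHandle(handle, cudart.cudaIpcMemLazyEnablePeerAccess)
# peer_ptr can now be used as the source/destination of cudaMemcpy
cudart.cudaIpcCloseMemHandle(peer_ptr)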
The easiest way to test is actually to just test with UCX-Py between peers. Additionally, to confirm with Dask I would just bypass the cuda_visible_devices behavior of listing all devices and list only the current device instead, e.g., CUDA_VISIBLE_DEVICES="4" rather than CUDA_VISIBLE_DEVICES="4,5,6,7,0,1,2,3", and verify whether performance remains the same; if it does, we're probably good to get rid of that.
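For context, the rotated list above is what Dask-CUDA currently generates per worker; a simplified sketch of that rotation (illustrative only, not the actual dask_cuda.utils.cuda_visible_devices implementation):

def rotated_visible_devices(i, devices):
    # Rotate `devices` so that `devices[i]` comes first, e.g.
    # rotated_visible_devices(4, list(range(8))) -> "4,5,6,7,0,1,2,3"
    return ",".join(str(d) for d in devices[i:] + devices[:i])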
The easiest way to test is actually to just test with UCX-Py between peers.
IIUC:
# let both processes see both devices
$ CUDA_VISIBLE_DEVICES=0,1 python send-recv-core.py --reuse-alloc -d 0 -o cupy -b 0 --n-bytes 1GB --n-iter 10 --server-only -p 45000 &
$ CUDA_VISIBLE_DEVICES=0,1 python send-recv-core.py --reuse-alloc -e 1 -o cupy -c 1 --n-bytes 1GB --n-iter 10 --client-only -p 45000
Device(s) | 0, 1
================================================================================
Bandwidth (average) | 22.56 GiB/s
Bandwidth (median) | 22.58 GiB/s
Compared to:
# Only let each process see a single device
$ CUDA_VISIBLE_DEVICES=0 python send-recv-core.py --reuse-alloc -d 0 -o cupy -b 0 --n-bytes 1GB --n-iter 10 --server-only -p 45000 &
$ CUDA_VISIBLE_DEVICES=1 python send-recv-core.py --reuse-alloc -e 0 -o cupy -c 1 --n-bytes 1GB --n-iter 10 --client-only -p 45000
Device(s) | 0, 0
================================================================================
Bandwidth (average) | 22.56 GiB/s
Bandwidth (median) | 22.58 GiB/s
Yes, that looks correct. Were you able to confirm the same on Dask-CUDA with multiple workers?
If I do:
$ dask-scheduler --protocol ucx --scheduler-file foo.json &
$ for i in $(seq 0 7); do CUDA_VISIBLE_DEVICES=$i dask-cuda-worker --scheduler-file foo.json --protocol ucx & done
$ python local_cudf_merge.py --protocol ucx --runs 10 -c 100_000_000 --scheduler-file foo.json
...
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
1.25 s | 19.09 GiB/s
1.16 s | 20.47 GiB/s
937.36 ms | 25.44 GiB/s
1.13 s | 21.03 GiB/s
1.07 s | 22.26 GiB/s
1.14 s | 20.91 GiB/s
947.17 ms | 25.17 GiB/s
922.02 ms | 25.86 GiB/s
924.01 ms | 25.80 GiB/s
1.14 s | 20.85 GiB/s
================================================================================
Throughput | 22.42 GiB/s +/- 781.06 MiB/s
Bandwidth | 3.48 GiB/s +/- 378.12 MiB/s
Wall clock | 1.06 s +/- 114.38 ms
Compared to:
$ python local_cudf_merge.py --protocol ucx -d 0,1,2,3,4,5,6,7 -c 100_000_000 --runs 10
...
Wall clock | Throughput
--------------------------------------------------------------------------------
1.30 s | 18.37 GiB/s
957.49 ms | 24.90 GiB/s
1.15 s | 20.75 GiB/s
1.11 s | 21.50 GiB/s
1.13 s | 21.15 GiB/s
955.55 ms | 24.95 GiB/s
1.04 s | 22.85 GiB/s
1.14 s | 21.00 GiB/s
1.11 s | 21.57 GiB/s
1.15 s | 20.74 GiB/s
================================================================================
Throughput | 21.62 GiB/s +/- 603.04 MiB/s
Bandwidth | 3.68 GiB/s +/- 411.84 MiB/s
Wall clock | 1.10 s +/- 95.02 ms
It looks fine, considering the known instability of results between iterations. Just to have one last metric, could you also run both specifying UCX_TLS=^cuda_ipc?
With export UCX_TLS=^cuda_ipc:
$ dask-scheduler --protocol ucx --scheduler-file foo.json &
$ for i in $(seq 0 7); do CUDA_VISIBLE_DEVICES=$i dask-cuda-worker --scheduler-file foo.json --protocol ucx & done
$ python local_cudf_merge.py --protocol ucx --runs 10 -c 100_000_000 --scheduler-file foo.json
...
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
1.96 s | 12.18 GiB/s
1.67 s | 14.26 GiB/s
1.52 s | 15.68 GiB/s
1.70 s | 14.04 GiB/s
1.53 s | 15.63 GiB/s
1.76 s | 13.56 GiB/s
1.68 s | 14.18 GiB/s
1.52 s | 15.65 GiB/s
1.53 s | 15.60 GiB/s
1.80 s | 13.22 GiB/s
================================================================================
Throughput | 14.30 GiB/s +/- 387.27 MiB/s
Bandwidth | 858.90 MiB/s +/- 4.64 MiB/s
Wall clock | 1.67 s +/- 139.38 ms
$ python local_cudf_merge.py --protocol ucx -d 0,1,2,3,4,5,6,7 -c 100_000_000 --runs 10
...
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
1.61 s | 14.81 GiB/s
1.56 s | 15.25 GiB/s
1.51 s | 15.79 GiB/s
1.68 s | 14.20 GiB/s
1.53 s | 15.55 GiB/s
1.53 s | 15.63 GiB/s
1.69 s | 14.14 GiB/s
1.61 s | 14.81 GiB/s
1.65 s | 14.44 GiB/s
1.55 s | 15.39 GiB/s
================================================================================
Throughput | 14.98 GiB/s +/- 187.71 MiB/s
Bandwidth | 808.13 MiB/s +/- 3.34 MiB/s
Wall clock | 1.59 s +/- 61.59 ms
So those are also basically the same (lower overall throughput, which makes sense because there's no IPC).
Yup, agreed this looks right. It seems we indeed don't need CUDA_VISIBLE_DEVICES to list all GPUs; only the first one should suffice. Therefore, the idea I have in mind is for us to write a small Dask preload class similar to the CPU affinity one.
What we would do in that case is to call numba.cuda.current_context(devnum=N), where N is the index of the device the worker should use. However, it may not be that simple given numba.cuda.current_context() only seems to support a device number, and we need it to work with UUIDs as well, primarily for MIG support, but UUIDs are also useful to maintain device ordering between reboots. For what we're currently expected to support see: https://github.com/rapidsai/dask-cuda/blob/435dae8e2dabd07de3451ec2e4a0cb3f4efe0ec8/dask_cuda/utils.py#L473-L502
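As a purely hypothetical sketch of such a preload/worker plugin, modeled loosely on the CPU affinity one (the class name and registration are made up for illustration, and this does not yet handle the UUID/MIG cases discussed below):

import numba.cuda

class CUDADeviceSetup:
    # Illustrative worker plugin that makes the chosen device's CUDA
    # context current in the worker process before any GPU work runs.
    def __init__(self, device_index):
        self.device_index = device_index

    def setup(self, worker=None):
        # Creates (or retrieves) and activates the context for this device.
        numba.cuda.current_context(devnum=self.device_index)

Something like client.register_worker_plugin(CUDADeviceSetup(n)) would then run it once per worker.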
One potential solution is using cuda-python instead of Numba. A possible problem with that is how users' installations are expected to work with CUDA, given Dask-CUDA also supports PyPI. I'm not sure whether there are any obstacles to supporting cuda-python given it's a new-ish package; it's possible it is just expected to work in any environment where Numba and CuPy work as well, in which case we would be fine. If there are more limiting requirements, we may need to consider what other options exist to manage CUDA contexts without strictly depending on cuda-python, which I believe would mean working with Numba to add support for UUID-based context creation.
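For illustration, a hedged sketch of how cuda-python's driver API could map a UUID to a device and retain its primary context (error handling omitted; MIG UUIDs would need additional handling):

from cuda import cuda

def device_by_uuid(uuid_bytes):
    # Find the visible CUdevice whose UUID matches `uuid_bytes` (16 raw bytes).
    cuda.cuInit(0)
    _, count = cuda.cuDeviceGetCount()
    for i in range(count):
        _, dev = cuda.cuDeviceGet(i)
        _, dev_uuid = cuda.cuDeviceGetUuid(dev)
        if bytes(dev_uuid.bytes) == uuid_bytes:
            return dev
    raise ValueError("no visible CUDA device with that UUID")

def make_context_current(dev):
    # Retain the device's primary context and make it current in this process.
    _, ctx = cuda.cuDevicePrimaryCtxRetain(dev)
    cuda.cuCtxSetCurrent(ctx)
    return ctx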
What we would do in that case is to call numba.cuda.current_context(devnum=N), where N is the index of the device the worker should use. However, it may not be that simple given numba.cuda.current_context() only seems to support a device number, and we need it to work with UUIDs as well, primarily for MIG support, but UUIDs are also useful to maintain device ordering between reboots. For what we're currently expected to support see: https://github.com/rapidsai/dask-cuda/blob/435dae8e2dabd07de3451ec2e4a0cb3f4efe0ec8/dask_cuda/utils.py#L473-L502
I think this is easy because:
import numba.cuda
uuid2devnum = dict((dev.uuid, dev.id) for dev in numba.cuda.list_devices())
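For example (illustrative only, with target_uuid standing in for whatever UUID the worker would be configured with):

target_uuid = next(iter(uuid2devnum))  # placeholder: would come from worker configuration
numba.cuda.current_context(devnum=uuid2devnum[target_uuid])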
Once we decide on a replacement, it would be good to solicit feedback from other users to make sure it still satisfies their use cases
I think this is easy because:
import numba.cuda
uuid2devnum = dict((dev.uuid, dev.id) for dev in numba.cuda.list_devices())
Sorry, I missed this, but I'd be skeptical that this really works for MIG; it's definitely something we would need to try out.
Also, I suggested using a preload script, but I was thinking about it this morning and I remembered this is not going to work. The problem is that Dask initializes comms before preload scripts (see https://github.com/dask/distributed/pull/3193), and we need the CUDA context to exist before UCX is initialized; this is the reason we have https://github.com/dask/distributed/blob/a53858a5fc73b591da5a6ceff903a468990f8a21/distributed/comm/ucx.py#L94-L123
So we will still need to keep context initialization somewhere within comms or before it, which I think we can't do without some sort of monkey patching, so it is likely preferable to keep that within UCX initialization.
Once we decide on a replacement, it would be good to solicit feedback from other users to make sure it still satisfies their use cases
Do you have a specific use case in mind, John? We definitely need to test this well, but I'm interested in who the other users are that may have problems with it, given one of the purposes of Dask-CUDA is indeed handling the CUDA context so that the user doesn't need to.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Hi, if this feature is planned for a future release, it would be great if there were a method abstracted by Dask for downstream projects to obtain the GPU ordinal. Currently, XGBoost uses device 0 as the default for Dask, which works only if CUDA_VISIBLE_DEVICES has the correct permutation. We can support an explicit device ordinal when needed, but it would be great to have some well-defined method to obtain such information.
The reason we're not able to move forward on this is that all 3rd-party libraries respect CUDA_VISIBLE_DEVICES, but unfortunately there's no unique API respected by all that would allow Dask to interact safely with such libraries. Our best shot is that eventually cuda-python will become the de-facto Python API for all projects, and then we could resolve this without requiring environment variables. Until then, this will probably need to remain as it is, unfortunately.
I made some progress (with the goal of using cuda-python); unfortunately that exposed some bugs in cuda-python, which are fixed (but not yet in a version that we can use, for reasons). Once that is available, we will try again, and then any call into the CUDA runtime API querying the current device will return a consistent view of the world.
Although we would be able to provide a wrapper for that query, I would rather just depend on cuda-python and let downstream packages query the CUDA runtime (via cuda-python, CuPy, or Numba).
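For instance, once a consistent context/device is current in the worker, all of these should agree (a rough illustration of the equivalent queries):

from cuda import cudart
import cupy
import numba.cuda

_, dev = cudart.cudaGetDevice()                  # cuda-python runtime API
dev_cupy = cupy.cuda.runtime.getDevice()         # CuPy's runtime wrapper
dev_numba = numba.cuda.get_current_device().id   # Numba's view of the active context
assert dev == dev_cupy == dev_numba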
That's great news!
I think it's still desirable for Dask to provide an abstract method for downstream projects, since Dask excels at resource management as a distributed framework. We don't want to obtain resource information ourselves through indirect dependencies (like cuda-python and CuPy).
The primary reason the CUDA_VISIBLE_DEVICES environment variable was used in the past was that the process needed to know of other GPUs for CUDA IPC to work correctly, so we passed that as an environment variable to Nanny, which then sets os.environ before spawning each new worker process. That tends to become problematic, see https://github.com/dask/distributed/issues/3682 and https://github.com/dask/distributed/pull/6681. I've heard diverging info in the recent past as to whether the CUDA IPC limitation still applies or not, but it is definitely worth testing to verify.

Assuming CUDA IPC has no problem with that anymore, we could get rid of passing the CUDA_VISIBLE_DEVICES environment variable to Nanny and instead run cudaSetDevice() as a Dask preload script.
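A minimal sketch of what that preload could look like (dask_setup is the standard Dask preload hook; the WORKER_CUDA_DEVICE variable and the rest are made up for illustration, and, as discussed above, it only helps if it runs before UCX comms are initialized):

# set_cuda_device.py -- hypothetical worker preload
import os
from cuda import cudart

def dask_setup(worker):
    # Illustrative: pick the device index from some per-worker configuration.
    device = int(os.environ.get("WORKER_CUDA_DEVICE", "0"))
    cudart.cudaSetDevice(device)
    cudart.cudaFree(0)  # touch the runtime so the context is actually created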