Adding the dask reports for ucx/tcp:
The notable issue in the task graph is that for TCP there is more communication but the workers are fully engaged (no white space):
For UCX, there is less communication but workers are not actively working on a particular task (white space):
Thanks Vibhu for filing and Ben for including the profiling plot.
It looks like the read time on the worker is taking ~1.5x longer in the UCX case than the TCP case. This may be helped by releasing the GIL during these operations. ( https://github.com/rapidsai/ucx-py/pull/391 )
We are also spending a lot of time generating the __cuda_array_interface__ in Numba. Not sure why that is the case. Will look at this more closely.
We are also spending more time working in the UCX report (3-4x the time). I'm going to be raising a similar issue with a reproducible example for groupby-aggregation behavior today.
Ah was looking at the gaps initially based on Ben's comment. Should I be looking at something else?
I just wanted to highlight that the actual time spent doing compute is materially different, which may or may not be related, but also naively feels like a potential contributor to the slowdown.
Meaning the overall runtime? If so, then I think Ben has the right idea to look at gaps. If not, some clarity on the compute time you are looking at would be helpful 🙂
Yes, sorry for being unclear 😄
From the worker profile in the linked performance report, we spend 130 total seconds doing "compute" with UCX on across the 16 GPUs. We only spend 40 total seconds with TCP across the 16 GPUs.
For example, _concat takes considerably longer with UCX, vs 13 seconds with TCP; shuffle_group shows a similar difference.
Another interesting thing to note is that deletion of dependencies takes quite a bit of time in UCX compared with TCP. This can be seen in the Worker Profile (administrative) page:
UCX
TCP
No worries. Thanks Nick! Just wanted to make sure we are focusing on the pain point 😉
Yeah that's what I was looking at above, Ben. Agree this is what we should focus on.
I want to note that these issues could be a symptom of poor work stealing; we are still digging into this idea.
That may be true.
There is also some cost in building Numba arrays, which we are doing. I've done some work to speed this up. Am still trying to figure out the right way to hook this into what we are doing here.
In particular please see this rough benchmark.
Here are some timings I gathered with this code:
TCP
Create time: 10.193811178207397
Merge time: 15.630461931228638
Create time: 10.504480838775635
Merge time: 15.304280996322632
Create time: 10.684969425201416
Merge time: 16.014200448989868
UCX (Master):
Create time: 10.919841766357422
Merge time: 22.28807020187378
Create time: 11.757182836532593
Merge time: 23.014573097229004
Create time: 11.073307514190674
Merge time: 22.2984356880188
UCX (v1.7.0):
Create time: 22.65068030357361
Merge time: 45.892472982406616
Create time: 21.901089429855347
Merge time: 44.46129822731018
Create time: 22.363646984100342
Merge time: 44.13629865646362
UCX (v1.7.0 + https://github.com/openucx/ucx/pull/4646):
Create time: 10.750166177749634
Merge time: 23.12590527534485
Create time: 11.505124568939209
Merge time: 22.653675079345703
Create time: 11.365516662597656
Merge time: 20.664494276046753
As we can see above, https://github.com/openucx/ucx/pull/4646 helps (and if cherry-picked to v1.7.0, matches performance of master) but TCP still outperforms UCX.
@quasiben and I explored the __delitem__ that was mentioned before, and as a rough test we ended up commenting it out to see if that would affect performance, but it turns out that it doesn't change performance at all.
In looking more thoroughly through the worker profile there seems to be a fair amount of time managing cuda contexts when using UCX:
@kkraus14 do you happen to have any insights here ?
Yeah that's what I was trying to get at with this comment.
Am still trying to find the right way to hook Dask into higher level RMM/dask-cuda. PR ( https://github.com/rapidsai/rmm/pull/264 ) is a start at that.
Also after a fairly long rabbit hole earlier today, I think rmm.auto_device may be causing some of the problems here. Basically it might be copying data unnecessarily ( https://github.com/rapidsai/rmm/issues/265 ). It's mostly unused except for this one deserialization line in cuDF when using strings (probably affects us here). Keith and I are trying to remove that currently ( https://github.com/rapidsai/cudf/pull/4003 ).
thanks @jakirkham for looking into this
Have added PR ( https://github.com/rapidsai/rmm/pull/268 ) to RMM. This should make it easier to directly copy from a device pointer to a host buffer without going through a bunch of Numba machinery first. Am hoping this will be useful in cuDF and dask-cuda where we can benefit from removing some of this overhead.
I set up two dask-cuda-workers manually and ran with nvprof to better understand what was happening in CUDA land:
==11522== Profiling application: /datasets/bzaitlen/miniconda3/envs/rapidsai-latest/bin/python /datasets/bzaitlen/miniconda3/envs/rapidsai-latest/bin/dask-cuda-worker ucx://10.33.227.163:8786 --enable-nvlink --enable-tcp-over-ucx
==11522== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 98.50% 324.77ms 1 324.77ms 324.77ms 324.77ms cuDevicePrimaryCtxRetain
0.90% 2.9561ms 97 30.474us 138ns 1.2527ms cuDeviceGetAttribute
0.47% 1.5458ms 1 1.5458ms 1.5458ms 1.5458ms cuDeviceTotalMem
0.08% 274.87us 2 137.43us 128.12us 146.75us cuDeviceGetName
0.05% 162.31us 1 162.31us 162.31us 162.31us cuMemGetInfo
0.00% 4.4480us 6 741ns 182ns 1.7090us cuDeviceGetCount
0.00% 3.9780us 1 3.9780us 3.9780us 3.9780us cuDeviceGetPCIBusId
0.00% 2.3980us 1 2.3980us 2.3980us 2.3980us cuCtxPushCurrent
0.00% 1.7690us 1 1.7690us 1.7690us 1.7690us cuInit
0.00% 1.7600us 4 440ns 150ns 799ns cudaGetDeviceCount
0.00% 1.6120us 3 537ns 480ns 611ns cuDeviceGet
0.00% 1.0650us 1 1.0650us 1.0650us 1.0650us cuDriverGetVersion
0.00% 951ns 1 951ns 951ns 951ns cuCtxGetCurrent
0.00% 742ns 1 742ns 742ns 742ns cuDeviceComputeCapability
0.00% 251ns 1 251ns 251ns 251ns cuDeviceGetUuid
We are spending a lot of time with cuDevicePrimaryCtxRetain; this is happening within Numba and we can see it in the worker profile as well:
I believe this is known to both @jakirkham and @kkraus14
Thanks for doing that Ben! 😄
This coming from Numba makes sense. Though it's nice to have the additional detail provided here.
Just to make sure we are not missing anything, do we know of any other situations (outside of Numba array creation) where cuDevicePrimaryCtxRetain would be called?
Edit: Should add we are hoping PR ( https://github.com/rapidsai/rmm/pull/268 ) paves a short path for us to dispense with this overhead.
I checked cudf/rmm/ucx and I didn't see cuDevicePrimaryCtxRetain anywhere except Numba. It's worth noting that at this point we already should have a context created from the initialization step in dask-cuda. I also want to point out that the code lower in the stack comes from these two locations:
https://github.com/rapidsai/rmm/blob/branch-0.12/python/rmm/rmm.py#L142-L164
which calls:
https://github.com/rapidsai/rmm/blob/2e309eb37d2928db16ba5413fd25c9636e20c80f/python/rmm/rmm.py#L31
FWIW I tried adding __cuda_array_interface__ to Buffer and coercing Buffer objects to arrays as needed with PR ( https://github.com/rapidsai/cudf/pull/4023 ). This may help and is probably worth trying (assuming CI passes, though I can fix that tomorrow). It's probably a useful change for building other things in cuDF in the future (feedback on this is welcome).
That said, I wouldn't be surprised if we still see a fair amount of time spent in Numba due to legacy RMM allocation methods. So am planning on pushing on PR ( https://github.com/rapidsai/rmm/pull/268 ) more tomorrow in the hopes of quickly integrating this into cuDF and dask-cuda for further testing.
One thing that confuses me about the flame graph above (and hopefully someone can help answer this 🙂): it appears that creating an rmm.DeviceBuffer is taking a lot of time. However these are generally very fast to construct. In fact it's much faster than anything else that we might use to allocate device memory (please see benchmark below). Given this, why does it appear that line is taking a long time? 🤔
In [1]: import rmm
In [2]: rmm.reinitialize(pool_allocator=True,
...: initial_pool_size=int(2 * 2**30))
Out[2]: 0
In [3]: %timeit rmm.DeviceBuffer(size=50_000_000)
360 ns ± 1.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: import cupy
In [5]: %timeit cupy.empty((50_000_000,), dtype="u1")
2.15 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: import numba.cuda
In [7]: %timeit numba.cuda.device_array((50_000_000,), dtype="u1")
1.39 ms ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Of course if the pool were not actually enabled, that would present a problem.
In [1]: import rmm
In [2]: rmm.reinitialize(pool_allocator=False)
Out[2]: 0
In [3]: %timeit rmm.DeviceBuffer(size=50_000_000)
1.12 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that profiling dask-cuda-worker with nvprof will only gather information from the parent process. Internally, dask-cuda-worker uses Nanny, and that will fork/spawn new processes which are not caught by nvprof. Also note that the output from https://github.com/rapidsai/ucx-py/issues/402#issuecomment-580533448 doesn't have any calls to cudaMalloc nor any kernel launches, which hints that that process is not doing any useful CUDA work. Finally, as already mentioned, there's only a single call to cuDevicePrimaryCtxRetain, but we would be more interested in seeing a number of calls to the same function during runtime, with a probable accumulated time on the order of seconds that could be causing a slowdown.
As for rmm.DeviceBuffer, I was able yesterday to confirm the same numbers from Ben's flame graph. Apart from that, I can confirm with 110% certainty that we do have the RMM pool enabled; one of the ways I can confirm that is the total time spent in rmm.DeviceBuffer:
- TCP: 332s
- UCX: 287s
@quasiben also mentioned to me that we had ~81k calls to rmm.DeviceBuffer in that flamechart. I believe the pool allocator of RMM (cnmem) is known to have performance issues with freeing memory once there's a certain number of allocations being managed. As a quick test it may be worth testing what happens when you create 10/100k allocations, store them somewhere, and then see how long it takes to delete the references.
Wrote a quick test:
import rmm
import time
rmm.reinitialize(pool_allocator=True, initial_pool_size=int(2**34))
num_allocations = 10000
allocations = [None] * num_allocations
start = time.time()
for i in range(num_allocations):
allocations[i] = rmm.DeviceBuffer(size=100)
end = time.time()
time_result = (end - start) * 1000
print(f"Time taken for allocating {num_allocations} buffers: {time_result}ms")
start = time.time()
for i in range(num_allocations):
allocations[i] = None
end = time.time()
time_result = (end - start) * 1000
print(f"Time taken for freeing {num_allocations} buffers: {time_result}ms")
Results:
Time taken for allocating 100000 buffers: 44.32559013366699ms
Time taken for freeing 100000 buffers: 22098.9887714386ms
Time taken for allocating 50000 buffers: 23.27561378479004ms
Time taken for freeing 50000 buffers: 5765.538692474365ms
Time taken for allocating 25000 buffers: 11.168956756591797ms
Time taken for freeing 25000 buffers: 1489.1653060913086ms
Time taken for allocating 10000 buffers: 5.175113677978516ms
Time taken for freeing 10000 buffers: 360.734224319458ms
Note that profiling dask-cuda-worker with nvprof will only gather information from the parent process. Internally, dask-cuda-worker uses Nanny, and that will fork/spawn new processes which are not caught by nvprof. Also note that the output from #402 (comment) doesn't have any calls to cudaMalloc nor any kernel launches, which hints that that process is not doing any useful CUDA work. Finally, as already mentioned, there's only a single call to cuDevicePrimaryCtxRetain, but we would be more interested in seeing a number of calls to the same function during runtime, with a probable accumulated time on the order of seconds that could be causing a slowdown.
IDK if this would help, but one could add a sitecustomize.py in the site-packages directory of the Conda environment and add custom Python code that will be run by each interpreter at startup. This could allow us to gather and dump a bunch of profiling statistics from all interpreters in some common directory. Though I don't know if one can configure nvprof to inspect an existing process. If so, there is probably some ugly ctypes code we could write to enable profiling that process.
Barring that, one might rename the pythonX.Y executable to something like pythonX.Y-orig and add a shell script that starts the Python interpreter with nvprof. Would probably also need to fix all symlinks (like python, pythonX) to point at this script. Not sure how well this would work with forking, but it could be another way to profile the cluster of processes.
After a lot more profiling I was able to pinpoint the primary issue with workflows such as this one, see image below:
What happens here is that UCX is not always transferring data over NVLink, but over TCP (which incurs DtoH and HtoD copies); even worse, it breaks each transfer down into segments of 8KB, forcing a copy plus stream synchronization many times, making that process extremely slow. The memory copy stats look as follows:
CUDA Memory Operation Statistics (nanoseconds)
Time(%) Total Time Operations Average Minimum Maximum Name
------- -------------- ---------- -------------- -------------- -------------- --------------------------------------------------------------------------------
55.6 8407106705 2520239 3335.8 1760 4781363 [CUDA memcpy HtoD]
42.8 6466457263 2578038 2508.3 1120 204227 [CUDA memcpy DtoH]
1.4 207633644 94986 2185.9 1280 24224 [CUDA memcpy DtoD]
0.3 38804550 1279 30339.8 7104 441828 [CUDA memcpy PtoP]
0.0 1577193 719 2193.6 1824 3456 [CUDA memset]
CUDA Memory Operation Statistics (KiB)
Total Operations Average Minimum Maximum Name
------------------- -------------- ------------------- ----------------- ------------------- --------------------------------------------------------------------------------
933.324 719 1.298 0.004 1.500 [CUDA memset]
4699152.953 1279 3674.084 662.445 4469.922 [CUDA memcpy PtoP]
28620716.464 2520239 11.356 0.001 4469.922 [CUDA memcpy HtoD]
28080943.471 94986 295.632 0.004 4469.922 [CUDA memcpy DtoD]
19852018.817 2578038 7.700 0.001 294.039 [CUDA memcpy DtoH]
As can be noticed, only 0.3% of all time spent transferring memory is happening over NVLink (as seen by PtoP). There's also 1.4% of transfer time happening in DtoD (AFAIK, that means transfers are going from one device to another via host, i.e., DtoH on the source device + HtoD on the target device), and it's not clear to me why that happens given this is running on a DGX-2 and all devices should have an NVLink connection due to the NVSwitch.
The TCP segment size is configured via UCX_TCP_TX_SEG_SIZE (default for this is 8KB, as mentioned previously) and UCX_TCP_RX_SEG_SIZE, and I confirmed that increasing those reduces the number of copies+synchronizations, with an 8MB size reducing merge compute time from ~22 seconds to ~16 seconds (still marginally slower than regular Python sockets). When the segment sizes are increased I noticed that no copies occur in PtoP (meaning 100% of transfers are going over TCP), as seen below:
CUDA Memory Operation Statistics (nanoseconds)
Time(%) Total Time Operations Average Minimum Maximum Name
------- -------------- ---------- -------------- -------------- -------------- --------------------------------------------------------------------------------
60.4 3695125675 63249 58421.9 1760 3222347 [CUDA memcpy HtoD]
37.4 2291190454 92884 24667.2 1376 2480377 [CUDA memcpy DtoH]
2.2 132241437 54952 2406.5 1280 32032 [CUDA memcpy DtoD]
0.0 364322 160 2277.0 1792 9728 [CUDA memset]
CUDA Memory Operation Statistics (KiB)
Total Operations Average Minimum Maximum Name
------------------- -------------- ------------------- ----------------- ------------------- --------------------------------------------------------------------------------
25365207.218 63249 401.037 0.001 4469.922 [CUDA memcpy HtoD]
18847178.518 54952 342.975 0.004 4469.922 [CUDA memcpy DtoD]
16605241.842 92884 178.774 0.001 4469.922 [CUDA memcpy DtoH]
162.906 160 1.018 0.004 1.500 [CUDA memset]
I was not yet able to find how UCX determines whether a transfer should go over TCP or over NVLink, but I expected that for the example being discussed in this thread 100% of them would go over NVLink, given that we're not reading from or writing to host explicitly.
I will continue to look for answers here, and I was wondering if @Akshay-Venkatesh perhaps has an idea if we're doing something wrong or how we could better configure things in case this is a misconfiguration issue.
@pentschev Thanks for the details
What happens here is UCX is not always transferring data over NVLink, but over TCP (which incurs in DtoH and HtoD copies), even worse it breaks down each transfer into segments of 8KB, forcing a copy+stream synchronization for many times, making that process extremely slow. The stats of memory copy looks as follows:
This would happen if it's not possible to use the cuda_ipc transport. This could mean a couple of things -- no NVLINK or no PCIe-based peer accessibility; or it could mean that the memory handle obtained from the peer could not be opened at the mapping end for some reason. Some trace reports would help to see if we're falling in the latter category. The 8KB D->H and H->D transfers are part of the fallback rendezvous protocol in UCP (i.e. when a high performance path like NVLINK or RDMA fails). Why this occurs on a DGX-2 is something that needs to be found out.
I was not able yet to find how UCX is determining whether a transfer should go over TCP or over NVLink, but I expected that for the example being discussed in this thread 100% of them would go over NVLink, given that we're not reading from or writing to host explicitly.
I expect 100% to go over cuda_ipc as well, and you should be seeing PtoP traffic predominantly.
Were you able to repro this result in a standalone benchmark by any chance?
2-process ucx_pertest on current master seems to be working as expected. Correct me if I'm making assumptions.
==297280== NVPROF is profiling process 297280, command: ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
==292777== NVPROF is profiling process 292777, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
100 0.000 31.931 31.931 29866.35 29866.35 31317 31317
==292777== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
==292777== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 82.86% 443.37ms 2 221.68ms 8.3250us 443.36ms cudaMalloc
12.55% 67.171ms 1552 43.279us 103ns 2.9163ms cuDeviceGetAttribute
3.40% 18.187ms 16 1.1367ms 989.58us 1.1752ms cuDeviceTotalMem
0.97% 5.1798ms 16 323.74us 269.01us 335.27us cuDeviceGetName
0.18% 952.45us 2 476.23us 31.381us 921.07us cudaFree
0.01% 30.926us 110 281ns 225ns 3.8260us cuIpcGetMemHandle
0.01% 30.168us 110 274ns 224ns 2.1050us cuPointerGetAttributes
0.01% 28.661us 112 255ns 215ns 1.2740us cuMemGetAddressRange
0.00% 24.829us 112 221ns 183ns 2.3250us cuPointerSetAttribute
0.00% 22.185us 126 176ns 140ns 395ns cuDeviceGetUuid
0.00% 21.645us 16 1.3520us 726ns 5.9870us cuDeviceGetPCIBusId
0.00% 18.064us 112 161ns 151ns 255ns cuCtxGetDevice
0.00% 5.8220us 32 181ns 103ns 390ns cuDeviceGet
0.00% 5.4800us 1 5.4800us 5.4800us 5.4800us cudaSetDevice
0.00% 4.5580us 7 651ns 167ns 2.4310us cudaGetDeviceCount
0.00% 3.2320us 5 646ns 231ns 1.3340us cuDeviceGetCount
0.00% 798ns 2 399ns 192ns 606ns cuCtxGetCurrent
0.00% 692ns 2 346ns 181ns 511ns cuDevicePrimaryCtxGetState
==297280== Profiling application: ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
==297280== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.0293ms 110 9.3570us 9.1520us 10.624us [CUDA memcpy PtoP]
API calls: 81.96% 445.53ms 2 222.76ms 8.5990us 445.52ms cudaMalloc
12.80% 69.566ms 1552 44.823us 115ns 2.9094ms cuDeviceGetAttribute
3.33% 18.079ms 16 1.1299ms 857.21us 1.1895ms cuDeviceTotalMem
0.97% 5.2471ms 16 327.94us 278.28us 350.49us cuDeviceGetName
0.23% 1.2261ms 2593 472ns 371ns 5.7110us cuEventQuery
0.20% 1.0700ms 2 534.99us 30.002us 1.0400ms cudaFree
0.15% 793.71us 1 793.71us 793.71us 793.71us cuIpcOpenMemHandle
0.10% 551.29us 1 551.29us 551.29us 551.29us cuIpcCloseMemHandle
0.09% 498.03us 16 31.126us 1.7630us 455.24us cuStreamCreate
0.09% 471.86us 110 4.2890us 3.7760us 23.013us cuMemcpyDtoDAsync
0.02% 85.216us 16 5.3260us 4.5490us 13.059us cuStreamDestroy
0.02% 82.512us 460 179ns 157ns 474ns cuCtxGetDevice
0.01% 53.665us 110 487ns 424ns 3.0990us cuEventRecord
0.01% 40.689us 128 317ns 281ns 902ns cuEventCreate
0.01% 39.363us 110 357ns 326ns 858ns cuIpcGetMemHandle
0.01% 38.705us 112 345ns 301ns 2.7210us cuMemGetAddressRange
0.01% 36.570us 110 332ns 279ns 2.4450us cuPointerGetAttributes
0.01% 31.836us 128 248ns 229ns 797ns cuEventDestroy
0.01% 30.781us 112 274ns 244ns 2.1880us cuPointerSetAttribute
0.00% 23.897us 115 207ns 166ns 2.0950us cuDeviceGetCount
0.00% 23.796us 130 183ns 166ns 683ns cuDevicePrimaryCtxGetState
0.00% 23.452us 16 1.4650us 819ns 6.1190us cuDeviceGetPCIBusId
0.00% 20.676us 130 159ns 153ns 427ns cuCtxGetCurrent
0.00% 6.9260us 1 6.9260us 6.9260us 6.9260us cudaSetDevice
0.00% 6.9130us 32 216ns 132ns 405ns cuDeviceGet
0.00% 5.3340us 7 762ns 162ns 3.2010us cudaGetDeviceCount
0.00% 2.8440us 16 177ns 153ns 231ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
Performance looks reasonable as well:
[akvenkatesh@dgx2-03 build]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000 &
[2] 35613
[akvenkatesh@dgx2-03 build]$ [1580764116.835883] [dgx2-03:35613:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 40 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 1000000 |
+------------------------------------------------------------------------------------------+
100 0.000 21.999 21.999 43351.04 43351.04 45457 45457
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
[akvenkatesh@dgx2-03 build]$
[akvenkatesh@dgx2-03 build]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000 &
[1] 37281
[akvenkatesh@dgx2-03 build]$ [1580764126.139641] [dgx2-03:37281:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 40 cpus). Performance may be impacted.
Waiting for connection...
[akvenkatesh@dgx2-03 build]$
[akvenkatesh@dgx2-03 build]$
[akvenkatesh@dgx2-03 build]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 10000000 &
[2] 38365
[akvenkatesh@dgx2-03 build]$ [1580764132.739735] [dgx2-03:38365:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 40 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+------------------------------------------------------------------------------------------+
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
| API: protocol layer |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 10000000 |
+------------------------------------------------------------------------------------------+
100 0.000 84.240 84.240 113208.62 113208.62 11871 11871
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 10000000
Will update with an alltoall benchmark to see if something breaks maybe with 16 GPUs.
Alltoall looks to run as expected as well:
mpirun -np 16 -host dgx2-03 --oversubscribe -npernode 16 -x LD_LIBRARY_PATH --mca pml ucx --mca btl ^openib,smcuda -x UCX_MEMTYPE_CACHE=n -x UCX_RNDV_THRESH=8192 -x UCX_TLS=rc,cuda_copy,cuda_ipc,mm ./get_local_ompi_rank `which nvprof` ./mpi/collective/osu_alltoall -m 1048576:1048576 -d cuda &> dump
[akvenkatesh@dgx2-03 osu-micro-benchmarks-5.6.2]$ cat dump | grep -r "CUDA memcpy"
dump: GPU activities: 91.49% 16.768ms 1650 10.162us 9.6000us 11.585us [CUDA memcpy PtoP]
dump: 8.49% 1.5555ms 880 1.7670us 1.4720us 9.2480us [CUDA memcpy DtoD]
dump: GPU activities: 91.44% 16.779ms 1650 10.169us 9.6000us 11.136us [CUDA memcpy PtoP]
dump: 8.54% 1.5661ms 880 1.7790us 1.4720us 10.176us [CUDA memcpy DtoD]
dump: GPU activities: 91.49% 16.739ms 1650 10.145us 9.5680us 10.880us [CUDA memcpy PtoP]
dump: 8.49% 1.5527ms 880 1.7640us 1.4720us 2.6880us [CUDA memcpy DtoD]
dump: GPU activities: 91.43% 16.744ms 1650 10.147us 9.5050us 11.040us [CUDA memcpy PtoP]
dump: 8.55% 1.5653ms 880 1.7780us 1.4720us 9.4080us [CUDA memcpy DtoD]
dump: GPU activities: 91.54% 16.886ms 1650 10.233us 9.6320us 18.016us [CUDA memcpy PtoP]
dump: 8.44% 1.5575ms 880 1.7690us 1.4720us 9.0240us [CUDA memcpy DtoD]
dump: GPU activities: 91.48% 16.855ms 1650 10.215us 9.6010us 10.944us [CUDA memcpy PtoP]
dump: GPU activities: 91.49% 16.769ms 1650 10.163us 9.5680us 21.504us [CUDA memcpy PtoP]
dump: 8.50% 1.5658ms 880 1.7790us 1.5040us 2.4960us [CUDA memcpy DtoD]
dump: GPU activities: 91.56% 16.820ms 1650 10.193us 9.5680us 11.488us [CUDA memcpy PtoP]
dump: 8.42% 1.5463ms 880 1.7570us 1.4720us 8.7040us [CUDA memcpy DtoD]
dump: 8.49% 1.5555ms 880 1.7670us 1.4720us 9.7920us [CUDA memcpy DtoD]
dump: GPU activities: 91.50% 16.833ms 1650 10.201us 9.5680us 11.200us [CUDA memcpy PtoP]
dump: 8.48% 1.5597ms 880 1.7720us 1.4720us 14.945us [CUDA memcpy DtoD]
dump: GPU activities: 91.56% 16.879ms 1650 10.229us 9.6320us 11.872us [CUDA memcpy PtoP]
dump: 8.41% 1.5512ms 880 1.7620us 1.4720us 2.7200us [CUDA memcpy DtoD]
dump: GPU activities: 91.51% 16.868ms 1650 10.223us 9.6640us 21.696us [CUDA memcpy PtoP]
dump: GPU activities: 91.51% 16.771ms 1650 10.164us 9.5680us 11.040us [CUDA memcpy PtoP]
dump: 8.47% 1.5522ms 880 1.7630us 1.4720us 9.6970us [CUDA memcpy DtoD]
dump: 8.47% 1.5611ms 880 1.7730us 1.4720us 14.144us [CUDA memcpy DtoD]
dump: GPU activities: 91.52% 16.835ms 1650 10.202us 9.6000us 11.104us [CUDA memcpy PtoP]
dump: 8.46% 1.5562ms 880 1.7680us 1.5040us 2.9760us [CUDA memcpy DtoD]
dump: GPU activities: 91.53% 16.754ms 1650 10.153us 9.4720us 11.168us [CUDA memcpy PtoP]
dump: 8.45% 1.5464ms 880 1.7570us 1.4720us 10.112us [CUDA memcpy DtoD]
dump: GPU activities: 91.44% 16.749ms 1650 10.151us 9.5360us 11.104us [CUDA memcpy PtoP]
dump: 8.54% 1.5651ms 880 1.7780us 1.4720us 9.7920us [CUDA memcpy DtoD]
dump: GPU activities: 91.57% 16.822ms 1650 10.195us 9.6320us 17.888us [CUDA memcpy PtoP]
dump: 8.41% 1.5455ms 880 1.7560us 1.4720us 2.7200us [CUDA memcpy DtoD]
Some trace reports would help to see if we're falling in the latter category.
I've been trying to go through the logs today, but there's a lot there I don't quite understand yet, do you have any hints on what we could be looking for?
As for reproducers, the easiest one is to use ucx_perftest, but it requires a small enough transfer size. For example, running with a size of 1MB I will always see PtoP, but when running with a reasonably small size (100KB or less), I will always see DtoH/HtoD. I was also able to reproduce this with a simple UCX-Py server/client code, but sizes were different then, with 40KB and up going PtoP and 20KB or less going DtoH/HtoD (didn't test anything in between 20 and 40KB). If you're interested I can share the Python code, but it seems like running with ucx_perftest with small enough transfers should have the same effect. It's still a mystery to me why such sizes vary from one application to another though. The tests I described here were all performed on a DGX-1 using only GPUs 0 and 1.
Some trace reports would help to see if we're falling in the latter category.
I've been trying to go through the logs today, but there's a lot there I don't quite understand yet, do you have any hints on what we could be looking for?
I would look for registration failure notifications from the cuda_ipc transport. You could grep through instances of cuda_ipc_md.c or cuda_ipc_cache.c to see if you notice registration failures or errors opening memory handles.
As for reproducers, the easiest one is to use ucx_perftest, but it requires a small enough transfer size. For example, running with a size of 1MB I will always see PtoP, but when running with a reasonably small size (100KB or less), I will always see DtoH/HtoD. I was also able to reproduce this with a simple UCX-Py server/client code, but sizes were different then, with 40KB and up going PtoP and 20KB or less going DtoH/HtoD (didn't test anything in between 20 and 40KB). If you're interested I can share the Python code, but it seems like running with ucx_perftest with small enough transfers should have the same effect. It's still a mystery to me why such sizes vary from one application to another though. The tests I described here were all performed on a DGX-1 using only GPUs 0 and 1.
@pentschev Can you check if I'm not using the env variables that you use? I don't see an issue with ucx_perftest with 100K or 10K.
100K:
==278438== ==278402== NVPROF is profiling process 278438, command: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 100000
NVPROF is profiling process 278402, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 100000
100 0.000 26.200 26.200 3640.00 3640.00 38168 38168
==278402== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 100000
==278402== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 81.94% 459.52ms 2 229.76ms 7.8410us 459.51ms cudaMalloc
13.28% 74.461ms 1552 47.977us 107ns 2.8405ms cuDeviceGetAttribute
3.46% 19.408ms 16 1.2130ms 1.1445ms 1.4109ms cuDeviceTotalMem
1.13% 6.3248ms 16 395.30us 328.37us 495.10us cuDeviceGetName
0.16% 919.77us 2 459.89us 21.825us 897.95us cudaFree
0.01% 29.488us 110 268ns 207ns 3.7290us cuIpcGetMemHandle
0.00% 26.155us 110 237ns 188ns 2.3020us cuPointerGetAttributes
0.00% 26.118us 112 233ns 196ns 1.1900us cuMemGetAddressRange
0.00% 25.794us 16 1.6120us 851ns 5.0020us cuDeviceGetPCIBusId
0.00% 22.919us 112 204ns 166ns 2.2910us cuPointerSetAttribute
0.00% 22.622us 126 179ns 142ns 559ns cuDeviceGetUuid
0.00% 17.974us 112 160ns 140ns 300ns cuCtxGetDevice
0.00% 6.3480us 1 6.3480us 6.3480us 6.3480us cudaSetDevice
0.00% 5.7830us 32 180ns 114ns 400ns cuDeviceGet
0.00% 4.2680us 7 609ns 133ns 2.2300us cudaGetDeviceCount
0.00% 3.0450us 5 609ns 109ns 1.4950us cuDeviceGetCount
0.00% 685ns 2 342ns 189ns 496ns cuCtxGetCurrent
0.00% 441ns 2 220ns 144ns 297ns cuDevicePrimaryCtxGetState
==278438== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 100000
==278438== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 342.18us 110 3.1100us 3.0080us 4.0320us [CUDA memcpy PtoP]
API calls: 81.07% 459.34ms 2 229.67ms 8.5820us 459.34ms cudaMalloc
13.67% 77.483ms 1552 49.924us 116ns 2.8541ms cuDeviceGetAttribute
3.39% 19.233ms 16 1.2021ms 805.25us 1.4889ms cuDeviceTotalMem
0.97% 5.4941ms 16 343.38us 297.27us 489.64us cuDeviceGetName
0.20% 1.1113ms 2 555.63us 36.031us 1.0752ms cudaFree
0.15% 859.67us 1 859.67us 859.67us 859.67us cuIpcOpenMemHandle
0.15% 836.96us 1689 495ns 348ns 11.004us cuEventQuery
0.10% 575.64us 16 35.977us 1.8620us 527.82us cuStreamCreate
0.10% 553.14us 1 553.14us 553.14us 553.14us cuIpcCloseMemHandle
0.10% 547.57us 110 4.9770us 4.1250us 29.591us cuMemcpyDtoDAsync
0.02% 96.098us 16 6.0060us 4.9500us 14.927us cuStreamDestroy
0.01% 80.943us 460 175ns 146ns 941ns cuCtxGetDevice
0.01% 52.325us 110 475ns 401ns 3.2800us cuEventRecord
0.01% 49.908us 128 389ns 278ns 1.0000us cuEventCreate
0.01% 39.616us 128 309ns 239ns 717ns cuEventDestroy
0.01% 36.986us 110 336ns 295ns 1.3200us cuIpcGetMemHandle
0.01% 34.920us 112 311ns 275ns 780ns cuMemGetAddressRange
0.01% 33.906us 110 308ns 250ns 2.9390us cuPointerGetAttributes
0.01% 29.237us 112 261ns 221ns 2.2320us cuPointerSetAttribute
0.00% 25.651us 130 197ns 183ns 636ns cuDevicePrimaryCtxGetState
0.00% 23.243us 16 1.4520us 697ns 6.6430us cuDeviceGetPCIBusId
0.00% 21.337us 130 164ns 157ns 516ns cuCtxGetCurrent
0.00% 21.250us 115 184ns 150ns 969ns cuDeviceGetCount
0.00% 7.8710us 1 7.8710us 7.8710us 7.8710us cudaSetDevice
0.00% 6.8170us 32 213ns 144ns 540ns cuDeviceGet
0.00% 5.4290us 7 775ns 131ns 3.4120us cudaGetDeviceCount
0.00% 3.2900us 16 205ns 156ns 572ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 100000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 100000
10K:
==278664== NVPROF is profiling process 278664, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==278699== NVPROF is profiling process 278699, command: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
100 0.000 24.152 24.152 394.87 394.87 41405 41405
==278664== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==278664== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 82.28% 473.66ms 2 236.83ms 10.215us 473.65ms cudaMalloc
12.95% 74.519ms 1552 48.014us 103ns 2.7595ms cuDeviceGetAttribute
3.47% 19.986ms 16 1.2491ms 663.96us 2.5323ms cuDeviceTotalMem
1.11% 6.3640ms 16 397.75us 165.70us 661.48us cuDeviceGetName
0.16% 913.62us 2 456.81us 26.705us 886.91us cudaFree
0.00% 28.552us 110 259ns 212ns 4.0810us cuIpcGetMemHandle
0.00% 27.676us 110 251ns 180ns 6.0480us cuPointerGetAttributes
0.00% 27.349us 112 244ns 211ns 1.9520us cuMemGetAddressRange
0.00% 23.671us 112 211ns 168ns 2.2680us cuPointerSetAttribute
0.00% 22.160us 16 1.3850us 699ns 6.0630us cuDeviceGetPCIBusId
0.00% 21.742us 126 172ns 143ns 726ns cuDeviceGetUuid
0.00% 16.242us 112 145ns 137ns 244ns cuCtxGetDevice
0.00% 7.9580us 1 7.9580us 7.9580us 7.9580us cudaSetDevice
0.00% 6.9000us 32 215ns 118ns 738ns cuDeviceGet
0.00% 5.7260us 7 818ns 147ns 3.7480us cudaGetDeviceCount
0.00% 2.2730us 5 454ns 241ns 843ns cuDeviceGetCount
0.00% 709ns 2 354ns 158ns 551ns cuDevicePrimaryCtxGetState
0.00% 681ns 2 340ns 161ns 520ns cuCtxGetCurrent
==278699== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
==278699== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 247.91us 110 2.2530us 2.2080us 2.6570us [CUDA memcpy PtoP]
API calls: 82.06% 469.13ms 2 234.56ms 9.9870us 469.12ms cudaMalloc
12.61% 72.115ms 1552 46.465us 101ns 3.4354ms cuDeviceGetAttribute
3.44% 19.692ms 16 1.2307ms 650.41us 1.7946ms cuDeviceTotalMem
1.01% 5.7823ms 16 361.39us 164.06us 656.16us cuDeviceGetName
0.19% 1.0938ms 2 546.90us 32.128us 1.0617ms cudaFree
0.15% 850.92us 1 850.92us 850.92us 850.92us cuIpcOpenMemHandle
0.15% 843.02us 1764 477ns 342ns 6.7340us cuEventQuery
0.10% 577.28us 16 36.080us 1.6030us 530.96us cuStreamCreate
0.10% 564.59us 1 564.59us 564.59us 564.59us cuIpcCloseMemHandle
0.09% 501.32us 110 4.5570us 3.9720us 26.119us cuMemcpyDtoDAsync
0.02% 90.220us 16 5.6380us 4.5560us 13.454us cuStreamDestroy
0.01% 72.425us 461 157ns 132ns 899ns cuCtxGetDevice
0.01% 46.347us 110 421ns 368ns 3.4610us cuEventRecord
0.01% 45.085us 129 349ns 247ns 875ns cuEventCreate
0.01% 39.424us 129 305ns 228ns 1.1890us cuEventDestroy
0.01% 33.959us 112 303ns 264ns 822ns cuMemGetAddressRange
0.01% 33.356us 110 303ns 222ns 5.6730us cuPointerGetAttributes
0.01% 32.970us 110 299ns 249ns 1.0340us cuIpcGetMemHandle
0.00% 26.858us 112 239ns 198ns 2.0650us cuPointerSetAttribute
0.00% 21.057us 131 160ns 147ns 441ns cuDevicePrimaryCtxGetState
0.00% 20.830us 16 1.3010us 711ns 4.7010us cuDeviceGetPCIBusId
0.00% 19.443us 115 169ns 144ns 780ns cuDeviceGetCount
0.00% 18.469us 131 140ns 133ns 432ns cuCtxGetCurrent
0.00% 6.9110us 1 6.9110us 6.9110us 6.9110us cudaSetDevice
0.00% 6.2780us 32 196ns 111ns 426ns cuDeviceGet
0.00% 5.3400us 7 762ns 136ns 3.0850us cudaGetDeviceCount
0.00% 3.0800us 16 192ns 131ns 293ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
I'm still seeing 100% activity on the PtoP channel. This is all on a DGX-2 and NOT a DGX-1. On a DGX-1 you can expect non-PtoP for devices that are not IPC accessible (like GPUs 0 and 5, for example). That said, I definitely don't see issues with DGX-1 for peer accessible GPUs.
Interesting, using the same variables as you do I see PtoP even for buffers as small as 10KB -- meaning that works. The variables I've been using (and that we're using in Dask generally) are: UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm. I tried quickly to use the same variables that you're using in Dask, but it didn't immediately work (I see some crashes); I'll have to debug it further.
[Ordering in UCX_TLS does not matter, correct?]
For reference, the data we send typically looks like the following (a mix of host and GPU memory):
In [17]: frames
Out[17]:
[b'',
b'\x80',
b'\x83\xa7headers\x81\x91\xa4data\x8a\xa4type\xc4$\x80\x03ccudf.core.dataframe\nDataFrame\nq\x00.\xa5index\x85\xacindex_column\x83\xa4type\xc41\x80\x03ccudf.core.column.numerical\nNumericalColumn\nq\x00.\xa5dtype\xa3<i8\xabframe_count\x01\xa4name\xc4\x04\x80\x03N.\xa5dtype\xc4C\x80\x03cnumpy\ndtype\nq\x00X\x02\x00\x00\x00i8q\x01K\x00K\x01\x87q\x02Rq\x03(K\x03X\x01\x00\x00\x00<q\x04NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x05b.\xa4type\xc4#\x80\x03ccudf.core.index\nGenericIndex\nq\x00.\xabframe_count\x01\xb1index_frame_count\x01\xaccolumn_names\xc4\x16\x80\x03X\x01\x00\x00\x00aq\x00X\x01\x00\x00\x00bq\x01\x86q\x02.\xa7columns\x92\x83\xa4type\xc41\x80\x03ccudf.core.column.numerical\nNumericalColumn\nq\x00.\xa5dtype\xa3<f8\xabframe_count\x01\x83\xa4type\xc41\x80\x03ccudf.core.column.numerical\nNumericalColumn\nq\x00.\xa5dtype\xa3<i8\xabframe_count\x01\xaftype-serialized\xc40\x80\x04\x95%\x00\x00\x00\x00\x00\x00\x00\x8c\x13cudf.core.dataframe\x94\x8c\tDataFrame\x94\x93\x94.\xaaserializer\xa4cuda\xabcompression\x93\xc0\xc0\xc0\xa7lengths\x93\xce\x00z\x12\x00\xce\x00z\x12\x00\xce\x00z\x12\x00\xa5count\x03\xa4keys\x91\x91\xa4data\xabbytestrings\x90',
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f24fd997c50>,
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f24fd99b110>,
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f24fd99b290>]
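For context, a header/frames pair like the one above can be produced with Dask's serialization machinery; a small sketch (the DataFrame contents are illustrative):

import cudf
from distributed.protocol import serialize

# Tiny illustrative frame; real workloads carry much larger columns
df = cudf.DataFrame({"a": [1.0, 2.0, 3.0], "b": [1, 2, 3]})

# Asking for the "cuda" serializer family first keeps device buffers on the GPU;
# the header is host-side metadata and the frames are a mix of bytes and device objects
header, frames = serialize(df, serializers=["cuda", "dask", "pickle"])
print(header["serializer"])
print([type(f) for f in frames])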
This is a bit orthogonal to the present conversation (though it relates directly to acquiring contexts, which came up earlier). @madsbk put together PR ( https://github.com/numba/numba/pull/5189 ), which should make checking __cuda_array_interface__ on Numba DeviceNDArrays not require a context (IOW not calling cuDevicePrimaryCtxRetain). Could be causing some of the overhead we are seeing here.
Same results with the parameters you provided @pentschev (UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm).
10M:
==304392== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000
==304392== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 83.37% 438.38ms 2 219.19ms 570.77us 437.81ms cudaMalloc
11.89% 62.533ms 1552 40.292us 108ns 1.8181ms cuDeviceGetAttribute
3.36% 17.678ms 16 1.1049ms 645.88us 1.1560ms cuDeviceTotalMem
0.95% 5.0202ms 16 313.76us 162.19us 336.13us cuDeviceGetName
0.37% 1.9695ms 2 984.76us 927.43us 1.0421ms cudaFree
0.01% 33.851us 110 307ns 269ns 2.1330us cuPointerGetAttributes
0.01% 33.243us 112 296ns 261ns 1.9720us cuMemGetAddressRange
0.01% 33.220us 110 302ns 257ns 3.5170us cuIpcGetMemHandle
0.00% 25.926us 112 231ns 191ns 2.4050us cuPointerSetAttribute
0.00% 22.709us 126 180ns 147ns 444ns cuDeviceGetUuid
0.00% 21.628us 112 193ns 155ns 3.3630us cuCtxGetDevice
0.00% 20.914us 16 1.3070us 761ns 5.8450us cuDeviceGetPCIBusId
0.00% 8.8660us 32 277ns 119ns 3.1790us cuDeviceGet
0.00% 5.9510us 1 5.9510us 5.9510us 5.9510us cudaSetDevice
0.00% 4.5130us 7 644ns 145ns 2.2140us cudaGetDeviceCount
0.00% 2.8480us 5 569ns 126ns 1.0750us cuDeviceGetCount
0.00% 866ns 2 433ns 251ns 615ns cuCtxGetCurrent
0.00% 773ns 2 386ns 221ns 552ns cuDevicePrimaryCtxGetState
==304427== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000000
==304427== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 7.8750ms 110 71.591us 71.329us 72.801us [CUDA memcpy PtoP]
API calls: 81.69% 437.28ms 2 218.64ms 864.81us 436.41ms cudaMalloc
12.35% 66.104ms 1552 42.592us 111ns 1.8033ms cuDeviceGetAttribute
3.29% 17.617ms 16 1.1011ms 658.89us 1.1626ms cuDeviceTotalMem
0.96% 5.1255ms 16 320.34us 163.96us 368.76us cuDeviceGetName
0.54% 2.8884ms 7078 408ns 348ns 3.7320us cuEventQuery
0.41% 2.1685ms 2 1.0842ms 1.0495ms 1.1190ms cudaFree
0.35% 1.8490ms 1 1.8490ms 1.8490ms 1.8490ms cuIpcOpenMemHandle
0.11% 603.28us 110 5.4840us 4.8090us 28.923us cuMemcpyDtoDAsync
0.10% 560.34us 16 35.021us 1.6480us 513.85us cuStreamCreate
0.10% 553.29us 1 553.29us 553.29us 553.29us cuIpcCloseMemHandle
0.02% 94.678us 16 5.9170us 4.6620us 17.035us cuStreamDestroy
0.01% 79.139us 460 172ns 136ns 822ns cuCtxGetDevice
0.01% 54.332us 110 493ns 432ns 3.3140us cuEventRecord
0.01% 46.309us 128 361ns 244ns 907ns cuEventCreate
0.01% 42.095us 110 382ns 355ns 1.3010us cuIpcGetMemHandle
0.01% 39.017us 110 354ns 302ns 2.7690us cuPointerGetAttributes
0.01% 37.881us 128 295ns 232ns 1.0330us cuEventDestroy
0.01% 34.031us 112 303ns 278ns 983ns cuMemGetAddressRange
0.01% 30.684us 112 273ns 239ns 1.9910us cuPointerSetAttribute
0.00% 21.809us 115 189ns 155ns 1.6400us cuDeviceGetCount
0.00% 21.661us 130 166ns 155ns 401ns cuDevicePrimaryCtxGetState
0.00% 20.980us 16 1.3110us 763ns 5.4060us cuDeviceGetPCIBusId
0.00% 19.123us 130 147ns 137ns 682ns cuCtxGetCurrent
0.00% 7.5260us 1 7.5260us 7.5260us 7.5260us cudaSetDevice
0.00% 6.2510us 32 195ns 117ns 445ns cuDeviceGet
0.00% 4.8820us 7 697ns 128ns 3.1100us cudaGetDeviceCount
0.00% 2.9790us 16 186ns 150ns 301ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000000
10K:
==304779== NVPROF is profiling process 304779, command: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
==304743== NVPROF is profiling process 304743, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
100 0.000 46.432 46.432 205.39 205.39 21537 21537
==304743== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==304743== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 82.45% 448.56ms 2 224.28ms 9.8690us 448.55ms cudaMalloc
12.97% 70.563ms 1552 45.465us 104ns 2.9980ms cuDeviceGetAttribute
3.30% 17.961ms 16 1.1226ms 840.76us 1.3979ms cuDeviceTotalMem
1.06% 5.7766ms 16 361.04us 316.69us 828.20us cuDeviceGetName
0.17% 937.56us 2 468.78us 28.030us 909.53us cudaFree
0.01% 36.815us 110 334ns 274ns 3.2790us cuIpcGetMemHandle
0.01% 36.798us 110 334ns 280ns 2.3640us cuPointerGetAttributes
0.01% 33.805us 112 301ns 264ns 1.0570us cuMemGetAddressRange
0.01% 27.361us 112 244ns 199ns 2.3790us cuPointerSetAttribute
0.00% 25.180us 16 1.5730us 707ns 5.8900us cuDeviceGetPCIBusId
0.00% 23.105us 126 183ns 152ns 291ns cuDeviceGetUuid
0.00% 18.990us 112 169ns 158ns 278ns cuCtxGetDevice
0.00% 8.6330us 1 8.6330us 8.6330us 8.6330us cudaSetDevice
0.00% 6.5090us 7 929ns 151ns 3.6160us cudaGetDeviceCount
0.00% 6.2310us 32 194ns 110ns 421ns cuDeviceGet
0.00% 2.3970us 5 479ns 230ns 818ns cuDeviceGetCount
0.00% 834ns 2 417ns 324ns 510ns cuCtxGetCurrent
0.00% 721ns 2 360ns 222ns 499ns cuDevicePrimaryCtxGetState
==304779== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
==304779== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 248.64us 110 2.2600us 2.2080us 2.7840us [CUDA memcpy PtoP]
API calls: 82.24% 448.74ms 2 224.37ms 9.6500us 448.73ms cudaMalloc
12.55% 68.466ms 1552 44.114us 103ns 2.9920ms cuDeviceGetAttribute
3.28% 17.922ms 16 1.1201ms 818.38us 1.4155ms cuDeviceTotalMem
1.08% 5.8701ms 16 366.88us 273.04us 1.0082ms cuDeviceGetName
0.20% 1.1168ms 2 558.41us 31.612us 1.0852ms cudaFree
0.15% 841.04us 1 841.04us 841.04us 841.04us cuIpcOpenMemHandle
0.10% 549.31us 16 34.331us 1.7210us 501.79us cuStreamCreate
0.10% 540.45us 1 540.45us 540.45us 540.45us cuIpcCloseMemHandle
0.10% 527.22us 110 4.7920us 4.3420us 30.328us cuMemcpyDtoDAsync
0.09% 514.88us 785 655ns 372ns 10.039us cuEventQuery
0.02% 87.440us 16 5.4650us 4.5070us 14.245us cuStreamDestroy
0.02% 81.989us 461 177ns 131ns 568ns cuCtxGetDevice
0.01% 54.709us 129 424ns 261ns 5.2310us cuEventCreate
0.01% 52.930us 110 481ns 407ns 3.1770us cuEventRecord
0.01% 39.507us 110 359ns 297ns 1.3080us cuIpcGetMemHandle
0.01% 38.424us 110 349ns 286ns 2.9290us cuPointerGetAttributes
0.01% 36.447us 112 325ns 288ns 770ns cuMemGetAddressRange
0.01% 36.214us 129 280ns 226ns 666ns cuEventDestroy
0.01% 29.974us 112 267ns 227ns 2.2360us cuPointerSetAttribute
0.00% 22.874us 16 1.4290us 842ns 4.5790us cuDeviceGetPCIBusId
0.00% 21.202us 131 161ns 148ns 560ns cuDevicePrimaryCtxGetState
0.00% 21.128us 115 183ns 154ns 764ns cuDeviceGetCount
0.00% 18.667us 131 142ns 135ns 533ns cuCtxGetCurrent
0.00% 7.3690us 32 230ns 113ns 492ns cuDeviceGet
0.00% 5.8050us 1 5.8050us 5.8050us 5.8050us cudaSetDevice
0.00% 5.1700us 7 738ns 138ns 3.0220us cudaGetDeviceCount
0.00% 3.3550us 16 209ns 154ns 336ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
Lmk if I've missed something or made some mistake somewhere.
The comment from @jakirkham is interesting because UCX assumes that the current CUDA context is the one which was used to allocate memory resources (and tries to obtain memory handles with respect to that). If you allocated memory using a different context (possibly in a different thread), then calling getmemhandle would fail and you would end up not using cuda_ipc effectively. Could that be happening for the dask case?
Yes that sounds plausible. Thanks for that insight Akshay!
So we should change the current context back to the one used for allocating before sending, is that right? Or is there a way for us to inform UCX of the context it should use?
Yes that sounds plausible. Thanks for that insight Akshay!
So we should change the current context back to the one used for allocating before sending, is that right? Or is there a way for us to inform UCX of the context it should use?
There isn't a way today to indicate to UCX (through a parameter for instance) the context used for allocating memory. The expectation is that the calling thread have the specific context at the top of its stack before making the UCX call.
After some digging it seems RMM (our memory pool manager) only uses the runtime API. So it doesn't have a context itself. I'm not sure if there is a good way to get the context when using the runtime API or from the memory allocations themselves. Suggestions welcome 🙂
Actually, we might be able to query this information from the Driver API using cuPointerGetAttribute. Would it be possible to handle/add this within UCX itself?
Could you let me know what version of UCX you're using? I ran again the same as you and I still see DtoH/HtoD for 10KB, using the same command (except adapting for different paths in my case):
$ UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm /usr/local/cuda-10.1/bin/nvprof ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
[1580821729.763074] [dgx11:50030:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 10000 |
+------------------------------------------------------------------------------------------+
==50030== NVPROF is profiling process 50030, command: ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==50030== Profiling application: ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==50030== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 458.70us 220 2.0850us 1.6950us 2.8790us [CUDA memcpy HtoD]
API calls: 88.06% 507.63ms 2 253.82ms 9.2140us 507.63ms cudaMalloc
6.80% 39.174ms 776 50.481us 119ns 3.2212ms cuDeviceGetAttribute
3.38% 19.465ms 8 2.4332ms 1.2599ms 4.1953ms cuDeviceTotalMem
1.03% 5.9627ms 8 745.34us 291.41us 1.3211ms cuDeviceGetName
0.34% 1.9565ms 220 8.8930us 8.1550us 23.290us cudaStreamSynchronize
0.22% 1.2764ms 220 5.8010us 3.9990us 24.022us cudaMemcpyAsync
0.15% 852.10us 2 426.05us 22.697us 829.40us cudaFree
0.01% 49.537us 110 450ns 330ns 6.3410us cuPointerGetAttributes
0.01% 39.525us 112 352ns 248ns 7.2160us cuPointerSetAttribute
0.00% 18.755us 1 18.755us 18.755us 18.755us cudaSetDevice
0.00% 16.804us 1 16.804us 16.804us 16.804us cudaStreamDestroy
0.00% 14.911us 1 14.911us 14.911us 14.911us cudaStreamCreateWithFlags
0.00% 12.799us 8 1.5990us 1.2380us 3.3530us cuDeviceGetPCIBusId
0.00% 5.8330us 16 364ns 179ns 653ns cuDeviceGet
0.00% 3.7620us 2 1.8810us 540ns 3.2220us cuMemGetAddressRange
0.00% 3.3320us 5 666ns 139ns 1.1820us cuDeviceGetCount
0.00% 2.3820us 5 476ns 124ns 759ns cudaGetDeviceCount
0.00% 2.3790us 8 297ns 210ns 702ns cuDeviceGetUuid
0.00% 1.4730us 2 736ns 456ns 1.0170us cuCtxGetCurrent
0.00% 1.0200us 2 510ns 330ns 690ns cuDevicePrimaryCtxGetState
0.00% 726ns 2 363ns 253ns 473ns cuCtxGetDevice
$ UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm /usr/local/cuda-10.1/bin/nvprof ucx_perftest dgx11 -t tag_bw -m cuda -n 100 -s 10000
[1580821735.327625] [dgx11:50112:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
==50112== NVPROF is profiling process 50112, command: ucx_perftest dgx11 -t tag_bw -m cuda -n 100 -s 10000
100 0.000 61.769 61.769 154.39 154.39 16189 16189
==50112== Profiling application: ucx_perftest dgx11 -t tag_bw -m cuda -n 100 -s 10000
==50112== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 412.59us 220 1.8750us 1.6960us 3.5200us [CUDA memcpy DtoH]
API calls: 88.09% 500.50ms 2 250.25ms 8.6370us 500.49ms cudaMalloc
7.78% 44.220ms 776 56.984us 119ns 4.1772ms cuDeviceGetAttribute
2.65% 15.044ms 8 1.8805ms 1.1896ms 2.5819ms cuDeviceTotalMem
0.66% 3.7769ms 220 17.167us 15.858us 36.109us cudaMemcpyAsync
0.55% 3.1409ms 8 392.62us 119.54us 888.50us cuDeviceGetName
0.13% 717.44us 2 358.72us 16.247us 701.20us cudaFree
0.10% 554.59us 220 2.5200us 2.3380us 11.097us cudaStreamSynchronize
0.02% 86.939us 110 790ns 649ns 2.7950us cuPointerGetAttributes
0.01% 40.389us 112 360ns 335ns 1.3690us cuPointerSetAttribute
0.00% 21.360us 16 1.3350us 169ns 17.349us cuDeviceGet
0.00% 19.180us 1 19.180us 19.180us 19.180us cudaStreamDestroy
0.00% 19.126us 1 19.126us 19.126us 19.126us cudaStreamCreateWithFlags
0.00% 12.666us 8 1.5830us 1.2010us 2.8990us cuDeviceGetPCIBusId
0.00% 5.0400us 1 5.0400us 5.0400us 5.0400us cudaSetDevice
0.00% 2.9710us 5 594ns 191ns 888ns cuDeviceGetCount
0.00% 2.6390us 5 527ns 159ns 796ns cudaGetDeviceCount
0.00% 2.3940us 2 1.1970us 397ns 1.9970us cuMemGetAddressRange
0.00% 2.1490us 8 268ns 231ns 383ns cuDeviceGetUuid
0.00% 2.0860us 2 1.0430us 522ns 1.5640us cuCtxGetCurrent
0.00% 1.4850us 2 742ns 332ns 1.1530us cuDevicePrimaryCtxGetState
0.00% 707ns 2 353ns 252ns 455ns cuCtxGetDevice
The output above is from a DGX-1, but I see the same on a DGX-2. I've been using UCX master (commit 1c75cfdc377680a90b613deabbcca61fb0c050f7) and tried upgrading to latest (commit 5ef72d9f2a3648b19b80333c1ff496987173b9d1), but see the same result in both cases.
Regarding the context discussion: we fixed a lot of context issues with Dask, where threads would be spawned before a CUDA context was created, which meant that the new thread didn't have a context assigned and would ultimately lead to a crash in UCX. All these issues have been fixed, otherwise we would be seeing crashes all the time. IOW, I don't think there are any CUDA context related issues anymore. We're creating a context for each device at Dask startup, so unless someone is mistakenly creating a context during runtime, I don't believe there is more than one context created at any time.
I am seeing the same as @pentschev on both master and v1.7. I built master with following configuration:
../contrib/configure-devel --prefix=$CONDA_PREFIX --with-cuda=$CUDA_HOME --enable-mt CPPFLAGS="-I/$CUDA_HOME/include"
I've been reading through a lot of UCX CUDA code, and I have an idea. Be warned it may be a bad idea; I don't know enough about CUDA and IPC transfers. It is possible to register memory with UCX via ucp_mem_map(); it's usually used with RMA operations to register memory with the NIC. I wonder if it would help performance to have RMM register the CUDA memory pools with UCX itself and act as a cache at a higher level. My main fear would be that it would confuse the rendezvous pipeline with the memory pools already registered.
I've gone well past what I know about cuda though, so I could be horribly wrong, but wanted to float UCX APIs that could be useful.
@quasiben and I found that manually setting UCX_RNDV_THRESH to a really low value (e.g., 1) works around the issue. IOW, all worker-worker transfers seem to be going over PtoP in that situation. It also solves those cases we mentioned above where, depending on configuration, a size of 10000 would not go over PtoP.
The transports enabled have an effect on the rendezvous threshold when it's set to auto (the default). For example, on a DGX-1, having UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc will show ucp_ep.c:1222 UCX TRACE rndv threshold is 474127 (send_nbr: 262144), but if we add mm to UCX_TLS as well, then we'll see ucp_ep.c:1222 UCX TRACE rndv threshold is 8256 (send_nbr: 262144).
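As a stopgap, one way to apply this workaround from Python is to export the variable before UCX gets initialized, for example when starting a local cluster (the LocalCUDACluster arguments below are just illustrative):

import os

# UCX reads this at initialization time, so set it before the scheduler,
# workers, or any ucp import picks up the UCX configuration
os.environ["UCX_RNDV_THRESH"] = "1"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    protocol="ucx",
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
)
client = Client(cluster)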
I checked that setting the threshold indeed decreases the runtime to 18 seconds (from about 22 before), but it's still slower than Python sockets for this particular example. I wasn't yet able to give a better look at trace and profiling it, but will do that tomorrow.
Thanks also @MattBBaker for the suggestion. I think the way CUDA memory pools are handled today with UCX seems to be working fine. I was already able to confirm that UCX's CUDA IPC cache is performing well (no unmapping/remapping of handles) as long as we do use a memory pool (i.e., not using one is generally very slow due to the frequent opening and closing of handles). I'm not totally sure this is an answer to your suggestion though.
@quasiben and I found that manually setting UCX_RNDV_THRESH to a really low value (e.g., 1) works around the issue.
Thanks @pentschev! 😄
Should we bake this into dask-cuda or ucx-py as a workaround near term? Would this address the same need as PR ( https://github.com/rapidsai/ucx-py/pull/406 ) or is there still more to do to make things functional near term?
cc @madsbk
I'm not sure yet @jakirkham . I'm doing some profiling still to verify that everything works correctly, plus I'll need to check whether this affects other transports, so it may take a while until we're confident this is the right solution. There's still some lower performance when compared to TCP and I'm currently investigating that.
Dask-cudf multi partition merge slows down with ucx
Dask-cudf merge seems to slow down with ucx. Wall time: (15.4 seconds on tcp) vs (37.8 s on ucx) (exp-01). In the attached example we see a slowdown with ucx vs just using tcp.
Wall Times on exp-01
UCX Time
TCP times
Repro Code:
Helper Function to create distributed dask-cudf frame
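A minimal sketch of what such a helper might look like (column names, sizes, and partition counts are illustrative, not the exact code used for the timings above):

import numpy as np
import cudf
import dask_cudf

def make_random_frame(nrows, npartitions=16):
    # Random integer join key plus a float payload column
    df = cudf.DataFrame(
        {
            "key": np.random.randint(0, nrows, size=nrows),
            "payload": np.random.random(nrows),
        }
    )
    # Spread the cuDF frame across the cluster as a dask_cudf collection
    return dask_cudf.from_cudf(df, npartitions=npartitions)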
RMM Setup:
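Roughly, the RMM pool is enabled on every worker along these lines (pool size illustrative; client is assumed to be the dask.distributed Client connected to the dask-cuda cluster):

import rmm

def setup_rmm_pool():
    # Enable a pooled allocator so DeviceBuffer allocations don't hit cudaMalloc each time
    rmm.reinitialize(pool_allocator=True, initial_pool_size=int(30 * 2**30))

client.run(setup_rmm_pool)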
Merge Code:
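And the timed merge is essentially of this shape (row counts illustrative, using the hypothetical helper sketched above):

from dask.distributed import wait

# Two distributed frames created with the helper above
left = make_random_frame(200_000_000).persist()
right = make_random_frame(200_000_000).persist()
wait([left, right])

# The merge is the step whose wall time differs between TCP and UCX
merged = left.merge(right, on=["key"]).persist()
wait(merged)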
The slowdown happens in the merge step.
Additional Context:
There has been discussion about this on our internal Slack channel; please see there for more context.