Adding the dask reports for ucx/tcp:
The notable issue in the task graph is that for TCP there is more communication but the workers are fully engaged (no white space):
For UCX, there is less communication but workers are not actively working on a particular task (white space):
Thanks Vibhu for filing and Ben for including the profiling plot.
It looks like the read time on the worker is taking ~1.5x longer in the UCX case than the TCP case. This may be helped by releasing the GIL during these operations. ( https://github.com/rapidsai/ucx-py/pull/391 )
We are also spending a lot of time generating the __cuda_array_interface__ in Numba. Not sure why that is the case. Will look at this more closely.
We are also spending more time working in the UCX report (3-4x the time). I'm going to be raising a similar issue with a reproducible example for groupby-aggregation behavior today.
Ah was looking at the gaps initially based on Ben's comment. Should I be looking at something else?
I just wanted to highlight that the actual time spent doing compute is materially different, which may or may not be related, but also naively feels like a potential contributor to the slowdown.
Meaning the overall runtime? If so, then I think Ben has the right idea to look at gaps. If not, some clarity on the compute time you are looking at would be helpful 🙂
Yes, sorry for being unclear 😄
From the worker profile in the linked performance report, we spend 130 total seconds doing "compute" with UCX on across the 16 GPUs. We only spend 40 total seconds with TCP across the 16 GPUs.
For example, _concat takes considerably longer with UCX, vs 13 seconds with TCP; shuffle_group shows a similar difference.
Another interesting thing to note is that deletion of dependencies takes quite a bit of time in UCX compared with TCP. This can be seen in the Worker Profile (administrative) page:
UCX
TCP
No worries. Thanks Nick! Just wanted to make sure we are focusing on the pain point 😉
Yeah that's what I was looking at above, Ben. Agree this is what we should focus on.
I want to note that these issues could be a symptom of poor work stealing; we are still digging into this idea.
That may be true.
There is also some cost in building Numba arrays, which we are doing. I've done some work to speed this up. Am still trying to figure out the right way to hook this into what we are doing here.
In particular please see this rough benchmark.
Here are some timings I gathered with this code:
TCP
Create time: 10.193811178207397
Merge time: 15.630461931228638
Create time: 10.504480838775635
Merge time: 15.304280996322632
Create time: 10.684969425201416
Merge time: 16.014200448989868
UCX (Master):
Create time: 10.919841766357422
Merge time: 22.28807020187378
Create time: 11.757182836532593
Merge time: 23.014573097229004
Create time: 11.073307514190674
Merge time: 22.2984356880188
UCX (v1.7.0):
Create time: 22.65068030357361
Merge time: 45.892472982406616
Create time: 21.901089429855347
Merge time: 44.46129822731018
Create time: 22.363646984100342
Merge time: 44.13629865646362
UCX (v1.7.0 + https://github.com/openucx/ucx/pull/4646):
Create time: 10.750166177749634
Merge time: 23.12590527534485
Create time: 11.505124568939209
Merge time: 22.653675079345703
Create time: 11.365516662597656
Merge time: 20.664494276046753
As we can see above, https://github.com/openucx/ucx/pull/4646 helps (and if cherry-picked to v1.7.0, matches performance of master) but TCP still outperforms UCX.
@quasiben and I explored the __delitem__ that was mentioned before, and as a rough test we ended up commenting it out to see if that would affect performance, but it turns out that it doesn't change performance at all.
In looking more thoroughly through the worker profile there seems to be a fair amount of time managing cuda contexts when using UCX:
@kkraus14 do you happen to have any insights here ?
Yeah that's what I was trying to get at with this comment.
Am still trying to find the right way to hook Dask into higher level RMM/dask-cuda. PR ( https://github.com/rapidsai/rmm/pull/264 ) is a start at that.
Also after a fairly long rabbit hole earlier today, I think rmm.auto_device may be causing some of the problems here. Basically it might be copying data unnecessarily ( https://github.com/rapidsai/rmm/issues/265 ). It's mostly unused except for this one deserialization line in cuDF when using strings (probably affects us here). Keith and I are trying to remove that currently ( https://github.com/rapidsai/cudf/pull/4003 ).
thanks @jakirkham for looking into this
Have added PR ( https://github.com/rapidsai/rmm/pull/268 ) to RMM. This should make it easier to directly copy from a device pointer to a host buffer without going through a bunch of Numba machinery first. Am hoping this will be useful in cuDF and dask-cuda where we can benefit from removing some of this overhead.
I set up two dask-cuda-workers manually and ran with nvprof to better understand what was happening in CUDA land:
==11522== Profiling application: /datasets/bzaitlen/miniconda3/envs/rapidsai-latest/bin/python /datasets/bzaitlen/miniconda3/envs/rapidsai-latest/bin/dask-cuda-worker ucx://10.33.227.163:8786 --enable-nvlink --enable-tcp-over-ucx
==11522== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 98.50% 324.77ms 1 324.77ms 324.77ms 324.77ms cuDevicePrimaryCtxRetain
0.90% 2.9561ms 97 30.474us 138ns 1.2527ms cuDeviceGetAttribute
0.47% 1.5458ms 1 1.5458ms 1.5458ms 1.5458ms cuDeviceTotalMem
0.08% 274.87us 2 137.43us 128.12us 146.75us cuDeviceGetName
0.05% 162.31us 1 162.31us 162.31us 162.31us cuMemGetInfo
0.00% 4.4480us 6 741ns 182ns 1.7090us cuDeviceGetCount
0.00% 3.9780us 1 3.9780us 3.9780us 3.9780us cuDeviceGetPCIBusId
0.00% 2.3980us 1 2.3980us 2.3980us 2.3980us cuCtxPushCurrent
0.00% 1.7690us 1 1.7690us 1.7690us 1.7690us cuInit
0.00% 1.7600us 4 440ns 150ns 799ns cudaGetDeviceCount
0.00% 1.6120us 3 537ns 480ns 611ns cuDeviceGet
0.00% 1.0650us 1 1.0650us 1.0650us 1.0650us cuDriverGetVersion
0.00% 951ns 1 951ns 951ns 951ns cuCtxGetCurrent
0.00% 742ns 1 742ns 742ns 742ns cuDeviceComputeCapability
0.00% 251ns 1 251ns 251ns 251ns cuDeviceGetUuid
We are spending a lot of time with cuDevicePrimaryCtxRetain; this is happening within Numba and we can see it in the worker profile as well:
I believe this is known to both @jakirkham and @kkraus14
Thanks for doing that Ben! 😄
This coming from Numba makes sense. Though it's nice to have the additional detail provided here.
Just to make sure we are not missing anything, do we know of any other situations (outside of Numba array creation) where cuDevicePrimaryCtxRetain would be called?
Edit: Should add we are hoping PR ( https://github.com/rapidsai/rmm/pull/268 ) paves a short path for us to dispense with this overhead.
I checked cudf/rmm/ucx and I didn't see cuDevicePrimaryCtxRetain anywhere except Numba. It's worth noting that at this point we already should have a context created from the initialization step in dask-cuda. I also want to point out that the code lower in the stack comes from these two locations:
https://github.com/rapidsai/rmm/blob/branch-0.12/python/rmm/rmm.py#L142-L164
which calls:
https://github.com/rapidsai/rmm/blob/2e309eb37d2928db16ba5413fd25c9636e20c80f/python/rmm/rmm.py#L31
FWIW I tried adding __cuda_array_interface__ to Buffer and coercing Buffer objects to arrays as needed with PR ( https://github.com/rapidsai/cudf/pull/4023 ). This may help and is probably worth trying (assuming CI passes, though I can fix that tomorrow). It's probably a useful change for building other things in cuDF in the future (feedback on this is welcome).
That said, I wouldn't be surprised if we still see a fair amount of time spent in Numba due to legacy RMM allocation methods. So am planning on pushing on PR ( https://github.com/rapidsai/rmm/pull/268 ) more tomorrow in the hopes of quickly integrating this into cuDF and dask-cuda for further testing.
One thing that confuses me about the flame graph above (and hopefully someone can help answer this 🙂): it appears that creating an rmm.DeviceBuffer is taking a lot of time. However these are generally very fast to construct. In fact it's much faster than anything else that we might use to allocate device memory (please see benchmark below). Given this, why does it appear that line is taking a long time? 🤔
In [1]: import rmm
In [2]: rmm.reinitialize(pool_allocator=True,
...: initial_pool_size=int(2 * 2**30))
Out[2]: 0
In [3]: %timeit rmm.DeviceBuffer(size=50_000_000)
360 ns ± 1.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: import cupy
In [5]: %timeit cupy.empty((50_000_000,), dtype="u1")
2.15 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: import numba.cuda
In [7]: %timeit numba.cuda.device_array((50_000_000,), dtype="u1")
1.39 ms ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Of course if the pool were not actually enabled, that would present a problem.
In [1]: import rmm
In [2]: rmm.reinitialize(pool_allocator=False)
Out[2]: 0
In [3]: %timeit rmm.DeviceBuffer(size=50_000_000)
1.12 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that profiling dask-cuda-worker with nvprof will only gather information from the parent process. Internally, dask-cuda-worker uses Nanny, and that will fork/spawn new processes which are not caught by nvprof. Also note that the output from https://github.com/rapidsai/ucx-py/issues/402#issuecomment-580533448 doesn't have any calls to cudaMalloc nor any kernel launches, which hints that that process is not doing any useful CUDA work. Finally, as already mentioned, there's only a single call to cuDevicePrimaryCtxRetain, but we would be more interested in seeing a number of calls to the same function during runtime, with a probable accumulated time on the order of seconds that could be causing a slowdown.
As for rmm.DeviceBuffer, I was able yesterday to confirm the same numbers from Ben's flame graph. Apart from that, I can confirm with 110% certainty that we do have the RMM pool enabled; one of the ways I can confirm that is the total time spent in rmm.DeviceBuffer:
- TCP: 332s
- UCX: 287s
@quasiben also mentioned to me that we had ~81k calls to rmm.DeviceBuffer in that flamechart. I believe the pool allocator of RMM (cnmem) is known to have performance issues with freeing memory once there's a certain number of allocations being managed. As a quick test it may be worth testing what happens when you create 10/100k allocations, store them somewhere, and then see how long it takes to delete the references.
Wrote a quick test:
import rmm
import time
rmm.reinitialize(pool_allocator=True, initial_pool_size=int(2**34))
num_allocations = 10000
allocations = [None] * num_allocations
start = time.time()
for i in range(num_allocations):
allocations[i] = rmm.DeviceBuffer(size=100)
end = time.time()
time_result = (end - start) * 1000
print(f"Time taken for allocating {num_allocations} buffers: {time_result}ms")
start = time.time()
for i in range(num_allocations):
allocations[i] = None
end = time.time()
time_result = (end - start) * 1000
print(f"Time taken for freeing {num_allocations} buffers: {time_result}ms")
Results:
Time taken for allocating 100000 buffers: 44.32559013366699ms
Time taken for freeing 100000 buffers: 22098.9887714386ms
Time taken for allocating 50000 buffers: 23.27561378479004ms
Time taken for freeing 50000 buffers: 5765.538692474365ms
Time taken for allocating 25000 buffers: 11.168956756591797ms
Time taken for freeing 25000 buffers: 1489.1653060913086ms
Time taken for allocating 10000 buffers: 5.175113677978516ms
Time taken for freeing 10000 buffers: 360.734224319458ms
Note that profiling dask-cuda-worker with nvprof will only gather information from the parent process. Internally, dask-cuda-worker uses Nanny, and that will fork/spawn new processes which are not caught by nvprof. Also note that the output from #402 (comment) doesn't have any calls to cudaMalloc nor any kernel launches, which hints that that process is not doing any useful CUDA work. Finally, as already mentioned, there's only a single call to cuDevicePrimaryCtxRetain, but we would be more interested in seeing a number of calls to the same function during runtime, with a probable accumulated time on the order of seconds that could be causing a slowdown.
IDK if this would help, but one could add a sitecustomize.py in the site-packages directory of the Conda environment and add custom Python code that will be run by each interpreter at startup. This could allow us to gather and dump a bunch of profiling statistics from all interpreters in some common directory. Though I don't know if one can configure nvprof to inspect an existing process. If so, there is probably some ugly ctypes code we could write to enable profiling that process.
Barring that, one might rename the pythonX.Y executable to something like pythonX.Y-orig and add a shell script that starts the Python interpreter with nvprof. Would probably also need to fix all symlinks (like python, pythonX) to point at this script. Not sure how well this would work with forking, but it could be another way to profile the cluster of processes.
After a lot more profiling I was able to pinpoint the primary issue with workflows such as this one, see image below:
What happens here is that UCX is not always transferring data over NVLink, but over TCP (which incurs DtoH and HtoD copies); even worse, it breaks each transfer down into segments of 8KB, forcing a copy plus stream synchronization many times, making that process extremely slow. The memory copy stats look as follows:
CUDA Memory Operation Statistics (nanoseconds)
Time(%) Total Time Operations Average Minimum Maximum Name
------- -------------- ---------- -------------- -------------- -------------- --------------------------------------------------------------------------------
55.6 8407106705 2520239 3335.8 1760 4781363 [CUDA memcpy HtoD]
42.8 6466457263 2578038 2508.3 1120 204227 [CUDA memcpy DtoH]
1.4 207633644 94986 2185.9 1280 24224 [CUDA memcpy DtoD]
0.3 38804550 1279 30339.8 7104 441828 [CUDA memcpy PtoP]
0.0 1577193 719 2193.6 1824 3456 [CUDA memset]
CUDA Memory Operation Statistics (KiB)
Total Operations Average Minimum Maximum Name
------------------- -------------- ------------------- ----------------- ------------------- --------------------------------------------------------------------------------
933.324 719 1.298 0.004 1.500 [CUDA memset]
4699152.953 1279 3674.084 662.445 4469.922 [CUDA memcpy PtoP]
28620716.464 2520239 11.356 0.001 4469.922 [CUDA memcpy HtoD]
28080943.471 94986 295.632 0.004 4469.922 [CUDA memcpy DtoD]
19852018.817 2578038 7.700 0.001 294.039 [CUDA memcpy DtoH]
As can be noticed, only 0.3% of all time spent transferring memory is happening over NVLink (as seen by PtoP). There's also 1.4% of transfer time happening in DtoD (AFAIK, that means transfers are going from one device to another via host, i.e., DtoH on the source device + HtoD on the target device), and it's not clear to me why that happens given this is running on a DGX-2 and all devices should have an NVLink connection due to the NVSwitch.
The TCP segment size is configured via UCX_TCP_TX_SEG_SIZE (default for this is 8KB, as mentioned previously) and UCX_TCP_RX_SEG_SIZE, and I confirmed that increasing those reduces the number of copies+synchronizations, with an 8MB size reducing merge compute time from ~22 seconds to ~16 seconds (still marginally slower than regular Python sockets). When the segment sizes are increased I noticed that no copies occur in PtoP (meaning 100% of transfers are going over TCP), as seen below:
CUDA Memory Operation Statistics (nanoseconds)
Time(%) Total Time Operations Average Minimum Maximum Name
------- -------------- ---------- -------------- -------------- -------------- --------------------------------------------------------------------------------
60.4 3695125675 63249 58421.9 1760 3222347 [CUDA memcpy HtoD]
37.4 2291190454 92884 24667.2 1376 2480377 [CUDA memcpy DtoH]
2.2 132241437 54952 2406.5 1280 32032 [CUDA memcpy DtoD]
0.0 364322 160 2277.0 1792 9728 [CUDA memset]
CUDA Memory Operation Statistics (KiB)
Total Operations Average Minimum Maximum Name
------------------- -------------- ------------------- ----------------- ------------------- --------------------------------------------------------------------------------
25365207.218 63249 401.037 0.001 4469.922 [CUDA memcpy HtoD]
18847178.518 54952 342.975 0.004 4469.922 [CUDA memcpy DtoD]
16605241.842 92884 178.774 0.001 4469.922 [CUDA memcpy DtoH]
162.906 160 1.018 0.004 1.500 [CUDA memset]
I was not yet able to find how UCX determines whether a transfer should go over TCP or over NVLink, but I expected that for the example being discussed in this thread 100% of them would go over NVLink, given that we're not reading from or writing to host explicitly.
I will continue to look for answers here, and I was wondering if @Akshay-Venkatesh perhaps has an idea if we're doing something wrong or how we could better configure things in case this is a misconfiguration issue.
@pentschev Thanks for the details
What happens here is UCX is not always transferring data over NVLink, but over TCP (which incurs in DtoH and HtoD copies), even worse it breaks down each transfer into segments of 8KB, forcing a copy+stream synchronization for many times, making that process extremely slow. The stats of memory copy looks as follows:
This would happen if it's not possible to use the cuda_ipc transport. This could mean a couple of things -- no NVLINK or no PCIe-based peer accessibility; or it could mean that the memory handle obtained from the peer could not be opened at the mapping end for some reason. Some trace reports would help to see if we're falling in the latter category. The 8KB D->H and H->D transfers are part of the fallback rendezvous protocol in UCP (i.e. when a high performance path like NVLINK or RDMA fails). Why this occurs on a DGX-2 is something that needs to be found out.
I was not able yet to find how UCX is determining whether a transfer should go over TCP or over NVLink, but I expected that for the example being discussed in this thread 100% of them would go over NVLink, given that we're not reading from or writing to host explicitly.
I expect 100% to go over cuda_ipc as well, and you should be seeing PtoP traffic predominantly.
Were you able to repro this result in a standalone benchmark by any chance?
2-process ucx_pertest on current master seems to be working as expected. Correct me if I'm making assumptions.
==297280== NVPROF is profiling process 297280, command: ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
==292777== NVPROF is profiling process 292777, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
100 0.000 31.931 31.931 29866.35 29866.35 31317 31317
==292777== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
==292777== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 82.86% 443.37ms 2 221.68ms 8.3250us 443.36ms cudaMalloc
12.55% 67.171ms 1552 43.279us 103ns 2.9163ms cuDeviceGetAttribute
3.40% 18.187ms 16 1.1367ms 989.58us 1.1752ms cuDeviceTotalMem
0.97% 5.1798ms 16 323.74us 269.01us 335.27us cuDeviceGetName
0.18% 952.45us 2 476.23us 31.381us 921.07us cudaFree
0.01% 30.926us 110 281ns 225ns 3.8260us cuIpcGetMemHandle
0.01% 30.168us 110 274ns 224ns 2.1050us cuPointerGetAttributes
0.01% 28.661us 112 255ns 215ns 1.2740us cuMemGetAddressRange
0.00% 24.829us 112 221ns 183ns 2.3250us cuPointerSetAttribute
0.00% 22.185us 126 176ns 140ns 395ns cuDeviceGetUuid
0.00% 21.645us 16 1.3520us 726ns 5.9870us cuDeviceGetPCIBusId
0.00% 18.064us 112 161ns 151ns 255ns cuCtxGetDevice
0.00% 5.8220us 32 181ns 103ns 390ns cuDeviceGet
0.00% 5.4800us 1 5.4800us 5.4800us 5.4800us cudaSetDevice
0.00% 4.5580us 7 651ns 167ns 2.4310us cudaGetDeviceCount
0.00% 3.2320us 5 646ns 231ns 1.3340us cuDeviceGetCount
0.00% 798ns 2 399ns 192ns 606ns cuCtxGetCurrent
0.00% 692ns 2 346ns 181ns 511ns cuDevicePrimaryCtxGetState
==297280== Profiling application: ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
==297280== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.0293ms 110 9.3570us 9.1520us 10.624us [CUDA memcpy PtoP]
API calls: 81.96% 445.53ms 2 222.76ms 8.5990us 445.52ms cudaMalloc
12.80% 69.566ms 1552 44.823us 115ns 2.9094ms cuDeviceGetAttribute
3.33% 18.079ms 16 1.1299ms 857.21us 1.1895ms cuDeviceTotalMem
0.97% 5.2471ms 16 327.94us 278.28us 350.49us cuDeviceGetName
0.23% 1.2261ms 2593 472ns 371ns 5.7110us cuEventQuery
0.20% 1.0700ms 2 534.99us 30.002us 1.0400ms cudaFree
0.15% 793.71us 1 793.71us 793.71us 793.71us cuIpcOpenMemHandle
0.10% 551.29us 1 551.29us 551.29us 551.29us cuIpcCloseMemHandle
0.09% 498.03us 16 31.126us 1.7630us 455.24us cuStreamCreate
0.09% 471.86us 110 4.2890us 3.7760us 23.013us cuMemcpyDtoDAsync
0.02% 85.216us 16 5.3260us 4.5490us 13.059us cuStreamDestroy
0.02% 82.512us 460 179ns 157ns 474ns cuCtxGetDevice
0.01% 53.665us 110 487ns 424ns 3.0990us cuEventRecord
0.01% 40.689us 128 317ns 281ns 902ns cuEventCreate
0.01% 39.363us 110 357ns 326ns 858ns cuIpcGetMemHandle
0.01% 38.705us 112 345ns 301ns 2.7210us cuMemGetAddressRange
0.01% 36.570us 110 332ns 279ns 2.4450us cuPointerGetAttributes
0.01% 31.836us 128 248ns 229ns 797ns cuEventDestroy
0.01% 30.781us 112 274ns 244ns 2.1880us cuPointerSetAttribute
0.00% 23.897us 115 207ns 166ns 2.0950us cuDeviceGetCount
0.00% 23.796us 130 183ns 166ns 683ns cuDevicePrimaryCtxGetState
0.00% 23.452us 16 1.4650us 819ns 6.1190us cuDeviceGetPCIBusId
0.00% 20.676us 130 159ns 153ns 427ns cuCtxGetCurrent
0.00% 6.9260us 1 6.9260us 6.9260us 6.9260us cudaSetDevice
0.00% 6.9130us 32 216ns 132ns 405ns cuDeviceGet
0.00% 5.3340us 7 762ns 162ns 3.2010us cudaGetDeviceCount
0.00% 2.8440us 16 177ns 153ns 231ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
Performance looks reasonable as well:
[akvenkatesh@dgx2-03 build]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000 &
[2] 35613
[akvenkatesh@dgx2-03 build]$ [1580764116.835883] [dgx2-03:35613:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 40 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 1000000 |
+------------------------------------------------------------------------------------------+
100 0.000 21.999 21.999 43351.04 43351.04 45457 45457
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 1000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 1000000
[akvenkatesh@dgx2-03 build]$
[akvenkatesh@dgx2-03 build]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000 &
[1] 37281
[akvenkatesh@dgx2-03 build]$ [1580764126.139641] [dgx2-03:37281:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 40 cpus). Performance may be impacted.
Waiting for connection...
[akvenkatesh@dgx2-03 build]$
[akvenkatesh@dgx2-03 build]$
[akvenkatesh@dgx2-03 build]$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 10000000 &
[2] 38365
[akvenkatesh@dgx2-03 build]$ [1580764132.739735] [dgx2-03:38365:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 40 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+------------------------------------------------------------------------------------------+
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
| API: protocol layer |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 10000000 |
+------------------------------------------------------------------------------------------+
100 0.000 84.240 84.240 113208.62 113208.62 11871 11871
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc ./bin/ucx_perftest dgx2-03 -t tag_bw -m cuda -n 100 -s 10000000
Will update with an alltoall benchmark to see if something breaks maybe with 16 GPUs.
Alltoall looks to run as expected as well:
mpirun -np 16 -host dgx2-03 --oversubscribe -npernode 16 -x LD_LIBRARY_PATH --mca pml ucx --mca btl ^openib,smcuda -x UCX_MEMTYPE_CACHE=n -x UCX_RNDV_THRESH=8192 -x UCX_TLS=rc,cuda_copy,cuda_ipc,mm ./get_local_ompi_rank `which nvprof` ./mpi/collective/osu_alltoall -m 1048576:1048576 -d cuda &> dump
[akvenkatesh@dgx2-03 osu-micro-benchmarks-5.6.2]$ cat dump | grep -r "CUDA memcpy"
dump: GPU activities: 91.49% 16.768ms 1650 10.162us 9.6000us 11.585us [CUDA memcpy PtoP]
dump: 8.49% 1.5555ms 880 1.7670us 1.4720us 9.2480us [CUDA memcpy DtoD]
dump: GPU activities: 91.44% 16.779ms 1650 10.169us 9.6000us 11.136us [CUDA memcpy PtoP]
dump: 8.54% 1.5661ms 880 1.7790us 1.4720us 10.176us [CUDA memcpy DtoD]
dump: GPU activities: 91.49% 16.739ms 1650 10.145us 9.5680us 10.880us [CUDA memcpy PtoP]
dump: 8.49% 1.5527ms 880 1.7640us 1.4720us 2.6880us [CUDA memcpy DtoD]
dump: GPU activities: 91.43% 16.744ms 1650 10.147us 9.5050us 11.040us [CUDA memcpy PtoP]
dump: 8.55% 1.5653ms 880 1.7780us 1.4720us 9.4080us [CUDA memcpy DtoD]
dump: GPU activities: 91.54% 16.886ms 1650 10.233us 9.6320us 18.016us [CUDA memcpy PtoP]
dump: 8.44% 1.5575ms 880 1.7690us 1.4720us 9.0240us [CUDA memcpy DtoD]
dump: GPU activities: 91.48% 16.855ms 1650 10.215us 9.6010us 10.944us [CUDA memcpy PtoP]
dump: GPU activities: 91.49% 16.769ms 1650 10.163us 9.5680us 21.504us [CUDA memcpy PtoP]
dump: 8.50% 1.5658ms 880 1.7790us 1.5040us 2.4960us [CUDA memcpy DtoD]
dump: GPU activities: 91.56% 16.820ms 1650 10.193us 9.5680us 11.488us [CUDA memcpy PtoP]
dump: 8.42% 1.5463ms 880 1.7570us 1.4720us 8.7040us [CUDA memcpy DtoD]
dump: 8.49% 1.5555ms 880 1.7670us 1.4720us 9.7920us [CUDA memcpy DtoD]
dump: GPU activities: 91.50% 16.833ms 1650 10.201us 9.5680us 11.200us [CUDA memcpy PtoP]
dump: 8.48% 1.5597ms 880 1.7720us 1.4720us 14.945us [CUDA memcpy DtoD]
dump: GPU activities: 91.56% 16.879ms 1650 10.229us 9.6320us 11.872us [CUDA memcpy PtoP]
dump: 8.41% 1.5512ms 880 1.7620us 1.4720us 2.7200us [CUDA memcpy DtoD]
dump: GPU activities: 91.51% 16.868ms 1650 10.223us 9.6640us 21.696us [CUDA memcpy PtoP]
dump: GPU activities: 91.51% 16.771ms 1650 10.164us 9.5680us 11.040us [CUDA memcpy PtoP]
dump: 8.47% 1.5522ms 880 1.7630us 1.4720us 9.6970us [CUDA memcpy DtoD]
dump: 8.47% 1.5611ms 880 1.7730us 1.4720us 14.144us [CUDA memcpy DtoD]
dump: GPU activities: 91.52% 16.835ms 1650 10.202us 9.6000us 11.104us [CUDA memcpy PtoP]
dump: 8.46% 1.5562ms 880 1.7680us 1.5040us 2.9760us [CUDA memcpy DtoD]
dump: GPU activities: 91.53% 16.754ms 1650 10.153us 9.4720us 11.168us [CUDA memcpy PtoP]
dump: 8.45% 1.5464ms 880 1.7570us 1.4720us 10.112us [CUDA memcpy DtoD]
dump: GPU activities: 91.44% 16.749ms 1650 10.151us 9.5360us 11.104us [CUDA memcpy PtoP]
dump: 8.54% 1.5651ms 880 1.7780us 1.4720us 9.7920us [CUDA memcpy DtoD]
dump: GPU activities: 91.57% 16.822ms 1650 10.195us 9.6320us 17.888us [CUDA memcpy PtoP]
dump: 8.41% 1.5455ms 880 1.7560us 1.4720us 2.7200us [CUDA memcpy DtoD]
Some trace reports would help to see if we're falling in the latter category.
I've been trying to go through the logs today, but there's a lot there I don't quite understand yet, do you have any hints on what we could be looking for?
As for reproducers, the easiest one is to use ucx_perftest, but it requires a small enough transfer size. For example, running with a size of 1MB I will always see PtoP, but when running with a reasonably small size (100KB or less), I will always see DtoH/HtoD. I was also able to reproduce this with a simple UCX-Py server/client code, but sizes were different then, with 40KB and up going PtoP and 20KB or less going DtoH/HtoD (didn't test anything in between 20 and 40KB). If you're interested I can share the Python code, but it seems like running with ucx_perftest with small enough transfers should have the same effect. It's still a mystery to me why such sizes vary from one application to another though. The tests I described here were all performed on a DGX-1 using only GPUs 0 and 1.
Some trace reports would help to see if we're falling in the latter category.
I've been trying to go through the logs today, but there's a lot there I don't quite understand yet, do you have any hints on what we could be looking for?
I would look for registration failure notifications from the cuda_ipc transport. You could grep through instances of cuda_ipc_md.c or cuda_ipc_cache.c to see if you notice registration failures or errors opening memory handles.
As for reproducers, the easiest one is to use ucx_perftest, but it requires a small enough transfer size. For example, running with a size of 1MB I will always see PtoP, but when running with a reasonably small size (100KB or less), I will always see DtoH/HtoD. I was also able to reproduce this with a simple UCX-Py server/client code, but sizes were different then, with 40KB and up going PtoP and 20KB or less going DtoH/HtoD (didn't test anything in between 20 and 40KB). If you're interested I can share the Python code, but it seems like running with ucx_perftest with small enough transfers should have the same effect. It's still a mystery to me why such sizes vary from one application to another though. The tests I described here were all performed on a DGX-1 using only GPUs 0 and 1.
@pentschev Can you check if I'm not using the env variables that you use? I don't see an issue with ucx_perftest with 100K or 10K.
100K:
==278438== ==278402== NVPROF is profiling process 278438, command: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 100000
NVPROF is profiling process 278402, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 100000
100 0.000 26.200 26.200 3640.00 3640.00 38168 38168
==278402== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 100000
==278402== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 81.94% 459.52ms 2 229.76ms 7.8410us 459.51ms cudaMalloc
13.28% 74.461ms 1552 47.977us 107ns 2.8405ms cuDeviceGetAttribute
3.46% 19.408ms 16 1.2130ms 1.1445ms 1.4109ms cuDeviceTotalMem
1.13% 6.3248ms 16 395.30us 328.37us 495.10us cuDeviceGetName
0.16% 919.77us 2 459.89us 21.825us 897.95us cudaFree
0.01% 29.488us 110 268ns 207ns 3.7290us cuIpcGetMemHandle
0.00% 26.155us 110 237ns 188ns 2.3020us cuPointerGetAttributes
0.00% 26.118us 112 233ns 196ns 1.1900us cuMemGetAddressRange
0.00% 25.794us 16 1.6120us 851ns 5.0020us cuDeviceGetPCIBusId
0.00% 22.919us 112 204ns 166ns 2.2910us cuPointerSetAttribute
0.00% 22.622us 126 179ns 142ns 559ns cuDeviceGetUuid
0.00% 17.974us 112 160ns 140ns 300ns cuCtxGetDevice
0.00% 6.3480us 1 6.3480us 6.3480us 6.3480us cudaSetDevice
0.00% 5.7830us 32 180ns 114ns 400ns cuDeviceGet
0.00% 4.2680us 7 609ns 133ns 2.2300us cudaGetDeviceCount
0.00% 3.0450us 5 609ns 109ns 1.4950us cuDeviceGetCount
0.00% 685ns 2 342ns 189ns 496ns cuCtxGetCurrent
0.00% 441ns 2 220ns 144ns 297ns cuDevicePrimaryCtxGetState
==278438== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 100000
==278438== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 342.18us 110 3.1100us 3.0080us 4.0320us [CUDA memcpy PtoP]
API calls: 81.07% 459.34ms 2 229.67ms 8.5820us 459.34ms cudaMalloc
13.67% 77.483ms 1552 49.924us 116ns 2.8541ms cuDeviceGetAttribute
3.39% 19.233ms 16 1.2021ms 805.25us 1.4889ms cuDeviceTotalMem
0.97% 5.4941ms 16 343.38us 297.27us 489.64us cuDeviceGetName
0.20% 1.1113ms 2 555.63us 36.031us 1.0752ms cudaFree
0.15% 859.67us 1 859.67us 859.67us 859.67us cuIpcOpenMemHandle
0.15% 836.96us 1689 495ns 348ns 11.004us cuEventQuery
0.10% 575.64us 16 35.977us 1.8620us 527.82us cuStreamCreate
0.10% 553.14us 1 553.14us 553.14us 553.14us cuIpcCloseMemHandle
0.10% 547.57us 110 4.9770us 4.1250us 29.591us cuMemcpyDtoDAsync
0.02% 96.098us 16 6.0060us 4.9500us 14.927us cuStreamDestroy
0.01% 80.943us 460 175ns 146ns 941ns cuCtxGetDevice
0.01% 52.325us 110 475ns 401ns 3.2800us cuEventRecord
0.01% 49.908us 128 389ns 278ns 1.0000us cuEventCreate
0.01% 39.616us 128 309ns 239ns 717ns cuEventDestroy
0.01% 36.986us 110 336ns 295ns 1.3200us cuIpcGetMemHandle
0.01% 34.920us 112 311ns 275ns 780ns cuMemGetAddressRange
0.01% 33.906us 110 308ns 250ns 2.9390us cuPointerGetAttributes
0.01% 29.237us 112 261ns 221ns 2.2320us cuPointerSetAttribute
0.00% 25.651us 130 197ns 183ns 636ns cuDevicePrimaryCtxGetState
0.00% 23.243us 16 1.4520us 697ns 6.6430us cuDeviceGetPCIBusId
0.00% 21.337us 130 164ns 157ns 516ns cuCtxGetCurrent
0.00% 21.250us 115 184ns 150ns 969ns cuDeviceGetCount
0.00% 7.8710us 1 7.8710us 7.8710us 7.8710us cudaSetDevice
0.00% 6.8170us 32 213ns 144ns 540ns cuDeviceGet
0.00% 5.4290us 7 775ns 131ns 3.4120us cudaGetDeviceCount
0.00% 3.2900us 16 205ns 156ns 572ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 100000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 100000
10K:
==278664== NVPROF is profiling process 278664, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==278699== NVPROF is profiling process 278699, command: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
100 0.000 24.152 24.152 394.87 394.87 41405 41405
==278664== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==278664== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 82.28% 473.66ms 2 236.83ms 10.215us 473.65ms cudaMalloc
12.95% 74.519ms 1552 48.014us 103ns 2.7595ms cuDeviceGetAttribute
3.47% 19.986ms 16 1.2491ms 663.96us 2.5323ms cuDeviceTotalMem
1.11% 6.3640ms 16 397.75us 165.70us 661.48us cuDeviceGetName
0.16% 913.62us 2 456.81us 26.705us 886.91us cudaFree
0.00% 28.552us 110 259ns 212ns 4.0810us cuIpcGetMemHandle
0.00% 27.676us 110 251ns 180ns 6.0480us cuPointerGetAttributes
0.00% 27.349us 112 244ns 211ns 1.9520us cuMemGetAddressRange
0.00% 23.671us 112 211ns 168ns 2.2680us cuPointerSetAttribute
0.00% 22.160us 16 1.3850us 699ns 6.0630us cuDeviceGetPCIBusId
0.00% 21.742us 126 172ns 143ns 726ns cuDeviceGetUuid
0.00% 16.242us 112 145ns 137ns 244ns cuCtxGetDevice
0.00% 7.9580us 1 7.9580us 7.9580us 7.9580us cudaSetDevice
0.00% 6.9000us 32 215ns 118ns 738ns cuDeviceGet
0.00% 5.7260us 7 818ns 147ns 3.7480us cudaGetDeviceCount
0.00% 2.2730us 5 454ns 241ns 843ns cuDeviceGetCount
0.00% 709ns 2 354ns 158ns 551ns cuDevicePrimaryCtxGetState
0.00% 681ns 2 340ns 161ns 520ns cuCtxGetCurrent
==278699== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
==278699== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 247.91us 110 2.2530us 2.2080us 2.6570us [CUDA memcpy PtoP]
API calls: 82.06% 469.13ms 2 234.56ms 9.9870us 469.12ms cudaMalloc
12.61% 72.115ms 1552 46.465us 101ns 3.4354ms cuDeviceGetAttribute
3.44% 19.692ms 16 1.2307ms 650.41us 1.7946ms cuDeviceTotalMem
1.01% 5.7823ms 16 361.39us 164.06us 656.16us cuDeviceGetName
0.19% 1.0938ms 2 546.90us 32.128us 1.0617ms cudaFree
0.15% 850.92us 1 850.92us 850.92us 850.92us cuIpcOpenMemHandle
0.15% 843.02us 1764 477ns 342ns 6.7340us cuEventQuery
0.10% 577.28us 16 36.080us 1.6030us 530.96us cuStreamCreate
0.10% 564.59us 1 564.59us 564.59us 564.59us cuIpcCloseMemHandle
0.09% 501.32us 110 4.5570us 3.9720us 26.119us cuMemcpyDtoDAsync
0.02% 90.220us 16 5.6380us 4.5560us 13.454us cuStreamDestroy
0.01% 72.425us 461 157ns 132ns 899ns cuCtxGetDevice
0.01% 46.347us 110 421ns 368ns 3.4610us cuEventRecord
0.01% 45.085us 129 349ns 247ns 875ns cuEventCreate
0.01% 39.424us 129 305ns 228ns 1.1890us cuEventDestroy
0.01% 33.959us 112 303ns 264ns 822ns cuMemGetAddressRange
0.01% 33.356us 110 303ns 222ns 5.6730us cuPointerGetAttributes
0.01% 32.970us 110 299ns 249ns 1.0340us cuIpcGetMemHandle
0.00% 26.858us 112 239ns 198ns 2.0650us cuPointerSetAttribute
0.00% 21.057us 131 160ns 147ns 441ns cuDevicePrimaryCtxGetState
0.00% 20.830us 16 1.3010us 711ns 4.7010us cuDeviceGetPCIBusId
0.00% 19.443us 115 169ns 144ns 780ns cuDeviceGetCount
0.00% 18.469us 131 140ns 133ns 432ns cuCtxGetCurrent
0.00% 6.9110us 1 6.9110us 6.9110us 6.9110us cudaSetDevice
0.00% 6.2780us 32 196ns 111ns 426ns cuDeviceGet
0.00% 5.3400us 7 762ns 136ns 3.0850us cudaGetDeviceCount
0.00% 3.0800us 16 192ns 131ns 293ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=rc,mm,cuda_copy,cuda_ipc `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
I'm still seeing 100% activity on the PtoP channel. This is all on a DGX-2 and NOT a DGX-1. On a DGX-1 you can expect non-PtoP for devices that are not IPC accessible (like GPUs 0 and 5, for example). That said, I definitely don't see issues with DGX-1 for peer accessible GPUs.
Interesting, using the same variables as you do I see PtoP even for buffers as small as 10KB -- meaning that works. The variables I've been using (and that we're using in Dask generally) are: UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm. I tried quickly to use the same variables that you're using in Dask, but it didn't immediately work (I see some crashes); I'll have to debug it further.
[Ordering in UCX_TLS does not matter, correct?]
For reference, the data we send typically looks like the following (a mix of host and GPU memory):
In [17]: frames
Out[17]:
[b'',
b'\x80',
b'\x83\xa7headers\x81\x91\xa4data\x8a\xa4type\xc4$\x80\x03ccudf.core.dataframe\nDataFrame\nq\x00.\xa5index\x85\xacindex_column\x83\xa4type\xc41\x80\x03ccudf.core.column.numerical\nNumericalColumn\nq\x00.\xa5dtype\xa3<i8\xabframe_count\x01\xa4name\xc4\x04\x80\x03N.\xa5dtype\xc4C\x80\x03cnumpy\ndtype\nq\x00X\x02\x00\x00\x00i8q\x01K\x00K\x01\x87q\x02Rq\x03(K\x03X\x01\x00\x00\x00<q\x04NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\x05b.\xa4type\xc4#\x80\x03ccudf.core.index\nGenericIndex\nq\x00.\xabframe_count\x01\xb1index_frame_count\x01\xaccolumn_names\xc4\x16\x80\x03X\x01\x00\x00\x00aq\x00X\x01\x00\x00\x00bq\x01\x86q\x02.\xa7columns\x92\x83\xa4type\xc41\x80\x03ccudf.core.column.numerical\nNumericalColumn\nq\x00.\xa5dtype\xa3<f8\xabframe_count\x01\x83\xa4type\xc41\x80\x03ccudf.core.column.numerical\nNumericalColumn\nq\x00.\xa5dtype\xa3<i8\xabframe_count\x01\xaftype-serialized\xc40\x80\x04\x95%\x00\x00\x00\x00\x00\x00\x00\x8c\x13cudf.core.dataframe\x94\x8c\tDataFrame\x94\x93\x94.\xaaserializer\xa4cuda\xabcompression\x93\xc0\xc0\xc0\xa7lengths\x93\xce\x00z\x12\x00\xce\x00z\x12\x00\xce\x00z\x12\x00\xa5count\x03\xa4keys\x91\x91\xa4data\xabbytestrings\x90',
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f24fd997c50>,
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f24fd99b110>,
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f24fd99b290>]
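For context, a header/frames pair like the one above can be produced with Dask's serialization machinery; a small sketch (the DataFrame contents are illustrative):

import cudf
from distributed.protocol import serialize

# Tiny illustrative frame; real workloads carry much larger columns
df = cudf.DataFrame({"a": [1.0, 2.0, 3.0], "b": [1, 2, 3]})

# Asking for the "cuda" serializer family first keeps device buffers on the GPU;
# the header is host-side metadata and the frames are a mix of bytes and device objects
header, frames = serialize(df, serializers=["cuda", "dask", "pickle"])
print(header["serializer"])
print([type(f) for f in frames])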
This is a bit orthogonal to the present conversation (though it relates directly to acquiring contexts, which came up earlier). @madsbk put together PR ( https://github.com/numba/numba/pull/5189 ), which should make checking __cuda_array_interface__ on Numba DeviceNDArrays not require a context (IOW not calling cuDevicePrimaryCtxRetain). Could be causing some of the overhead we are seeing here.
Same results with the parameters you provided @pentschev (UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm).
10M:
==304392== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000
==304392== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 83.37% 438.38ms 2 219.19ms 570.77us 437.81ms cudaMalloc
11.89% 62.533ms 1552 40.292us 108ns 1.8181ms cuDeviceGetAttribute
3.36% 17.678ms 16 1.1049ms 645.88us 1.1560ms cuDeviceTotalMem
0.95% 5.0202ms 16 313.76us 162.19us 336.13us cuDeviceGetName
0.37% 1.9695ms 2 984.76us 927.43us 1.0421ms cudaFree
0.01% 33.851us 110 307ns 269ns 2.1330us cuPointerGetAttributes
0.01% 33.243us 112 296ns 261ns 1.9720us cuMemGetAddressRange
0.01% 33.220us 110 302ns 257ns 3.5170us cuIpcGetMemHandle
0.00% 25.926us 112 231ns 191ns 2.4050us cuPointerSetAttribute
0.00% 22.709us 126 180ns 147ns 444ns cuDeviceGetUuid
0.00% 21.628us 112 193ns 155ns 3.3630us cuCtxGetDevice
0.00% 20.914us 16 1.3070us 761ns 5.8450us cuDeviceGetPCIBusId
0.00% 8.8660us 32 277ns 119ns 3.1790us cuDeviceGet
0.00% 5.9510us 1 5.9510us 5.9510us 5.9510us cudaSetDevice
0.00% 4.5130us 7 644ns 145ns 2.2140us cudaGetDeviceCount
0.00% 2.8480us 5 569ns 126ns 1.0750us cuDeviceGetCount
0.00% 866ns 2 433ns 251ns 615ns cuCtxGetCurrent
0.00% 773ns 2 386ns 221ns 552ns cuDevicePrimaryCtxGetState
==304427== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000000
==304427== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 7.8750ms 110 71.591us 71.329us 72.801us [CUDA memcpy PtoP]
API calls: 81.69% 437.28ms 2 218.64ms 864.81us 436.41ms cudaMalloc
12.35% 66.104ms 1552 42.592us 111ns 1.8033ms cuDeviceGetAttribute
3.29% 17.617ms 16 1.1011ms 658.89us 1.1626ms cuDeviceTotalMem
0.96% 5.1255ms 16 320.34us 163.96us 368.76us cuDeviceGetName
0.54% 2.8884ms 7078 408ns 348ns 3.7320us cuEventQuery
0.41% 2.1685ms 2 1.0842ms 1.0495ms 1.1190ms cudaFree
0.35% 1.8490ms 1 1.8490ms 1.8490ms 1.8490ms cuIpcOpenMemHandle
0.11% 603.28us 110 5.4840us 4.8090us 28.923us cuMemcpyDtoDAsync
0.10% 560.34us 16 35.021us 1.6480us 513.85us cuStreamCreate
0.10% 553.29us 1 553.29us 553.29us 553.29us cuIpcCloseMemHandle
0.02% 94.678us 16 5.9170us 4.6620us 17.035us cuStreamDestroy
0.01% 79.139us 460 172ns 136ns 822ns cuCtxGetDevice
0.01% 54.332us 110 493ns 432ns 3.3140us cuEventRecord
0.01% 46.309us 128 361ns 244ns 907ns cuEventCreate
0.01% 42.095us 110 382ns 355ns 1.3010us cuIpcGetMemHandle
0.01% 39.017us 110 354ns 302ns 2.7690us cuPointerGetAttributes
0.01% 37.881us 128 295ns 232ns 1.0330us cuEventDestroy
0.01% 34.031us 112 303ns 278ns 983ns cuMemGetAddressRange
0.01% 30.684us 112 273ns 239ns 1.9910us cuPointerSetAttribute
0.00% 21.809us 115 189ns 155ns 1.6400us cuDeviceGetCount
0.00% 21.661us 130 166ns 155ns 401ns cuDevicePrimaryCtxGetState
0.00% 20.980us 16 1.3110us 763ns 5.4060us cuDeviceGetPCIBusId
0.00% 19.123us 130 147ns 137ns 682ns cuCtxGetCurrent
0.00% 7.5260us 1 7.5260us 7.5260us 7.5260us cudaSetDevice
0.00% 6.2510us 32 195ns 117ns 445ns cuDeviceGet
0.00% 4.8820us 7 697ns 128ns 3.1100us cudaGetDeviceCount
0.00% 2.9790us 16 186ns 150ns 301ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000000
10K:
==304779== NVPROF is profiling process 304779, command: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
==304743== NVPROF is profiling process 304743, command: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
100 0.000 46.432 46.432 205.39 205.39 21537 21537
==304743== Profiling application: ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==304743== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 82.45% 448.56ms 2 224.28ms 9.8690us 448.55ms cudaMalloc
12.97% 70.563ms 1552 45.465us 104ns 2.9980ms cuDeviceGetAttribute
3.30% 17.961ms 16 1.1226ms 840.76us 1.3979ms cuDeviceTotalMem
1.06% 5.7766ms 16 361.04us 316.69us 828.20us cuDeviceGetName
0.17% 937.56us 2 468.78us 28.030us 909.53us cudaFree
0.01% 36.815us 110 334ns 274ns 3.2790us cuIpcGetMemHandle
0.01% 36.798us 110 334ns 280ns 2.3640us cuPointerGetAttributes
0.01% 33.805us 112 301ns 264ns 1.0570us cuMemGetAddressRange
0.01% 27.361us 112 244ns 199ns 2.3790us cuPointerSetAttribute
0.00% 25.180us 16 1.5730us 707ns 5.8900us cuDeviceGetPCIBusId
0.00% 23.105us 126 183ns 152ns 291ns cuDeviceGetUuid
0.00% 18.990us 112 169ns 158ns 278ns cuCtxGetDevice
0.00% 8.6330us 1 8.6330us 8.6330us 8.6330us cudaSetDevice
0.00% 6.5090us 7 929ns 151ns 3.6160us cudaGetDeviceCount
0.00% 6.2310us 32 194ns 110ns 421ns cuDeviceGet
0.00% 2.3970us 5 479ns 230ns 818ns cuDeviceGetCount
0.00% 834ns 2 417ns 324ns 510ns cuCtxGetCurrent
0.00% 721ns 2 360ns 222ns 499ns cuDevicePrimaryCtxGetState
==304779== Profiling application: ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
==304779== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 248.64us 110 2.2600us 2.2080us 2.7840us [CUDA memcpy PtoP]
API calls: 82.24% 448.74ms 2 224.37ms 9.6500us 448.73ms cudaMalloc
12.55% 68.466ms 1552 44.114us 103ns 2.9920ms cuDeviceGetAttribute
3.28% 17.922ms 16 1.1201ms 818.38us 1.4155ms cuDeviceTotalMem
1.08% 5.8701ms 16 366.88us 273.04us 1.0082ms cuDeviceGetName
0.20% 1.1168ms 2 558.41us 31.612us 1.0852ms cudaFree
0.15% 841.04us 1 841.04us 841.04us 841.04us cuIpcOpenMemHandle
0.10% 549.31us 16 34.331us 1.7210us 501.79us cuStreamCreate
0.10% 540.45us 1 540.45us 540.45us 540.45us cuIpcCloseMemHandle
0.10% 527.22us 110 4.7920us 4.3420us 30.328us cuMemcpyDtoDAsync
0.09% 514.88us 785 655ns 372ns 10.039us cuEventQuery
0.02% 87.440us 16 5.4650us 4.5070us 14.245us cuStreamDestroy
0.02% 81.989us 461 177ns 131ns 568ns cuCtxGetDevice
0.01% 54.709us 129 424ns 261ns 5.2310us cuEventCreate
0.01% 52.930us 110 481ns 407ns 3.1770us cuEventRecord
0.01% 39.507us 110 359ns 297ns 1.3080us cuIpcGetMemHandle
0.01% 38.424us 110 349ns 286ns 2.9290us cuPointerGetAttributes
0.01% 36.447us 112 325ns 288ns 770ns cuMemGetAddressRange
0.01% 36.214us 129 280ns 226ns 666ns cuEventDestroy
0.01% 29.974us 112 267ns 227ns 2.2360us cuPointerSetAttribute
0.00% 22.874us 16 1.4290us 842ns 4.5790us cuDeviceGetPCIBusId
0.00% 21.202us 131 161ns 148ns 560ns cuDevicePrimaryCtxGetState
0.00% 21.128us 115 183ns 154ns 764ns cuDeviceGetCount
0.00% 18.667us 131 142ns 135ns 533ns cuCtxGetCurrent
0.00% 7.3690us 32 230ns 113ns 492ns cuDeviceGet
0.00% 5.8050us 1 5.8050us 5.8050us 5.8050us cudaSetDevice
0.00% 5.1700us 7 738ns 138ns 3.0220us cudaGetDeviceCount
0.00% 3.3550us 16 209ns 154ns 336ns cuDeviceGetUuid
[1]- Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
[2]+ Done UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm `which nvprof` ./bin/ucx_perftest dgx2-02 -t tag_bw -m cuda -n 100 -s 10000
Lmk if I've missed something or made some mistake somewhere.
The comment from @jakirkham is interesting because UCX assumes that the current CUDA context is the one which was used to allocate memory resources (and tries to obtain memory handles with respect to that). If you allocated memory using a different context (possibly in a different thread), then calling getmemhandle would fail and you would end up not using cuda_ipc effectively. Could that be happening for the dask case?
Yes that sounds plausible. Thanks for that insight Akshay!
So we should change the current context back to the one used for allocating before sending, is that right? Or is there a way for us to inform UCX of the context it should use?
Yes that sounds plausible. Thanks for that insight Akshay!
So we should change the current context back to the one used for allocating before sending, is that right? Or is there a way for us to inform UCX of the context it should use?
There isn't a way today to indicate to UCX (through a parameter for instance) the context used for allocating memory. The expectation is that the calling thread have the specific context at the top of its stack before making the UCX call.
After some digging it seems RMM (our memory pool manager) only uses the runtime API. So it doesn't have a context itself. I'm not sure if there is a good way to get the context when using the runtime API or from the memory allocations themselves. Suggestions welcome 🙂
Actually, we might be able to query this information from the Driver API using cuPointerGetAttribute. Would it be possible to handle/add this within UCX itself?
Could you let me know what version of UCX you're using? I ran again the same as you and I still see DtoH/HtoD for 10KB, using the same command (except adapting for different paths in my case):
$ UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm /usr/local/cuda-10.1/bin/nvprof ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
[1580821729.763074] [dgx11:50030:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 10000 |
+------------------------------------------------------------------------------------------+
==50030== NVPROF is profiling process 50030, command: ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==50030== Profiling application: ucx_perftest -t tag_bw -m cuda -n 100 -s 10000
==50030== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 458.70us 220 2.0850us 1.6950us 2.8790us [CUDA memcpy HtoD]
API calls: 88.06% 507.63ms 2 253.82ms 9.2140us 507.63ms cudaMalloc
6.80% 39.174ms 776 50.481us 119ns 3.2212ms cuDeviceGetAttribute
3.38% 19.465ms 8 2.4332ms 1.2599ms 4.1953ms cuDeviceTotalMem
1.03% 5.9627ms 8 745.34us 291.41us 1.3211ms cuDeviceGetName
0.34% 1.9565ms 220 8.8930us 8.1550us 23.290us cudaStreamSynchronize
0.22% 1.2764ms 220 5.8010us 3.9990us 24.022us cudaMemcpyAsync
0.15% 852.10us 2 426.05us 22.697us 829.40us cudaFree
0.01% 49.537us 110 450ns 330ns 6.3410us cuPointerGetAttributes
0.01% 39.525us 112 352ns 248ns 7.2160us cuPointerSetAttribute
0.00% 18.755us 1 18.755us 18.755us 18.755us cudaSetDevice
0.00% 16.804us 1 16.804us 16.804us 16.804us cudaStreamDestroy
0.00% 14.911us 1 14.911us 14.911us 14.911us cudaStreamCreateWithFlags
0.00% 12.799us 8 1.5990us 1.2380us 3.3530us cuDeviceGetPCIBusId
0.00% 5.8330us 16 364ns 179ns 653ns cuDeviceGet
0.00% 3.7620us 2 1.8810us 540ns 3.2220us cuMemGetAddressRange
0.00% 3.3320us 5 666ns 139ns 1.1820us cuDeviceGetCount
0.00% 2.3820us 5 476ns 124ns 759ns cudaGetDeviceCount
0.00% 2.3790us 8 297ns 210ns 702ns cuDeviceGetUuid
0.00% 1.4730us 2 736ns 456ns 1.0170us cuCtxGetCurrent
0.00% 1.0200us 2 510ns 330ns 690ns cuDevicePrimaryCtxGetState
0.00% 726ns 2 363ns 253ns 473ns cuCtxGetDevice
$ UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm /usr/local/cuda-10.1/bin/nvprof ucx_perftest dgx11 -t tag_bw -m cuda -n 100 -s 10000
[1580821735.327625] [dgx11:50112:0] perftest.c:1416 UCX WARN CPU affinity is not set (bound to 80 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
==50112== NVPROF is profiling process 50112, command: ucx_perftest dgx11 -t tag_bw -m cuda -n 100 -s 10000
100 0.000 61.769 61.769 154.39 154.39 16189 16189
==50112== Profiling application: ucx_perftest dgx11 -t tag_bw -m cuda -n 100 -s 10000
==50112== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 412.59us 220 1.8750us 1.6960us 3.5200us [CUDA memcpy DtoH]
API calls: 88.09% 500.50ms 2 250.25ms 8.6370us 500.49ms cudaMalloc
7.78% 44.220ms 776 56.984us 119ns 4.1772ms cuDeviceGetAttribute
2.65% 15.044ms 8 1.8805ms 1.1896ms 2.5819ms cuDeviceTotalMem
0.66% 3.7769ms 220 17.167us 15.858us 36.109us cudaMemcpyAsync
0.55% 3.1409ms 8 392.62us 119.54us 888.50us cuDeviceGetName
0.13% 717.44us 2 358.72us 16.247us 701.20us cudaFree
0.10% 554.59us 220 2.5200us 2.3380us 11.097us cudaStreamSynchronize
0.02% 86.939us 110 790ns 649ns 2.7950us cuPointerGetAttributes
0.01% 40.389us 112 360ns 335ns 1.3690us cuPointerSetAttribute
0.00% 21.360us 16 1.3350us 169ns 17.349us cuDeviceGet
0.00% 19.180us 1 19.180us 19.180us 19.180us cudaStreamDestroy
0.00% 19.126us 1 19.126us 19.126us 19.126us cudaStreamCreateWithFlags
0.00% 12.666us 8 1.5830us 1.2010us 2.8990us cuDeviceGetPCIBusId
0.00% 5.0400us 1 5.0400us 5.0400us 5.0400us cudaSetDevice
0.00% 2.9710us 5 594ns 191ns 888ns cuDeviceGetCount
0.00% 2.6390us 5 527ns 159ns 796ns cudaGetDeviceCount
0.00% 2.3940us 2 1.1970us 397ns 1.9970us cuMemGetAddressRange
0.00% 2.1490us 8 268ns 231ns 383ns cuDeviceGetUuid
0.00% 2.0860us 2 1.0430us 522ns 1.5640us cuCtxGetCurrent
0.00% 1.4850us 2 742ns 332ns 1.1530us cuDevicePrimaryCtxGetState
0.00% 707ns 2 353ns 252ns 455ns cuCtxGetDevice
The output above is from a DGX-1, but I see the same on a DGX-2. I've been using UCX master (commit 1c75cfdc377680a90b613deabbcca61fb0c050f7) and tried upgrading to latest (commit 5ef72d9f2a3648b19b80333c1ff496987173b9d1), but see the same result in both cases.
Regarding the context discussion: we fixed a lot of context issues with Dask, where threads would be spawned before a CUDA context was created, which meant that the new thread didn't have a context assigned and would ultimately lead to a crash in UCX. All these issues have been fixed, otherwise we would be seeing crashes all the time. IOW, I don't think there are any CUDA context related issues anymore. We're creating a context for each device at Dask startup, so unless someone is mistakenly creating a context during runtime, I don't believe there is more than one context created at any time.
I am seeing the same as @pentschev on both master and v1.7. I built master with following configuration:
../contrib/configure-devel --prefix=$CONDA_PREFIX --with-cuda=$CUDA_HOME --enable-mt CPPFLAGS="-I/$CUDA_HOME/include"
I've been reading through a lot of UCX CUDA code, and I have an idea. Be warned it may be a bad idea; I don't know enough about CUDA and IPC transfers. It is possible to register memory with UCX via ucp_mem_map(); it's usually used with RMA operations to register memory with the NIC. I wonder if it would help performance to have RMM register the CUDA memory pools with UCX itself and act as a cache at a higher level. My main fear would be that it would confuse the rendezvous pipeline with the memory pools already registered.
I've gone well past what I know about cuda though, so I could be horribly wrong, but wanted to float UCX APIs that could be useful.
@quasiben and I found that manually setting UCX_RNDV_THRESH to a really low value (e.g., 1) works around the issue. IOW, all worker-worker transfers seem to be going over PtoP in that situation. It also solves those cases we mentioned above where, depending on configuration, a size of 10000 would not go over PtoP.
The transports enabled have an effect on the rendezvous threshold when it's set to auto (the default). For example, on a DGX-1, having UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc will show ucp_ep.c:1222 UCX TRACE rndv threshold is 474127 (send_nbr: 262144), but if we add mm to UCX_TLS as well, then we'll see ucp_ep.c:1222 UCX TRACE rndv threshold is 8256 (send_nbr: 262144).
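As a stopgap, one way to apply this workaround from Python is to export the variable before UCX gets initialized, for example when starting a local cluster (the LocalCUDACluster arguments below are just illustrative):

import os

# UCX reads this at initialization time, so set it before the scheduler,
# workers, or any ucp import picks up the UCX configuration
os.environ["UCX_RNDV_THRESH"] = "1"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    protocol="ucx",
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
)
client = Client(cluster)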
I checked that setting the threshold indeed decreases the runtime to 18 seconds (from about 22 before), but it's still slower than Python sockets for this particular example. I wasn't yet able to give a better look at trace and profiling it, but will do that tomorrow.
Thanks also @MattBBaker for the suggestion. I think the way CUDA memory pools are handled today with UCX seems to be working fine. I was already able to confirm that UCX's CUDA IPC cache is performing well (no unmapping/remapping of handles) as long as we do use a memory pool (i.e., not using one is generally very slow due to the frequent opening and closing of handles). I'm not totally sure this is an answer to your suggestion though.
@quasiben and I found that manually setting UCX_RNDV_THRESH to a really low value (e.g., 1) works around the issue.
Thanks @pentschev! 😄
Should we bake this into dask-cuda or ucx-py as a workaround near term? Would this address the same need as PR ( https://github.com/rapidsai/ucx-py/pull/406 ) or is there still more to do to make things functional near term?
cc @madsbk
I'm not sure yet @jakirkham . I'm doing some profiling still to verify that everything works correctly, plus I'll need to check whether this affects other transports, so it may take a while until we're confident this is the right solution. There's still some lower performance when compared to TCP and I'm currently investigating that.
Dask-cudf multi partition merge slows down with ucx
Dask-cudf merge seems to slow down with ucx. Wall time: (15.4 seconds on tcp) vs (37.8 s on ucx) (exp-01). In the attached example we see a slowdown with ucx vs just using tcp.
Wall Times on exp-01
UCX Time
TCP times
Repro Code:
Helper Function to create distributed dask-cudf frame
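A minimal sketch of what such a helper might look like (column names, sizes, and partition counts are illustrative, not the exact code used for the timings above):

import numpy as np
import cudf
import dask_cudf

def make_random_frame(nrows, npartitions=16):
    # Random integer join key plus a float payload column
    df = cudf.DataFrame(
        {
            "key": np.random.randint(0, nrows, size=nrows),
            "payload": np.random.random(nrows),
        }
    )
    # Spread the cuDF frame across the cluster as a dask_cudf collection
    return dask_cudf.from_cudf(df, npartitions=npartitions)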
RMM Setup:
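Roughly, the RMM pool is enabled on every worker along these lines (pool size illustrative; client is assumed to be the dask.distributed Client connected to the dask-cuda cluster):

import rmm

def setup_rmm_pool():
    # Enable a pooled allocator so DeviceBuffer allocations don't hit cudaMalloc each time
    rmm.reinitialize(pool_allocator=True, initial_pool_size=int(30 * 2**30))

client.run(setup_rmm_pool)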
Merge Code:
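And the timed merge is essentially of this shape (row counts illustrative, using the hypothetical helper sketched above):

from dask.distributed import wait

# Two distributed frames created with the helper above
left = make_random_frame(200_000_000).persist()
right = make_random_frame(200_000_000).persist()
wait([left, right])

# The merge is the step whose wall time differs between TCP and UCX
merged = left.merge(right, on=["key"]).persist()
wait(merged)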
The slowdown happens in the merge step.
Additional Context:
There has been discussion about this on our internal Slack channel; please see there for more context.