rapidsai / ucx-py

Python bindings for UCX
https://ucx-py.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Dask-cudf multi partition merge slows down with `ucx` #402

Closed: VibhuJawa closed this issue 3 years ago

VibhuJawa commented 4 years ago

Dask-cudf multi-partition merge seems to slow down with ucx.

Wall time: 15.4 s on tcp vs. 37.8 s on ucx (exp-01).

In the attached example we see a slowdown with ucx vs. just using tcp.

Wall times on exp-01

UCX times:

CPU times: user 19.3 s, sys: 1.97 s, total: 21.2 s
Wall time: 38.4 s
2945293

CPU times: user 16.7 s, sys: 1.71 s, total: 18.4 s
Wall time: 37.8 s
2943379

TCP times:

CPU times: user 10.8 s, sys: 815 ms, total: 11.6 s
Wall time: 15.7 s
2944022

CPU times: user 10.9 s, sys: 807 ms, total: 11.7 s
Wall time: 15.4 s
2943697

(The trailing numbers are the len(merged_df) output from each %time run.)

Repro Code:

Helper function to create a distributed dask-cudf frame:


import os
import time

import cudf
import dask_cudf
import dask.array as da
import dask.dataframe as dd

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
from dask.utils import parse_bytes


def create_random_data(n_rows=1_000, n_parts=10, n_keys_index_1=100_000,
                       n_keys_index_2=100, n_keys_index_3=100, col_prefix='a'):

    chunks = n_rows // n_parts

    df = dd.concat([
        da.random.random(n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_non_merge_1'),
        da.random.random(n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_non_merge_2'),
        da.random.random(n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_non_merge_3'),
        da.random.randint(0, n_keys_index_1, size=n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_0'),
        da.random.randint(0, n_keys_index_2, size=n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_1'),
        da.random.randint(0, n_keys_index_3, size=n_rows, chunks=chunks).to_dask_dataframe(columns=col_prefix + '_2'),
    ], axis=1).persist()

    gdf = df.map_partitions(cudf.from_pandas)
    gdf = gdf.persist()
    _ = wait(gdf)
    return gdf
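
The repro assumes a cluster and a client object already exist; that setup isn't shown in the issue. A minimal sketch of what it could look like (the protocol/NVLink arguments here are assumptions, not taken from the original report):

# Hypothetical cluster/client setup assumed by setup_rmm_pool(client) below;
# the exact arguments were not part of the original report.
cluster = LocalCUDACluster(
    protocol="ucx",       # use protocol="tcp" to reproduce the TCP timings
    enable_nvlink=True,   # assumption: NVLink enabled for the UCX runs
)
client = Client(cluster)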

RMM Setup:

def setup_rmm_pool(client):
    client.run(
        cudf.set_allocator,
        pool=True,
        initial_pool_size= parse_bytes("26GB"),
        allocator="default"
    )
    return None

setup_rmm_pool(client)

Merge Code:

The slow down happens on the merge step.

rows_1, parts_1 = 140_176_770, 245
rows_2, parts_2 = 21_004_393, 171

df_1 = create_random_data(n_rows=rows_1, n_parts=parts_1, col_prefix='a')
df_2 = create_random_data(n_rows=rows_2, n_parts=parts_2, col_prefix='b')

merged_df = df_1.merge(df_2, left_on=['a_0', 'a_1', 'a_2'], right_on=['b_0', 'b_1', 'b_2'])
%time len(merged_df)

Additional Context:

There has been discussion about this on our internal Slack channel; please see that thread for more context.

harrism commented 4 years ago

Any chance of seeing where on that timeline you see the slowdown?

jakirkham commented 4 years ago

My guess (though I will try to back this up with data) is that it lines up with where we see the very large spikes. The first half is mostly loading data and the second half is performing the merge. This is also where all of the data transfer is happening.

jakirkham commented 4 years ago

There are workers that see a few larger allocations, but this seems to be a rare occurrence.

Allocations (#) vs. Size (bytes)

Alive Allocations (#) vs. Time (Operations)

jakirkham commented 4 years ago

Looking over all of these, the mean is ~0.5 MB, the standard deviation ~1 MB, the median ~0.25 MB, and the max ~5 MB. The mean and standard deviation seem to be pulled up by these infrequent, large allocations.

harrism commented 4 years ago

Is this example representative of important workloads? I am surprised that the largest allocation in a big data application is 5MB. What is the peak memory usage in bytes?

harrism commented 4 years ago

For the last histogram, I'd like to see a more detailed breakdown of that bottom bin. Is it mostly <1KB allocations like in the first histogram? Or are they closer to 500KB?

pentschev commented 4 years ago

Is this example representative of important workloads? I am surprised that the largest allocation in a big data application is 5MB.

It is representative of this particular problem, but granted, this is a small-scale dataset. I'm hoping we'll see larger allocations for larger problems. However, there are a bunch of small allocations regardless, which probably happen in CUB when it does some DtoH<->HtoD transfers, likely moving some 32-bit counters or something similar.

pentschev commented 4 years ago

@jakirkham you asked for reports from Azure, here they are:

TCP
UCX + NVLink
UCX + NVLink + InfiniBand (without nv_peer_mem)

jakirkham commented 4 years ago

It looks like UCX + NVLink + InfiniBand is a bit faster, but is experiencing the memory allocation problem much worse than UCX + NVLink. It seems memory allocation is taking roughly twice as much time in the former as in the latter, though I would expect both to perform the same number of allocations. Why should one take so much longer? 🤔

kkraus14 commented 4 years ago

It looks like UCX + NVLink + InfiniBand is a bit faster, but is experiencing the memory allocation problem much worse than UCX + NVLink. It seems memory allocation is taking roughly twice as much time in the former than the latter. Though I would expect both perform the same number of allocations. Why should one take so much longer? 🤔

At this point I think we should start getting some NSight profiles and looking at them to see where we're spending the time. If we need to add additional nvtx ranges into cuDF / RMM we can.
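
For reference, a rough sketch of what adding such a range could look like on the client side, using the nvtx Python package (the range name and file names below are made up; ranges inside cuDF/RMM themselves would still need to be added in C++):

import nvtx

# Wrap the suspect step in an NVTX range so it shows up on the Nsight Systems
# timeline; the range name here is illustrative only.
with nvtx.annotate("dask-cudf merge", color="green"):
    merged_df = df_1.merge(df_2, left_on=['a_0', 'a_1', 'a_2'],
                           right_on=['b_0', 'b_1', 'b_2'])
    n_rows = len(merged_df)

# Then capture a timeline with something like:
#   nsys profile -o merge_profile python repro.py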

Akshay-Venkatesh commented 4 years ago

Keith,

Before you spend more time on this, I found the cause of the slowdown with nv_peer_mem enabled -- repeated IB registrations. I've shared a patch that avoids the repeated registrations. With this patch, Peter is able to see the slowdown go away. I think Peter will add more details here soon.


harrism commented 4 years ago

Which slowdown? The overall reason for this issue? Or the slowdown of UCX + NVLink + IB vs. UCX + NVLink?

pentschev commented 4 years ago

I think there is some confusion; let me attempt to clarify:

  1. The overall slowdown for this issue refers to UCX being generally slower (independent of transport) than TCP via Python sockets -- this remains an issue that is possibly on the Python side but still unconfirmed, and thus we have people working on many fronts;
  2. The patch Akshay is referring to improves InfiniBand specifically, which is important for us, but is not the main topic of this issue.

Finally, there has been a general trend of assuming that a large part of the slowdown comes from memory allocation. While that may be true, I think we need to be extra careful when interpreting our profile reports, as all of them are limited in not including all the information we need (i.e., we don't have any profile that includes C++, CUDA and Python information all at once). For instance, I think one such assumption comes from the fact that rmm.DeviceBuffer appears as one of the most time-consuming operations, and that is a call coming from distributed.protocol.ucx.read, which hints that it's not only the allocation that takes place but the copy as well, and the copy may include additional overhead, such as (but not limited to) cuIpcOpenMemHandle if that's the first time a worker is reading another worker's memory pool data.

While I'm not saying the above is actually happening, we need to be careful to identify exactly what's going on; since we don't have the full picture yet, we may be looking at the wrong piece of the puzzle.

Akshay-Venkatesh commented 4 years ago

Thanks much for the summary, @pentschev

jakirkham commented 4 years ago

Have rerun using the same setup as before.

One small tweak was to bump up the RMM pool by about 2GB (so 28GB total).

The main difference here is that TCP now avoids pickling the data, as it uses the "dask" serializers, which are basically the same as the "ucx" serializers. The "dask" serializers incur a penalty from moving the data to/from host memory before/after transmission over TCP (UCX should avoid this data movement), though this penalty on TCP was already implicitly present before due to pickling (which did additional work after moving the data to host).
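
As a rough way to see the two serializer paths side by side, here is a minimal sketch using distributed.protocol.serialize directly; it assumes cudf's "cuda"/"dask" serializer registrations are in place (importing dask_cudf normally takes care of that):

import numpy as np
import cudf
import dask_cudf  # noqa: F401  (assumption: importing this registers cudf's serializers)
from distributed.protocol import serialize

gdf = cudf.DataFrame({"a": np.arange(1_000_000)})

# "cuda" serializers keep the frames in device memory (what the UCX path sends) ...
header_cuda, frames_cuda = serialize(gdf, serializers=["cuda"])
# ... while "dask" serializers produce host-memory frames (what the TCP path now sends),
# paying a device-to-host copy before the bytes ever reach the socket.
header_dask, frames_dask = serialize(gdf, serializers=["dask"])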


Here are the overall runtimes extracted from the profile plots:

Machine Protocol Duration (s)
DGX-1 TCP 61.68
DGX-1 UCX 71.33
DGX-2 TCP 29.94
DGX-2 UCX 29.81

Here are the profiles I got for the DGX-1 and DGX-2 using TCP and UCX:

DGX-1 TCP Profile

DGX-1 UCX Profile

DGX-2 TCP Profile

DGX-2 UCX Profile

VibhuJawa commented 4 years ago

One small tweak was to bump up the RMM pool by about 2GB (so 28GB total).

Machine Protocol Duration (s)
DGX-1 TCP 61.68
DGX-1 UCX 71.33
DGX-2 TCP 29.94
DGX-2 UCX 29.81

From the issue description this seems like a slowdown for TCP as well, no? See the timings in the issue from when I tested this, which show TCP used to take 15.7 s on a DGX-2.

jakirkham commented 4 years ago

It's worth noting those numbers come from the Dask performance reports themselves, not %time, so I'm not exactly sure how they relate.
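
One way to make the two measurements line up would be to scope the performance report to exactly the statement being wall-timed; a minimal sketch (the report filename is arbitrary):

import time
from dask.distributed import performance_report

# Scope the Dask performance report to the same window as the wall-clock timing,
# so the report duration and the measured time cover the same work.
with performance_report(filename="merge-only.html"):
    start = time.perf_counter()
    n_rows = len(merged_df)            # triggers the actual merge computation
    elapsed = time.perf_counter() - start

print(f"merge wall time: {elapsed:.1f} s, rows: {n_rows}")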

VibhuJawa commented 4 years ago

So the previous reports did include the creation time; the updated times for just the merge are:

DGX-2 TCP : 14.7 s

DGX-2 UCX : 18.6 s

TCP Profile

UCX Profile

With UCX_RNDV_THRESH=8192 set, the time is brought down to 14.1 s.

UCX Profile 8192
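
For anyone reproducing this, the threshold has to be visible to UCX before the workers start; a minimal sketch, assuming a LocalCUDACluster launched from the same process (so the worker processes inherit the parent environment):

import os

# Set the UCX rendezvous threshold before creating the cluster so the workers
# inherit it; in other deployments it would need to be exported in each
# worker's environment instead.
os.environ["UCX_RNDV_THRESH"] = "8192"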

jakirkham commented 4 years ago

So there may be issues from my build or my environment generally. It would be good to get others to replicate this and see what they get. However this is what I'm seeing using the suballocator work on a DGX-1. This uses UCX_RNDV_THRESH=8192.

Machine Protocol Duration (s)
DGX-1 TCP 32
DGX-1 UCX 37.5

TCP Profile

UCX Profile

VibhuJawa commented 4 years ago

So there may be issues from my build or my environment generally. It would be good to get others to replicate this and see what they get. However this is what I'm seeing using the suballocator work on a DGX-1. This uses UCX_RNDV_THRESH=8192.

Machine Protocol Duration (s)
DGX-1 TCP 32
DGX-1 UCX 37.5

The timings here include both creation and merging. Can you try to profile the creation and the merge separately? The sub-allocator was supposed to help with the memory allocations happening during the merge, so that would provide context for it.
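
For example, timing the two phases separately could look like this sketch (reusing the names from the repro code at the top of the issue):

import time

t0 = time.perf_counter()
df_1 = create_random_data(n_rows=rows_1, n_parts=parts_1, col_prefix='a')
df_2 = create_random_data(n_rows=rows_2, n_parts=parts_2, col_prefix='b')
t1 = time.perf_counter()   # creation (persist + wait) finishes here

merged_df = df_1.merge(df_2, left_on=['a_0', 'a_1', 'a_2'],
                       right_on=['b_0', 'b_1', 'b_2'])
n_rows = len(merged_df)    # forces the merge to run
t2 = time.perf_counter()

print(f"creation: {t1 - t0:.1f} s, merge: {t2 - t1:.1f} s")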

jakirkham commented 4 years ago

I'm going to investigate and retest today. There still seem to be some weird things here that I've seen in previous runs (despite all new packages). Maybe it's time to purge Miniconda and reinstall everything from scratch?

pentschev commented 3 years ago

I checked how things are looking currently, and below are the results I got with RAPIDS 0.20 and UCX 1.9 on a DGX-2:

TCP:

%time len(merged_df)
CPU times: user 8.22 s, sys: 576 ms, total: 8.8 s
Wall time: 35.2 s

%time len(merged_df)
CPU times: user 8.26 s, sys: 924 ms, total: 9.18 s
Wall time: 35.3 s

UCX + NVLink:

%time len(merged_df)
CPU times: user 8.04 s, sys: 975 ms, total: 9.01 s
Wall time: 21 s

%time len(merged_df)
CPU times: user 7.86 s, sys: 857 ms, total: 8.72 s
Wall time: 20.4 s

It seems like there has been a regression in performance with TCP, and UCX is outperforming TCP now (as we would expect). From the UCX-Py side, I think the original issue here has been resolved over the past year; do you think we can close this, @VibhuJawa? It may be worth investigating the potential regression in this merge workflow, but at this point I don't think there's anything to be done on the UCX front.

VibhuJawa commented 3 years ago

Thanks for following up on this, Peter. Agreed that we should close this on the UCX-Py side, as there is no action item left on that front.

It might be worth exploring the TCP slowdown though, as the 2x slowdown seems concerning (CC: @randerzander / @beckernick).