Any chance of seeing where on that timeline you see the slowdown?
My guess (though will try to back this up with data) is it lines up with where we see the very large spikes. The first half is mostly loading data and the second half is performing the merge. This is also where all of the data transfer is happening.
There are workers that see a few larger allocations, but this seems to be a rare occurrence.
Looking over all of these, the mean is ~0.5 MB, the standard deviation ~1 MB, the median ~0.25 MB, and the max ~5 MB. The standard deviation and mean seem to be inflated by these infrequent, large allocations.
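For reference, a minimal sketch of how such a summary could be produced, assuming the per-allocation sizes (in bytes) have already been dumped to a text file, e.g. extracted from an RMM logging-allocator CSV; the file name and format are assumptions, not something from this thread:

```python
# Hedged sketch: summarize allocation sizes collected elsewhere.
# "allocation_sizes.txt" (one size in bytes per line) is a hypothetical dump.
import numpy as np

sizes = np.loadtxt("allocation_sizes.txt")

print(f"mean  : {sizes.mean() / 1e6:.2f} MB")
print(f"std   : {sizes.std() / 1e6:.2f} MB")
print(f"median: {np.median(sizes) / 1e6:.2f} MB")
print(f"max   : {sizes.max() / 1e6:.2f} MB")

# finer-grained look at the smallest allocations (everything under 1 MB)
small = sizes[sizes < 1e6]
counts, edges = np.histogram(small, bins=50)
```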
Is this example representative of important workloads? I am surprised that the largest allocation in a big data application is 5MB. What is the peak memory usage in bytes?
For the last histogram, I'd like to see a more detailed breakdown of that bottom bin. Is it mostly <1KB allocations like in the first histogram? Or are they closer to 500KB?
Is this example representative of important workloads? I am surprised that the largest allocation in a big data application is 5MB.
It is representative of this particular problem, but granted this is on a small-scale dataset. I'm hoping we'll see larger allocations for larger problems. However, there are a bunch of small allocations regardless, which probably happen in CUB when it's doing some DtoH<->HtoD transfers, likely moving some 32-bit counters or something similar.
@jakirkham you asked for reports from Azure, here they are:
TCP
UCX + NVLink
UCX + NVLink + InfiniBand (without nv_peer_mem)
It looks like UCX + NVLink + InfiniBand is a bit faster, but is experiencing the memory allocation problem much worse than UCX + NVLink. It seems memory allocation is taking roughly twice as much time in the former than the latter. Though I would expect both perform the same number of allocations. Why should one take so much longer? 🤔
At this point I think we should start getting some NSight profiles and looking at them to see where we're spending the time. If we need to add additional nvtx ranges into cuDF / RMM we can.
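For reference, a minimal sketch of adding such a range from the Python side (using the nvtx package) so a region shows up by name in an Nsight Systems timeline, e.g. one collected with `nsys profile -t cuda,nvtx python script.py`; the frames here are toy data, not the repro:

```python
# Hedged sketch: wrap the step of interest in an NVTX range so it appears as
# a named region in the Nsight Systems timeline.
import cudf
import nvtx

left = cudf.DataFrame({"key": [1, 2, 3], "x": [1.0, 2.0, 3.0]})
right = cudf.DataFrame({"key": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

with nvtx.annotate("merge", color="green"):
    merged = left.merge(right, on="key")
```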
Keith,
Before you spend more time on this, I found the cause for the slowdown with nv peer mem enabled -- repeated IB registrations. I've shared a patch which avoids repeated IB registrations. With this patch, Peter is able to see the slowdown go away. I think Peter will add more details here soon.
Which slowdown? The overall reason for this issue? Or the slowdown of UCX + NVLink + IB vs. UCX + NVLink?
I think there is some confusion, let me attempt to clarify:
- The overall slowdown for this issue refers to UCX being generally slower (independent of transport) than TCP via Python sockets -- this remains an issue that is possibly on the Python side but still unconfirmed, and thus we have people working on many fronts;
- The patch Akshay is referring to improves InfiniBand specifically, which is important for us, but is not the main topic of this issue.

Finally, there has been a general trend of assuming a large part of the slowdowns comes from memory allocation. While that may be true, I think we need to be extra careful in interpreting our profile reports, as all of them are limited in not including all the information we need (i.e., we don't have any profile that includes C++, CUDA and Python information all at once). For instance, I think one such assumption comes from the fact that rmm.DeviceBuffer appears as one of the most time-consuming operations, and that is a call coming from distributed.protocol.ucx.read, which hints that it's not only the allocation that takes place, but the copy as well, and the copy may include additional overhead, such as (but not limited to) cuIpcOpenMemHandle if that's the first time a worker is reading another worker's memory pool data. While I'm not saying the above is actually happening, we need to be careful to identify exactly what's going on; since we don't have the full picture yet, we may be looking at the wrong piece of the puzzle.
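As a rough illustration of the point above, here is a sketch of what a receive path like this looks like; this is not the actual distributed.protocol.ucx source, and `ep` stands in for a UCX-Py endpoint. The buffer allocation and the receive into it happen back to back, so Python-level profile time attributed to rmm.DeviceBuffer can include transfer (and first-time IPC-open) cost as well:

```python
# Hedged sketch only -- not the actual distributed/comm/ucx.py code.
import rmm

async def read_device_frame(ep, nbytes):
    buf = rmm.DeviceBuffer(size=nbytes)  # allocation (from the RMM pool, if one is set)
    await ep.recv(buf)                   # transfer into the buffer; may include
                                         # first-time IPC handle opening, etc.
    return buf
```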
Thanks much for the summary, @pentschev
Have rerun using the same setup as before.
One small tweak was to bump up the RMM pool by about 2GB (so 28GB total).
The main difference here is that TCP now avoids pickling the data as it uses the "dask" serializers, which are basically the same as the "ucx" serializers. The "dask" serializers incur a penalty due to moving the data to/from host memory before/after transmission by TCP (UCX should avoid this data movement). Though this penalty on TCP was already implicitly present before due to pickling (which did additional work after moving the data to host).
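As a rough illustration of that distinction, a sketch using distributed's serialization API; the family names here ("dask" and "cuda") are what I'd expect the TCP and UCX paths to use, and the exact registration details depend on the cudf/distributed versions:

```python
# Hedged sketch of the "host frames vs. device frames" distinction described above.
import cudf
import dask_cudf  # importing this should register cudf's dask/cuda serializers (assumption)
from distributed.protocol import serialize

gdf = cudf.DataFrame({"a": [1, 2, 3]})

# "dask" serializers: frames are host (CPU) buffers, i.e. a device->host copy
# happens before the bytes ever reach the TCP socket.
header_host, frames_host = serialize(gdf, serializers=["dask"])

# "cuda" serializers (what the UCX path can use): frames remain device buffers,
# so NVLink/InfiniBand transports can send them without the host round trip.
header_dev, frames_dev = serialize(gdf, serializers=["cuda"])
```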
Here are the overall runtimes extracted from the profile plots:
Machine | Protocol | Duration (s) |
---|---|---|
DGX-1 | TCP | 61.68 |
DGX-1 | UCX | 71.33 |
DGX-2 | TCP | 29.94 |
DGX-2 | UCX | 29.81 |
Here are the profiles I got for the DGX-1 and DGX-2 using TCP and UCX:
From the issue description this seems like a slowdown for TCP now compared to when I tested it, no? See the timing in the issue, which states TCP used to take 15.7 s on dgx-2.
It's worth noting those numbers come from the Dask performance reports themselves, not %time, so I'm not exactly sure how they relate.
So the previous reports did include the creation time; updated times for the merge are:
DGX-2 TCP: 14.7 s
DGX-2 UCX: 18.6 s
With UCX_RNDV_THRESH=8192 set, the time is brought down to 14.1 s.
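For reference, a minimal sketch of applying that setting; the value 8192 is the one quoted above, and where exactly it pays off is system- and message-size-dependent:

```python
# Hedged sketch: UCX reads UCX_RNDV_THRESH from the environment, so it has to
# be set before the scheduler/workers (and their UCX contexts) start.
import os
os.environ["UCX_RNDV_THRESH"] = "8192"

# ...then create the cluster as usual, e.g. (arguments illustrative):
# from dask_cuda import LocalCUDACluster
# cluster = LocalCUDACluster(protocol="ucx", enable_nvlink=True)
```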
So there may be issues with my build or my environment generally. It would be good to get others to replicate this and see what they get. However, this is what I'm seeing using the suballocator work on a DGX-1. This uses UCX_RNDV_THRESH=8192.
Machine | Protocol | Duration (s) |
---|---|---|
DGX-1 | TCP | 32 |
DGX-1 | UCX | 37.5 |
So the timings here include both creation and merging. Can you try to profile the creation and the merge separately? The sub-allocator was supposed to help with the memory allocations that were happening during the merge, so separate timings would provide context for that.
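A minimal sketch of splitting the two phases into separate Dask performance reports; the scheduler address, frame construction, sizes, and column names are illustrative, not the original helper:

```python
# Hedged sketch: one report for creation, one for the merge, so the
# sub-allocator's effect on the merge can be judged in isolation.
import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client, wait, performance_report

client = Client("scheduler-address:8786")  # hypothetical: an existing dask-cuda cluster

def make_frame(n_rows):
    return cudf.DataFrame({
        "key": np.random.randint(0, n_rows, size=n_rows),
        "payload": np.random.rand(n_rows),
    })

with performance_report(filename="creation.html"):
    left = dask_cudf.from_cudf(make_frame(1_000_000), npartitions=8).persist()
    right = dask_cudf.from_cudf(make_frame(1_000_000), npartitions=8).persist()
    wait([left, right])

with performance_report(filename="merge.html"):
    merged = left.merge(right, on="key").persist()
    wait(merged)
```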
I'm going to investigate and retest today. There still seem to be some weird things here that I've seen in previous runs (despite all new packages). Maybe it's time to purge Miniconda and reinstall everything from scratch?
I checked how things are looking currently, and below are the results I got with RAPIDS 0.20 and UCX 1.9 on a DGX-2:
TCP:
%time len(merged_df)
CPU times: user 8.22 s, sys: 576 ms, total: 8.8 s
Wall time: 35.2 s
%time len(merged_df)
CPU times: user 8.26 s, sys: 924 ms, total: 9.18 s
Wall time: 35.3 s
UCX + NVLink:
%time len(merged_df)
CPU times: user 8.04 s, sys: 975 ms, total: 9.01 s
Wall time: 21 s
%time len(merged_df)
CPU times: user 7.86 s, sys: 857 ms, total: 8.72 s
Wall time: 20.4 s
It seems like there has been a regression in performance with TCP, and UCX is outperforming TCP now (as we would expect). From the UCX-Py side, I think the original issue here has been resolved over the past year; do you think we can close this, @VibhuJawa? It may be worth investigating the potential regression in this merge workflow, but at this point I don't think there's anything to be done on the UCX front.
Thanks for following up on this, Peter. Agree that we should close this on the UCX-Py side as there is no action item left on that front.
It might be worth exploring the TCP slowdown though, as the 2x slowdown seems concerning (CC: @randerzander / @beckernick).
Dask-cudf multi partition merge slows down with ucx.
Dask-cudf merge seems to slow down with ucx. Wall time: (15.4 seconds on tcp) vs (37.8 s on ucx) (exp-01). In the attached example we see a slowdown with ucx vs just using tcp.
Wall Times on exp-01
UCX Time
TCP times
Repro Code:
Helper Function to create distributed dask-cudf frame
RMM Setup:
Merge Code:
The slow down happens on the merge step.
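Since the original helper/RMM/merge code blocks did not survive here, below is a minimal sketch of the kind of setup being described (a dask-cuda cluster with an RMM pool, random distributed dask-cudf frames, then the merge); cluster arguments, sizes, and column names are illustrative and not the original repro:

```python
# Hedged sketch only -- not the original repro code.
import numpy as np
import cudf
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait

# cluster + RMM setup (switch protocol to "tcp" for the baseline run)
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
    rmm_pool_size="26GB",   # per-GPU RMM pool size is illustrative
)
client = Client(cluster)

# helper to create a distributed dask-cudf frame
def make_distributed_frame(n_rows, npartitions):
    gdf = cudf.DataFrame({
        "key": np.random.randint(0, n_rows, size=n_rows),
        "payload": np.random.rand(n_rows),
    })
    return dask_cudf.from_cudf(gdf, npartitions=npartitions).persist()

left = make_distributed_frame(10_000_000, npartitions=16)
right = make_distributed_frame(10_000_000, npartitions=16)
wait([left, right])

# merge code -- this is the step where the slowdown is observed
merged_df = left.merge(right, on="key")
len(merged_df)  # wrap in %time in a notebook to reproduce the wall times above
```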
Additional Context:
There has been discussion about this on our internal Slack channel; please see it for more context.