Closed VibhuJawa closed 3 years ago
Thanks Peter! No worries. It's better to be confident before it proliferates.
Well there's also still the overhead of the send/recv as Ben has pointed out as well. Have a few things in the works (most of which I think you've seen), which should help.
Added PR ( https://github.com/rapidsai/cudf/pull/4077 ) and PR ( https://github.com/dask/distributed/pull/3442 ), which should cut out the overhead involved in serializing as seen in the worker administrative profile earlier.
Did you happen to profile that? Could you share some numbers, preferably for this workflow?
Not yet no. We have been trying a few approaches for the cuDF change based on feedback, impact, etc. So probably a little early to profile, but agree that would be a good thing to do.
Should add that I'm guessing PR ( https://github.com/dask/distributed/pull/3453 ) will have more impact, but we will need to work on it a bit more before we have something usable. As mentioned earlier, the plan is to work with @madsbk on profiling this.
@jakirkham I would also add that if that PR turns out to be a nightmare, there are likely options at the cudf serialization level / the algorithm level where we can make things return one contiguous allocation that can be sent, instead of breaking things down to the Column / Buffer level like we currently do.
Sorry for not updating this yet @kkraus14.
I think PR ( https://github.com/rapidsai/cudf/pull/4101 ) might be more approachable near term. Just waiting for CI to report. If you have a chance to look, that would be great (no pressure though).
So about a week ago we had a meeting where people decided it would be interesting to make sure that we were using RMM in the ucx-py code. Did someone run that experiment? If so, what was the result? It seems like people have moved on from that, so I suspect there was a finding. I would be curious if anyone has results to share.
Previously we were seeing Numba popping up quite a bit in the worker administrative profile in Dask. Mainly this was related to grabbing the context. This happened during things like grabbing `__cuda_array_interface__` (with Distributed and UCX-Py) and serialization of cuDF data (as it creates Numba `DeviceNDArray`s internally to send). It also happened in spilling (due to costly creation and destruction), but was less obvious (at least from the profile).
We have since fixed the spilling issue by reading and writing directly to RMM `DeviceBuffer`s, which have much less overhead to work with ( https://github.com/rapidsai/dask-cuda/pull/235 ). This is now in dask-cuda nightlies. As to cuDF serialization, there is a PR to make cuDF `Buffer` objects themselves serializable ( https://github.com/rapidsai/cudf/pull/4101 ), which eliminates the remaining Numba overhead.
After quickly running some tests, Numba no longer shows up in the worker profile at all. This isn't to say there are no more improvements to be made here (I suspect there's still plenty of room to improve). That said, I figured this step forward was worth sharing. It would be good if people could start profiling with these changes and report back.
Edit: We've now extended these improvements to `StringColumn`s ( https://github.com/rapidsai/cudf/pull/4111 ).
I would love to see a performance report with the changes if anyone has access to a machine and the time to run the previous benchmark.
Yep that's a good idea @mrocklin. Just have been pushing in the last related fixes and getting some nightlies out to test with.
Have rerun on a DGX-1 with the latest nightlies and a ucx-py fix that @madsbk recently made (should produce nightlies soon). Also set `UCX_RNDV_THRESHOLD=8192` for the UCX case. Ran using a notebook included in the Gist (please point out any errors if you see them). Here are the profiles I got for TCP and UCX.
What about the time to execute the merge operation? I would be interested in seeing that before anything else.
Should add that when looking at the UCX profile, I'm seeing a lot of time spent in `.is_closed(...)`. Admittedly this might just be hiding other things happening at the C level, but I wanted to point it out in case there was something else we should be doing here.
> What about the time to execute the merge operation? I would be interested in seeing that before anything else.
What do you mean? The overall runtime? If so, UCX remains a lot slower than TCP. So then the question is, "why?"
Should add I ran this on a DGX-1 as the DGX-2s are pretty heavily occupied ATM.
> What do you mean? The overall runtime? If so, UCX remains a lot slower than TCP. So then the question is, "why?"
Yes, I mean the overall/wall time on a DGX-2. I think this will give us some insight on potential speedups achieved, and hopefully present no regressions on performance either. My current best time runs around 16 seconds on a DGX-2 with NVLink and `UCX_RNDV_THRESH=8192`, which is still around 1 second slower than TCP.
Really? Only 1 second slower? Maybe I'm doing something wrong. Do you mind looking at my notebook briefly?
Have rerun on a DGX-2 using the same setup as before.
Here are the overall runtimes (based on `%time` in the notebook):
Protocol | CPU time (user) | CPU time (sys) | CPU time (total) | Wall time |
---|---|---|---|---|
TCP | 11.7 s | 965 ms | 12.6 s | 14.6 s |
UCX | 4min 37s | 20.5 s | 4min 58s | 25min 5s |
Here are the profiles I got for TCP and UCX:
Running again, it's actually 2 seconds slower:
TCP Merge time: 14.165126085281372
UCX+NVLink Merge time Run 1: 16.27089762687683
UCX+NVLink Merge time Run 2: 15.756562232971191
UCX+NVLink Merge time Run 3: 16.542734384536743
Your code seems right, but 25 minutes as above seems completely off. I honestly don't know what could have happened, but I would say anything above 1 minute (and that's already a stretch) indicates something is wrong.
EDIT: Just as a reminder from https://github.com/rapidsai/ucx-py/issues/402#issuecomment-579986636, my previous best without `UCX_RNDV_THRESH=8192` was around 22 seconds.
Based on @pentschev's debugging internally, it appears my Conda environment was hosed. We're working on new reports.
As discussed offline, it seems that @jakirkham's environment has something off; I tested that same environment and can confirm UCX was taking absurdly long.
I now created a new environment as follows:
```shell
conda create -n rapids-nightly-0.13 -c rapidsai-nightly -c nvidia -c conda-forge -c defaults cudatoolkit=10.1 rapids=0.13 python=3.7
```
And for the first time I saw better results for UCX (with NVLink) compared to TCP:
TCP Merge time Run 1: 15.855406522750854
TCP Merge time Run 2: 15.03106141090393
TCP Merge time Run 3: 14.64222264289856
UCX+NVLink Merge time Run 1: 13.793559312820435
UCX+NVLink Merge time Run 2: 13.05189847946167
UCX+NVLink Merge time Run 3: 13.032773733139038
Here are Dask reports for that:
To me it looks like we're still spending a lot of time creating and destroying `rmm.DeviceBuffer`s in the UCX runs. Does that match your understanding as well? Did we ever try enabling RMM at the top of `ucp/__init__.py` to make sure that it is always active?
Does appear that way. Not sure about people's experience. Maybe others can comment?
@pentschev added an RMM plugin to dask-cuda, which I'm guessing he's using here (though I could be wrong). ( https://github.com/rapidsai/dask-cuda/pull/236 ) Even without that people have been pretty good about enabling RMM in all workflows. So don't think it is an issue of not having RMM enabled.
More likely (to parrot @kkraus14) this is showing us that RMM's pool allocator (CNMeM) is degrading in performance due to lots of allocations/deallocations. Here are the numbers to back that up.
Yes, we're always using the RMM pool. Without the RMM pool there are basically two scenarios:

- one with the CUDA IPC cache enabled (`UCX_CUDA_IPC_CACHE=y`); or
- one with it disabled (`UCX_CUDA_IPC_CACHE=n`).

> More likely (to parrot @kkraus14) this is showing us that RMM's pool allocator (CNMeM) is degrading in performance due to lots of allocations/deallocations. Here are the numbers to back that up.
Ah, I now see the reason to bulk allocate in the Dask UCX comm
That's one way to go about it at least. Another would be some sort of optimization at the graph level to avoid the need for as much communication to begin with ( https://github.com/dask/dask/issues/5809 ). Perhaps you have other ideas still?
I think that that optimization is generally a good idea, but that won't affect the particular profile results we're seeing here, right?
Honestly I haven't looked into that thread in any detail. So I can't say.
Anyways, open to other ideas if something occurs to you.
> I think that that optimization is generally a good idea, but that won't affect the particular profile results we're seeing here, right?
Yup, it really should not affect these profile results. That's just to prevent redundant columns being transferred, which is not the case here.
Sure though redundant columns being transferred would exacerbate the problem we are seeing here.
We can also get a sense that the memory pool is degrading in performance by looking at the first allocations made when copying in the data from host compared to later allocations during transfers.
Initial allocations:
Allocations for transfers:
The other option we have is to possibly dispatch, in the case of the shuffle functions / other memory-allocation-heavy functions, to cuDF `contiguous_*` APIs that return multiple DataFrames all backed by a single RMM allocation. It should relieve the allocation pressure on the sender, and then we can also look at optimizing the sends / receives to send entire DataFrames as one contiguous chunk of memory instead of going down inside of Columns.
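A toy illustration of the packing idea (this is not the cuDF `contiguous_*` API, just the concept): concatenate several buffers into one blob plus an offset table, so a whole frame could travel as a single allocation/message instead of one per column buffer.

```python
def pack(buffers):
    """Concatenate buffers; offsets[i] gives (start, length) of buffer i."""
    blob, offsets = bytearray(), []
    for buf in buffers:
        offsets.append((len(blob), len(buf)))
        blob += buf
    return bytes(blob), offsets

def unpack(blob, offsets):
    """Recover zero-copy views of the original buffers from the blob."""
    view = memoryview(blob)
    return [view[start:start + length] for start, length in offsets]

cols = [b"aaaa", b"bb", b"cccccc"]
blob, offsets = pack(cols)  # one contiguous allocation instead of three
assert [bytes(v) for v in unpack(blob, offsets)] == cols
```

The receiver only needs the blob and the small offset table to rebuild views of every buffer without further copies.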
Another option is to optimize RMM... cnmem uses a free list which requires linear search. I may be able to resurrect my prototype allocator which is easier to modify to use a set or other tree-based data structure.
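To illustrate why the data structure matters (this is a toy model, not RMM's or cnmem's actual implementation): a first-fit free list needs a linear scan per allocation, while keeping free blocks sorted by size lets a fit be found with binary search.

```python
import bisect

class LinearFreeList:
    """First-fit over an unsorted free list: O(n) scan per allocation."""
    def __init__(self, sizes):
        self.free = list(sizes)

    def alloc(self, size):
        for i, blk in enumerate(self.free):  # linear search for a fit
            if blk >= size:
                return self.free.pop(i)
        raise MemoryError(size)

class SortedFreeList:
    """Best-fit over a size-sorted list: O(log n) search per allocation."""
    def __init__(self, sizes):
        self.free = sorted(sizes)

    def alloc(self, size):
        i = bisect.bisect_left(self.free, size)  # smallest block >= size
        if i == len(self.free):
            raise MemoryError(size)
        return self.free.pop(i)

blocks = [256, 64, 1024, 128]
assert LinearFreeList(blocks).alloc(100) == 256  # first block that fits
assert SortedFreeList(blocks).alloc(100) == 128  # smallest block that fits
```

A real allocator would also need coalescing and per-stream bookkeeping; the point here is only the search cost per allocation.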
Yep these are all good ideas.
Using bigger allocations less frequently generally seems helpful (where possible). Perhaps we will see the downside of that if we go too far (as Keith mentioned to me offline)?
Improving the allocator's performance also seems really useful.
While I think our working theory is reasonable, I'd like us to consider a few other possible theories/poke holes in our existing one (we may rule them out quickly):
If 1 is true, we also have a lot of allocations happening when we restore spilled data, which could also degrade RMM performance over time. Is this happening or not? What confirms or denies that hypothesis?
If 2 is true, we could get a lot of allocations due to trying to receive data and failing for some reason only to try again with a fresh allocation. Is there any indication we are seeing this?
Finally, what other things might be causing frequent allocations/deallocations? The answer might be none, but it's worth pausing for a moment and making sure this is true.
For my efforts on RMM, can you help me get some data about the allocations?
Any or all of the above will be helpful. Total number is most important and probably easiest for you to provide.
Thanks!
1) Do you want the number of allocations / deallocations per process or total across all the processes? I'd assume the former but want to be sure.
2) This may be a bit difficult to get as Python doesn't track the allocations anywhere outside of normal reference counting.
3) Generally random with a skew towards FIFO. In general many of the allocations are likely temporaries that are freed relatively quickly, but things can stay alive arbitrarily because of the nature of Python + reference counting.
For a process where you think RMM allocation / free is expensive, use RMM logging. You need to set this variable to true ( https://github.com/harrism/rmm/blob/153224aa51da27fc1d6478a8997d31d2a5d9e48a/include/rmm/rmm.hpp#L55 ), recompile, and then enable logging in the RMM initialization. Then you would need to call `rmm.csv_log()` (you can do that from a notebook and then analyze in Pandas/cuDF if you want).
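Once the CSV log is in hand, the analysis could look something like this. The column names and layout below are assumptions for illustration; check the actual RMM log for the real format.

```python
import csv
import io
import statistics

# Hypothetical log excerpt -- the real RMM CSV log's columns may differ.
log = io.StringIO(
    "event,size,timestamp\n"
    "allocate,1024,0.001\n"
    "allocate,2048,0.002\n"
    "free,1024,0.003\n"
    "allocate,4096,0.004\n"
)
rows = list(csv.DictReader(log))

# Pull out allocation sizes and summarize them.
alloc_sizes = [int(r["size"]) for r in rows if r["event"] == "allocate"]
mean_alloc = statistics.mean(alloc_sizes)
assert len(alloc_sizes) == 3
```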
I will just go with random.
I have a simple benchmark that allocates N random blocks from 1 to k bytes, freeing with a certain probability at each iteration, or when the maximum memory size is reached. I can use this to profile and optimize, I think. With 100,000 allocations of at most 2MB and a max allocated size of 87% of 16GB, it takes 30s with cnmem. If you think this is a sufficiently close comparison, I'll just use it. Or you can give me different parameters.
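For reference, the bookkeeping of such a benchmark might look roughly like this. This is a pure-Python sketch that only models allocation sizes under a memory cap; no device memory and no cnmem are involved, and the parameters are placeholders.

```python
import random

def run(n=10_000, k=2 << 20, cap=1 << 30, p=0.5, seed=0):
    """Allocate n random blocks of 1..k bytes, freeing a random live
    block with probability p each step, or whenever cap would be exceeded."""
    rng = random.Random(seed)
    live, used, peak = [], 0, 0
    for _ in range(n):
        size = rng.randint(1, k)
        # free a random live block with probability p ...
        if live and rng.random() < p:
            used -= live.pop(rng.randrange(len(live)))
        # ... and keep freeing while the cap would be exceeded
        while live and used + size > cap:
            used -= live.pop(rng.randrange(len(live)))
        live.append(size)
        used += size
        peak = max(peak, used)
    return used, peak

used, peak = run()
assert 0 < used <= peak <= 1 << 30
```

Timing this loop against different allocator models would give a rough sense of how allocation count and liveness patterns affect cost.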
That sounds good to me. Additionally, I wrote a quick Python script with much more synthetic behavior ( https://github.com/rapidsai/ucx-py/issues/402#issuecomment-580776895 ), which could be useful for seeing performance across allocations versus frees as well.
As we are always using `DeviceBuffer`, I think we can wrap it with an object that returns info about the allocation that occurred. We could also store timestamps on each allocation/deallocation. That should give us plenty of detail. I'll run through this workflow tomorrow and update with what I find. Please let me know if you see any issues with this or if I've left anything out.
RMM logging will give you the sizes, device IDs, pointers, timestamps, and the source (file and line) of the call (probably all would be DeviceBuffer since it only goes one level up the call stack), but your own logging can give you more context if it helps, I guess.
Thanks for that info. Yeah this is pretty close to what I'd want.
We'd want to get some more context about what happens in Python (like line numbers in Python files). In particular, I'm hoping to discover how much was due to things like allocating buffers for receiving data vs. spilling.
> For my efforts on RMM, can you help me get some data about the allocations?
> - Total number of allocations / deallocations
At least for the MRE given above we are looking at 298948 allocations and deallocations.
> - Size distribution: min, max, mean
Interestingly, there are a lot of 0-size allocations (as discussed offline). So that's the min. Of the non-zero values, though, the smallest is 4 (somewhat surprising, to me at least).

The max is 4776520.

The mean including 0s is 363015 (rounded) and without 0s is 471108 (so about ~10% of the max).

The standard deviation including 0s is 864456 (rounded) and without 0s is 958580 (so about ~20% of the max).
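As a quick illustration of how those summaries are produced (with made-up toy sizes, not the real log), split out the zero-size allocations before computing the non-zero statistics:

```python
import statistics

sizes = [0, 0, 4, 100, 250, 4000]  # toy stand-in for the logged sizes
nonzero = [s for s in sizes if s > 0]

summary = {
    "count": len(sizes),
    "zeros": sizes.count(0),
    "min_nonzero": min(nonzero),
    "max": max(sizes),
    "mean_all": statistics.mean(sizes),
    "mean_nonzero": statistics.mean(nonzero),
    "stdev_nonzero": statistics.stdev(nonzero),
}
assert summary["min_nonzero"] == 4 and summary["max"] == 4000
```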
> - Any ordering information you can provide, e.g. FIFO, LIFO, completely random, etc.
Will see if I can extract this from the data, but I think Keith is right that this has a FIFO skew.
> At least for the MRE given above we are looking at 298948 allocations and deallocations.
Do you happen to have any info about what the largest number of allocations we had alive at any given time is? I think that's important as well because if we're just doing 1000 allocations, 1000 deallocations repeatedly we wouldn't have performance problems.
Thanks for this. My random benchmark isn't too far off. But answering Keith's question would really help.
Certainly, I'll poke at the time info next.
Not time. Maximum number of active allocations.
Adding some plots below to hopefully give more context. This is for one worker, but other workers look similar.
The first plot shows a histogram of the number of allocations for a particular number of bytes. The second plot shows how many allocations are alive across "time steps" (each step being when an allocation occurs).
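For reference, the live-allocation curve in that second plot (and the maximum number of active allocations @kkraus14 asked about) can be derived from an alloc/free event log like this (a sketch on a made-up event list):

```python
# Walk the alloc/free events keeping a running count of live buffers;
# the running maximum is the peak number of active allocations.
events = ["alloc", "alloc", "free", "alloc", "alloc", "free", "free"]

live = peak = 0
curve = []  # live count after each event, i.e. the plotted series
for ev in events:
    live += 1 if ev == "alloc" else -1
    peak = max(peak, live)
    curve.append(live)

print(peak)  # -> 3
```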
This looks like we're keeping the number of allocations to a very reasonable amount.
Dask-cudf multi partition merge slows down with `ucx`.

Dask-cudf merge seems to slow down with `ucx`. Wall time: (15.4 seconds on tcp) vs (37.8 s on ucx) (exp-01)

In the attached example we see a slow down with `ucx` vs just using `tcp`.
.Wall Times on exp-01
UCX Time
TCP times
Repro Code:
Helper Function to create distributed dask-cudf frame
RMM Setup:
Merge Code:
The slow down happens on the merge step.
Additional Context:
There has been discussion about this on our internal Slack channel; please see that for more context.