abellina opened this issue 1 year ago
The new design for cuda::mr::memory_resource, which expands support in RMM to include pinned host memory pools, is tracked in https://github.com/rapidsai/rmm/pull/1095.
Note that cuda::mr::memory_resource doesn't directly expand support to include pinned host memory pools; it just enables us to more easily reuse the implementation of stream-ordered device memory pools for other kinds of device-accessible memory.
@abellina It's really hard to tell in the image how much time is saved by this change. Can you provide comparable benchmark results?
Sure, this is NDS @ 3TB on our performance cluster; the benchmark was executed 3 times each for baseline and test. We see around a 6% improvement overall. I need to look at query95 more because the comparison found a regression there, but overall this was a win:
--------------------------------------------------------------------
Name = query2
Means = 3088.3333333333335, 2494.6666666666665
Time diff = 593.666666666667
Speedup = 1.2379743452699092
T-Test (test statistic, p value, df) = 7.8686774190832445, 0.0014098464938042464, 4.0
T-Test Confidence Interval = 384.1927204113131, 803.1406129220209
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query9
Means = 11011.666666666666, 8408.0
Time diff = 2603.666666666666
Speedup = 1.3096653980336188
T-Test (test statistic, p value, df) = 8.329919755519573, 0.0011349387959612939, 4.0
T-Test Confidence Interval = 1735.8386701912746, 3471.4946631420576
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query23_part2
Means = 23309.0, 22541.666666666668
Time diff = 767.3333333333321
Speedup = 1.0340406654343808
T-Test (test statistic, p value, df) = 3.5857045809141983, 0.023049979254582895, 4.0
T-Test Confidence Interval = 173.17984709011603, 1361.4868195765482
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query28
Means = 8749.666666666666, 7877.0
Time diff = 872.6666666666661
Speedup = 1.1107866785155092
T-Test (test statistic, p value, df) = 3.675737490963056, 0.021283509364083526, 4.0
T-Test Confidence Interval = 213.50341001605773, 1531.8299233172743
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query31
Means = 3846.3333333333335, 3217.6666666666665
Time diff = 628.666666666667
Speedup = 1.1953796747125247
T-Test (test statistic, p value, df) = 3.476522797723856, 0.02543224663133342, 4.0
T-Test Confidence Interval = 126.59646864855881, 1130.7368646847751
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query74
Means = 5813.333333333333, 5295.0
Time diff = 518.333333333333
Speedup = 1.0978910922253697
T-Test (test statistic, p value, df) = 4.145944413928419, 0.014307288000811205, 4.0
T-Test Confidence Interval = 171.21723564533386, 865.4494310213322
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query75
Means = 8002.0, 7139.0
Time diff = 863.0
Speedup = 1.120885278050147
T-Test (test statistic, p value, df) = 6.9462130566740035, 0.0022563913623439248, 4.0
T-Test Confidence Interval = 518.0534649259677, 1207.9465350740325
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query83
Means = 11889.0, 10728.333333333334
Time diff = 1160.666666666666
Speedup = 1.1081870436538759
T-Test (test statistic, p value, df) = 7.983215413800722, 0.0013345137828189723, 4.0
T-Test Confidence Interval = 757.0038418026304, 1564.3294915307017
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query88
Means = 8288.333333333334, 6815.666666666667
Time diff = 1472.666666666667
Speedup = 1.2160708172348023
T-Test (test statistic, p value, df) = 14.785515131088996, 0.00012180808800452909, 4.0
T-Test Confidence Interval = 1196.1272209995159, 1749.206112333818
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query95
Means = 8023.0, 9527.333333333334
Time diff = -1504.333333333334
Speedup = 0.8421034217339584
T-Test (test statistic, p value, df) = -5.098045536832821, 0.006992118575497059, 4.0
T-Test Confidence Interval = -2323.6078748695036, -685.0587917971645
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = benchmark
Means = 418333.3333333333, 394333.3333333333
Time diff = 24000.0
Speedup = 1.0608622147083686
T-Test (test statistic, p value, df) = 5.622255427989818, 0.004920931820960854, 4.0
T-Test Confidence Interval = 12148.051368670785, 35851.948631329215
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
So what are the other ones with speedups of 21%, 19%, etc.?
Sorry, I should describe the benchmark better. This is NDS (https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds), which we run in "power run" mode: the queries execute in series, one after the other, on a cluster running spark-rapids with 8x A100 GPUs. The results above show the queries with statistically significant regressions or speedups, plus a "benchmark" section at the end for the overall result (the sum of all query times, compared between baseline and test). I ran each set of tests three times, so the comparison tool looks at the means and the variance to figure out what is noise and what isn't.
Of the queries that show a significant speedup, query9 and query88 are two queries we know are parquet/scan bound. Looking at traces for these, you'll mostly find unsnap and parquet decode kernels. We see a lot of benefit here.
Is this helping with your question @harrism, or am I missing it?
Yes. So the 6% is overall average benefit. Good to know.
As of 24.02 you can create a pool_memory_resource<pinned_host_memory_resource>. I suggest that libcudf / cuIO add a cudf::get/set_current_pinned_host_memory_resource to use for this. This could also be added to RMM, but:
a) The static resource RMM currently stores for get/set_current_device_resource is one of the reasons RMM is not Windows compatible, so I'm averse to adding a second instance of that. libcudf, on the other hand, has a binary library component, so it can maintain that state without affecting platform compatibility.
b) I think it's good to try out this feature in libcudf and then move it lower in the stack later if it would be useful to other clients.
There is some design work around how and where to put this, how to wire up configuration knobs for the initial / maximum pool size, and how to expose that to Python and Spark.
There are other places where this pool could be useful, but let's start here.
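For concreteness, a minimal sketch of building such a pinned host pool with RMM 24.02+ (header paths and constructor arguments may vary slightly across RMM versions, and the pool sizes are arbitrary placeholders):

```cpp
// Sketch: a pool of pinned host memory built from stock RMM components (24.02+).
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

int main()
{
  rmm::mr::pinned_host_memory_resource pinned_mr;  // upstream, backed by cudaHostAlloc

  rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource> pinned_pool{
    &pinned_mr,
    std::size_t{256} << 20,  // initial pool size: 256 MiB (placeholder)
    std::size_t{1} << 30};   // maximum pool size: 1 GiB (placeholder)

  // Sub-allocations now come out of the pinned pool instead of individual
  // cudaHostAlloc / cudaFreeHost calls.
  void* buf = pinned_pool.allocate(std::size_t{1} << 20);
  pinned_pool.deallocate(buf, std::size_t{1} << 20);
  return 0;
}
```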
Thank you @harrism for kicking off the design discussion.
a) Would you please clarify the Windows compatibility issue? Is it the presence of a static object with certain properties in a header, or some other issue?
b) If libcudf added cudf::get/set_current_pinned_host_memory_resource, would you please share a bit more about the components in RMM this would be modeled after? I'm not familiar enough with RMM to understand the scope of this suggestion.
Thank you for your help
This issue describes a): https://github.com/rapidsai/rmm/issues/826
Basically, the function-local static works on Linux but not on Windows, because each binary gets its own instance, so they won't be the same across DLLs.
Although I suppose adding a duplicate of the incompatibility reason doesn't make it more incompatible; it's just more of the same.
I still think it should be tested in libcudf first.
For b), the answer is to model it after get/set_current_device_resource(), which is the function with the function-local static.
But mostly I was suggesting that libcudf should implement it the way libcudf wants it to be for libcudf.
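For reference, a minimal illustration of the function-local-static pattern under discussion and why it breaks across Windows DLLs. This is a sketch in the spirit of rmm::mr::get/set_current_device_resource, not RMM's actual implementation, and the names are placeholders:

```cpp
// Illustration only: a get/set pair backed by a function-local static. In a
// header-only library on Windows, each DLL that includes this gets its own
// copy of the static, so the "current" resource is no longer process-wide
// (see rapidsai/rmm#826).
#include <rmm/mr/device/device_memory_resource.hpp>

namespace sketch {

inline rmm::mr::device_memory_resource*& current_resource()
{
  static rmm::mr::device_memory_resource* resource{nullptr};  // one copy per DLL on Windows
  return resource;
}

inline rmm::mr::device_memory_resource* set_current_resource(
  rmm::mr::device_memory_resource* new_mr)
{
  auto* previous = current_resource();
  current_resource() = new_mr;  // real code would also need synchronization
  return previous;
}

}  // namespace sketch
```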
Thank you too!
I just discussed this idea a bit more with @abellina, @mattahrens, @GregoryKimball, and @vuule.
We're leaning towards adding a host_mr parameter to the Parquet reading APIs. We're leaning away from using a global/static host allocator because we want this behavior to be friendly and simple when dealing with multiple threads, as Spark does. Also, callers may want to use different host allocators.
The host_mr would be optional and would use a non-pooled allocator by default. This avoids some of the concerns about determining a default host pool size and how that would be handled by Python cudf and other consumers of the library. The host_mr should allow any of: a pinned host allocator (which is effectively the current behavior), a pooling pinned host allocator, or possibly even an adapter that implements an "opportunistically pinned pool" (Spark's desired use case; this would attempt to allocate pinned memory and fall back to paged memory rather than failing to allocate).
@abellina is going to investigate a bit further and pursue an implementation for review.
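For illustration, a rough sketch of the "opportunistically pinned" fallback idea. It does not implement the RMM / cuda::mr resource interfaces a real host_mr would need, and the class name is a placeholder:

```cpp
// Rough sketch of an "opportunistically pinned" host allocator: try pinned
// memory first and fall back to pageable memory instead of throwing.
#include <cuda_runtime_api.h>

#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <unordered_set>

class opportunistic_pinned_allocator {  // placeholder name
 public:
  void* allocate(std::size_t bytes)
  {
    void* ptr = nullptr;
    if (cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault) == cudaSuccess) {
      std::lock_guard<std::mutex> lock(mtx_);
      pinned_.insert(ptr);
      return ptr;
    }
    cudaGetLastError();        // clear the error left by the failed pinned allocation
    ptr = std::malloc(bytes);  // fall back to pageable host memory rather than failing
    if (ptr == nullptr) { throw std::bad_alloc{}; }
    return ptr;
  }

  void deallocate(void* ptr)
  {
    bool was_pinned = false;
    {
      std::lock_guard<std::mutex> lock(mtx_);
      was_pinned = pinned_.erase(ptr) > 0;
    }
    if (was_pinned) {
      cudaFreeHost(ptr);
    } else {
      std::free(ptr);
    }
  }

 private:
  std::mutex mtx_;
  std::unordered_set<void*> pinned_;  // tracks which pointers are actually pinned
};
```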
Sounds good. Because rmm::mr::pinned_host_memory_resource is not a device_memory_resource, you will need to use the new rmm::device_async_resource_ref for the host_mr parameter in the Parquet reading APIs. We will be transitioning to that in all of RAPIDS, and in the interim this will make the parameter compatible with both legacy device_memory_resource / host_memory_resource MRs AND the newer MRs like pinned_host_memory_resource that just implement the cuda::async_memory_resource policy.
For an example, see #14873
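As a sketch of what such a parameter could look like (the function and namespace names are placeholders, not the actual libcudf API, and the resource_ref header path may vary by RMM version):

```cpp
// Hypothetical shape of a reader entry point whose host_mr is typed as
// rmm::device_async_resource_ref. The ref binds to legacy
// device_memory_resource-derived MRs as well as newer resources such as
// rmm::mr::pinned_host_memory_resource.
#include <rmm/mr/pinned_host_memory_resource.hpp>
#include <rmm/resource_ref.hpp>

namespace cudf_sketch {  // placeholder namespace, not the real libcudf API

void read_parquet_with_host_mr(/* reader options ..., */
                               rmm::device_async_resource_ref host_mr)
{
  // ... the reader would allocate its pinned staging buffers through host_mr ...
  (void)host_mr;
}

}  // namespace cudf_sketch

void example()
{
  rmm::mr::pinned_host_memory_resource pinned_mr;
  cudf_sketch::read_parquet_with_host_mr(pinned_mr);  // implicit conversion to the ref
}
```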
One thing to be aware of for deciding your initial pool size is that if you make it too small, you will get unnecessary fragmentation. The reason is that the RMM pool MR grows by just allocating a new chunk. This isn't TOO bad because it uses a geometric growth strategy, but allocations that don't fit will cause new allocations that can't be merged with the previous pool chunks, hence fragmentation.
Thank you for the meeting today and thank you @bdice for sharing this summary.
A host_mr parameter to the Parquet reading APIs would be consistent with the way we use the mr parameter today. For this approach I would love to sketch out the design changes to hostdevice_vector and pinned_host_vector that would be required to make use of an input host_mr. pinned_host_vector as in the prototype by @abellina seems to have the advantage of not adding additional plumbing to the IO modules that use the hostdevice_vector utility. I'm curious also to hear from @vuule, plus @nvdbaranec and @etseidl if you would like to weigh in.
The introduction of the host_mr option would require substantial code changes to pass it to all places where hostdevice_vectors are created. And this would still be limited to the PQ reader until we do the same for other components.
I don't think we know at this point whether this would bring additional benefit compared to a global API to set the resource. My proposal is to start with a global resource (that keeps the default behavior) and consider modifying the APIs once we see the effect of the global resource in production. Development would be pretty similar to the stream support (AFAICT): a global default, with parameters added to most APIs later, as was recently done for streams.
I have a separate concern about having a host_mr parameter that is not equivalent to the existing mr param. The pinned mem allocator does not change the output for the user; it's purely an internal optimization. But this is maybe just a naming issue, very secondary at this point.
I am more of a fan of having a global function that sets an rmm allocator to be used for pinned allocations. For now we could limit it to cuIO only (which would really mean just hostdevice_vector). Something similar to the way we initialize rmm (rmm::mr::set_current_device_resource(resource.get())). Maybe cudf::io::set_current_pinned_memory_resource?
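From the application's side, the setup might look something like the sketch below; the cudf::io setter is only the hypothetical name floated above, not an existing API, and the pool size is a placeholder:

```cpp
// Sketch of application-side setup, mirroring how the device resource is set
// with rmm::mr::set_current_device_resource today.
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

void configure_pinned_pool()
{
  static rmm::mr::pinned_host_memory_resource pinned_mr;
  static rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource> pinned_pool{
    &pinned_mr, std::size_t{1} << 30};  // 1 GiB, in line with the sizing discussion below

  // Hypothetical, per the suggestion above (does not exist yet):
  // cudf::io::set_current_pinned_memory_resource(&pinned_pool);
  // hostdevice_vector would then allocate its pinned buffers through this pool.
}
```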
The pinned mem allocator does not change the output for the user, it's purely an internal optimization
This is a good observation. My primary concern was about ensuring thread safety for the global/static approach (what happens if that allocator is changed while in use / before its previous allocations are freed?). If that's not a problem, then I'm okay with using a global/static pinned (pool) host memory resource. I thought the host_mr argument might give us more flexibility for our potential long-term needs, but I don't want to overengineer for that case given how little we rely on pinned memory today.
Just measured the pinned memory peak use in the Parquet reader. In the benchmarks, which create tables with 512MB of data, the largest peak pinned memory use I saw was about 3.8MB, with integer columns. Many cases use around 1MB. So we should be able to make great use of a <1GB pool even with many threads.
But isn't 512MB much smaller than real world use cases?
BTW, after @vuule and @nvdbaranec pointed it out, I also think a global config for an allocator used internally in cuIO is a much better place to start than plumbing it in everywhere for flexibility.
@vuule could spark-rapids still provide its own memory pool for this? We wouldn't want a default pinned allocator to be initialized and then torn down because it is going to be replaced by our own allocator.
But isn't 512MB much smaller than real world use cases?
Sure, but the measurements show that the pinned memory requirement is less than 1% of the device memory required to read a PQ file. Even when fully using a 50GB GPU we won't fill a 1GB pinned pool.
Oh, I guess I don't know how parquet reading works. :)
We are investigating using a pinned memory pool at the cuDF layer and replacing cudaFreeHost calls in pinned_host_vector, due to traces we have seen that indicate synchronization or a "lining up" of kernels during Parquet decode. Here's query88 from NDS at 3TB on our performance cluster running with an A100. In the nsys trace (pardon the number of streams), we can see Parquet nvcomp and decode kernels working in the first three quarters of the trace.
The bottom trace is cuDF without changes. The top trace is a modified cuDF where we replaced calls to cudaMallocHost and cudaFreeHost with allocate and deallocate against a modified RMM pool_memory_resource that isn't stream aware and has a single free list.
When we run with the modified cuDF, our NDS benchmark shows a 5% improvement at 3TB, and a 6% improvement between old cuDF and new cuDF if we allow all 16 Spark threads to submit work concurrently. In other words, we believe the cudaFreeHost calls specifically are preventing Parquet-heavy jobs from using more of the GPU, due to synchronization.
The proposal here is to allow a pinned memory pool to be passed to Parquet primarily, but there are probably other formats and areas in cuDF that might benefit from this.
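To make the swap concrete, here is a simplified sketch of routing pinned allocations through a pool instead of cudaMallocHost / cudaFreeHost. The prototype used a modified single-free-list pool_memory_resource; this sketch just uses the stock pool over pinned_host_memory_resource, and the function names are placeholders:

```cpp
// Sketch: replace per-buffer cudaMallocHost / cudaFreeHost with sub-allocation
// from a long-lived pinned pool, so freeing a staging buffer no longer hits the
// cudaFreeHost synchronization described above.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

using pinned_pool_t =
  rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource>;

// Before: cudaMallocHost(&ptr, bytes);
void* pinned_allocate(pinned_pool_t& pool, std::size_t bytes, rmm::cuda_stream_view stream)
{
  return pool.allocate(bytes, stream);
}

// Before: cudaFreeHost(ptr);
void pinned_deallocate(pinned_pool_t& pool,
                       void* ptr,
                       std::size_t bytes,
                       rmm::cuda_stream_view stream)
{
  pool.deallocate(ptr, bytes, stream);
}
```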
Note that another experiment we wanted to attempt was to remove pinned memory altogether, which cuDF already has a flag for (LIBCUDF_IO_PREFER_PAGEABLE_TMP_MEMORY=1), but we ran into issues with Parquet only (https://github.com/rapidsai/cudf/issues/14311). Before I found this flag, I had tried replacing cudaMallocHost and cudaFreeHost with malloc and free and ran into the same issue, so I think the Parquet code depends on some sort of synchronization in the CUDA host pinned memory allocator.