abellina opened this issue 1 year ago
The new design for cuda::mr::memory_resource, which expands support in RMM to include pinned host memory pools, is tracked in https://github.com/rapidsai/rmm/pull/1095.
Note that cuda::mr::memory_resource doesn't directly expand support to include pinned host memory pools; it just enables us to more easily reuse the implementation of stream-ordered device memory pools for other kinds of device-accessible memory.
@abellina It's really hard to tell in the image how much time is saved by this change. Can you provide comparable benchmark results?
Sure, this is NDS @ 3TB on our performance cluster; the benchmark was executed 3 times each for baseline and test. We see around a 6% improvement overall. I need to look at query95 more because the comparison found a regression there, but overall this was a win:
--------------------------------------------------------------------
Name = query2
Means = 3088.3333333333335, 2494.6666666666665
Time diff = 593.666666666667
Speedup = 1.2379743452699092
T-Test (test statistic, p value, df) = 7.8686774190832445, 0.0014098464938042464, 4.0
T-Test Confidence Interval = 384.1927204113131, 803.1406129220209
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query9
Means = 11011.666666666666, 8408.0
Time diff = 2603.666666666666
Speedup = 1.3096653980336188
T-Test (test statistic, p value, df) = 8.329919755519573, 0.0011349387959612939, 4.0
T-Test Confidence Interval = 1735.8386701912746, 3471.4946631420576
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query23_part2
Means = 23309.0, 22541.666666666668
Time diff = 767.3333333333321
Speedup = 1.0340406654343808
T-Test (test statistic, p value, df) = 3.5857045809141983, 0.023049979254582895, 4.0
T-Test Confidence Interval = 173.17984709011603, 1361.4868195765482
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query28
Means = 8749.666666666666, 7877.0
Time diff = 872.6666666666661
Speedup = 1.1107866785155092
T-Test (test statistic, p value, df) = 3.675737490963056, 0.021283509364083526, 4.0
T-Test Confidence Interval = 213.50341001605773, 1531.8299233172743
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query31
Means = 3846.3333333333335, 3217.6666666666665
Time diff = 628.666666666667
Speedup = 1.1953796747125247
T-Test (test statistic, p value, df) = 3.476522797723856, 0.02543224663133342, 4.0
T-Test Confidence Interval = 126.59646864855881, 1130.7368646847751
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query74
Means = 5813.333333333333, 5295.0
Time diff = 518.333333333333
Speedup = 1.0978910922253697
T-Test (test statistic, p value, df) = 4.145944413928419, 0.014307288000811205, 4.0
T-Test Confidence Interval = 171.21723564533386, 865.4494310213322
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query75
Means = 8002.0, 7139.0
Time diff = 863.0
Speedup = 1.120885278050147
T-Test (test statistic, p value, df) = 6.9462130566740035, 0.0022563913623439248, 4.0
T-Test Confidence Interval = 518.0534649259677, 1207.9465350740325
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query83
Means = 11889.0, 10728.333333333334
Time diff = 1160.666666666666
Speedup = 1.1081870436538759
T-Test (test statistic, p value, df) = 7.983215413800722, 0.0013345137828189723, 4.0
T-Test Confidence Interval = 757.0038418026304, 1564.3294915307017
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query88
Means = 8288.333333333334, 6815.666666666667
Time diff = 1472.666666666667
Speedup = 1.2160708172348023
T-Test (test statistic, p value, df) = 14.785515131088996, 0.00012180808800452909, 4.0
T-Test Confidence Interval = 1196.1272209995159, 1749.206112333818
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query95
Means = 8023.0, 9527.333333333334
Time diff = -1504.333333333334
Speedup = 0.8421034217339584
T-Test (test statistic, p value, df) = -5.098045536832821, 0.006992118575497059, 4.0
T-Test Confidence Interval = -2323.6078748695036, -685.0587917971645
ALERT: significant change has been detected (p-value < 0.05)
ALERT: regression in performance has been observed
--------------------------------------------------------------------
Name = benchmark
Means = 418333.3333333333, 394333.3333333333
Time diff = 24000.0
Speedup = 1.0608622147083686
T-Test (test statistic, p value, df) = 5.622255427989818, 0.004920931820960854, 4.0
T-Test Confidence Interval = 12148.051368670785, 35851.948631329215
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
So what are the other ones with speedups of 21%, 19%, etc.?
Sorry, I should describe the benchmark better. This is NDS (https://github.com/NVIDIA/spark-rapids-benchmarks/tree/dev/nds), which we run in "power run" mode: the queries execute in series, one after the other, on a cluster running spark-rapids with 8x A100 GPUs. The results above show the queries with statistically significant regressions or speedups, plus a "benchmark" section at the end for the overall result (the sum of all query times, compared between baseline and test). I ran each set of tests three times, so the comparison tool looks at the means and the variance to figure out what is noise and what isn't.
Of the queries that show a significant speedup, query9 and query88 are two queries we know are parquet/scan bound. Looking at traces for these, you'll mostly find unsnap and parquet decode kernels. We see a lot of benefit here.
Is this helping with your question @harrism, or am I missing it?
Yes. So the 6% is overall average benefit. Good to know.
As of 24.02 you can create a pool_memory_resource<pinned_host_memory_resource>. I suggest that libcudf / cuIO add a cudf::get/set_current_pinned_host_memory_resource to use for this. This could also be added to RMM, but:
a) The static resource RMM currently stores for get/set_current_device_resource is one of the reasons RMM is not Windows compatible, so I'm averse to adding a second instance of that. libcudf, on the other hand, has a binary library component, so it can maintain that state without affecting platform compatibility.
b) I think it's good to try out this feature in libcudf and then move it lower in the stack later if it would be useful to other clients.
There is some design work around how and where to put this, how to wire up configuration knobs for the initial / maximum pool size, and how to expose that to Python and Spark.
There are other places where this pool could be useful, but let's start here.
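For concreteness, a minimal sketch of building such a pinned host pool with RMM 24.02+ (header paths and constructor arguments may vary slightly across RMM versions, and the pool sizes are arbitrary placeholders):

```cpp
// Sketch: a pool of pinned host memory built from stock RMM components (24.02+).
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

int main()
{
  rmm::mr::pinned_host_memory_resource pinned_mr;  // upstream, backed by cudaHostAlloc

  rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource> pinned_pool{
    &pinned_mr,
    std::size_t{256} << 20,  // initial pool size: 256 MiB (placeholder)
    std::size_t{1} << 30};   // maximum pool size: 1 GiB (placeholder)

  // Sub-allocations now come out of the pinned pool instead of individual
  // cudaHostAlloc / cudaFreeHost calls.
  void* buf = pinned_pool.allocate(std::size_t{1} << 20);
  pinned_pool.deallocate(buf, std::size_t{1} << 20);
  return 0;
}
```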
Thank you @harrism for kicking off the design discussion.
a) Would you please clarify the Windows compatibility issue? Is it the presence of a static object with certain properties in a header, or some other issue?
b) If libcudf added cudf::get/set_current_pinned_host_memory_resource, would you please share a bit more about the components in RMM this would be modeled after? I'm not familiar enough with RMM to understand the scope of this suggestion.
Thank you for your help
This issue describes a): https://github.com/rapidsai/rmm/issues/826
Basically, the function-local static works on Linux but not on Windows, because each binary gets its own instance, so they won't be the same across DLLs.
Although I suppose adding a duplicate of the incompatibility reason doesn't make it more incompatible; it's just more of the same.
I still think it should be tested in libcudf first.
For b), the answer is to model it after get/set_current_device_resource(), which is the function with the function-local static.
But mostly I was suggesting that libcudf should implement it the way libcudf wants it to be for libcudf.
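For reference, a minimal illustration of the function-local-static pattern under discussion and why it breaks across Windows DLLs. This is a sketch in the spirit of rmm::mr::get/set_current_device_resource, not RMM's actual implementation, and the names are placeholders:

```cpp
// Illustration only: a get/set pair backed by a function-local static. In a
// header-only library on Windows, each DLL that includes this gets its own
// copy of the static, so the "current" resource is no longer process-wide
// (see rapidsai/rmm#826).
#include <rmm/mr/device/device_memory_resource.hpp>

namespace sketch {

inline rmm::mr::device_memory_resource*& current_resource()
{
  static rmm::mr::device_memory_resource* resource{nullptr};  // one copy per DLL on Windows
  return resource;
}

inline rmm::mr::device_memory_resource* set_current_resource(
  rmm::mr::device_memory_resource* new_mr)
{
  auto* previous = current_resource();
  current_resource() = new_mr;  // real code would also need synchronization
  return previous;
}

}  // namespace sketch
```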
Thank you too!
I just discussed this idea a bit more with @abellina, @mattahrens, @GregoryKimball, and @vuule.
We're leaning towards adding a host_mr parameter to the Parquet reading APIs. We're leaning away from using a global/static host allocator because we want this behavior to be friendly and simple when dealing with multiple threads, as Spark does. Also, callers may want to use different host allocators.
The host_mr would be optional and would use a non-pooled allocator by default. This avoids some of the concerns about determining a default host pool size and how that would be handled by Python cudf and other consumers of the library. The host_mr should allow any of: a pinned host allocator (which is effectively the current behavior), a pooling pinned host allocator, or possibly even an adapter that implements an "opportunistically pinned pool" (Spark's desired use case; this would attempt to allocate pinned memory and fall back to paged memory rather than failing to allocate).
@abellina is going to investigate a bit further and pursue an implementation for review.
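For illustration, a rough sketch of the "opportunistically pinned" fallback idea. It does not implement the RMM / cuda::mr resource interfaces a real host_mr would need, and the class name is a placeholder:

```cpp
// Rough sketch of an "opportunistically pinned" host allocator: try pinned
// memory first and fall back to pageable memory instead of throwing.
#include <cuda_runtime_api.h>

#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <unordered_set>

class opportunistic_pinned_allocator {  // placeholder name
 public:
  void* allocate(std::size_t bytes)
  {
    void* ptr = nullptr;
    if (cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault) == cudaSuccess) {
      std::lock_guard<std::mutex> lock(mtx_);
      pinned_.insert(ptr);
      return ptr;
    }
    cudaGetLastError();        // clear the error left by the failed pinned allocation
    ptr = std::malloc(bytes);  // fall back to pageable host memory rather than failing
    if (ptr == nullptr) { throw std::bad_alloc{}; }
    return ptr;
  }

  void deallocate(void* ptr)
  {
    bool was_pinned = false;
    {
      std::lock_guard<std::mutex> lock(mtx_);
      was_pinned = pinned_.erase(ptr) > 0;
    }
    if (was_pinned) {
      cudaFreeHost(ptr);
    } else {
      std::free(ptr);
    }
  }

 private:
  std::mutex mtx_;
  std::unordered_set<void*> pinned_;  // tracks which pointers are actually pinned
};
```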
Sounds good. Because rmm::mr::pinned_host_memory_resource is not a device_memory_resource, you will need to use the new rmm::device_async_resource_ref for the host_mr parameter in the Parquet reading APIs. We will be transitioning to that in all of RAPIDS, and in the interim this will make the parameter compatible with both legacy device_memory_resource / host_memory_resource MRs AND the newer MRs like pinned_host_memory_resource that just implement the cuda::async_memory_resource policy.
For an example, see #14873
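As a sketch of what such a parameter could look like (the function and namespace names are placeholders, not the actual libcudf API, and the resource_ref header path may vary by RMM version):

```cpp
// Hypothetical shape of a reader entry point whose host_mr is typed as
// rmm::device_async_resource_ref. The ref binds to legacy
// device_memory_resource-derived MRs as well as newer resources such as
// rmm::mr::pinned_host_memory_resource.
#include <rmm/mr/pinned_host_memory_resource.hpp>
#include <rmm/resource_ref.hpp>

namespace cudf_sketch {  // placeholder namespace, not the real libcudf API

void read_parquet_with_host_mr(/* reader options ..., */
                               rmm::device_async_resource_ref host_mr)
{
  // ... the reader would allocate its pinned staging buffers through host_mr ...
  (void)host_mr;
}

}  // namespace cudf_sketch

void example()
{
  rmm::mr::pinned_host_memory_resource pinned_mr;
  cudf_sketch::read_parquet_with_host_mr(pinned_mr);  // implicit conversion to the ref
}
```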
One thing to be aware of for deciding your initial pool size is that if you make it too small, you will get unnecessary fragmentation. The reason is that the RMM pool MR grows by just allocating a new chunk. This isn't TOO bad because it uses a geometric growth strategy, but allocations that don't fit will cause new allocations that can't be merged with the previous pool chunks, hence fragmentation.
Thank you for the meeting today and thank you @bdice for sharing this summary.
A host_mr parameter to the Parquet reading APIs would be consistent with the way we use the mr parameter today. For this approach I would love to sketch out the design changes to hostdevice_vector and pinned_host_vector that would be required to make use of an input host_mr. pinned_host_vector as in the prototype by @abellina seems to have the advantage of not adding additional plumbing to the IO modules that use the hostdevice_vector utility. I'm curious also to hear from @vuule, plus @nvdbaranec and @etseidl if you would like to weigh in.
The introduction of the host_mr option would require substantial code changes to pass it to all places where hostdevice_vectors are created. And this would still be limited to the PQ reader until we do the same for other components.
I don't think we know at this point whether this would bring additional benefit compared to a global API to set the resource. My proposal is to start with a global resource (that keeps the default behavior) and consider modifying the APIs once we see the effect of the global resource in production. Development would be pretty similar to the stream support (AFAICT): a global default, with parameters added to most APIs later, as was recently done for streams.
I have a separate concern about having a host_mr parameter that is not equivalent to the existing mr param. The pinned mem allocator does not change the output for the user; it's purely an internal optimization. But this is maybe just a naming issue, very secondary at this point.
I am more of a fan of having a global function that sets an rmm allocator to be used for pinned allocations. For now we could limit it to cuIO only (which would really mean just hostdevice_vector). Something similar to the way we initialize rmm (rmm::mr::set_current_device_resource(resource.get())). Maybe cudf::io::set_current_pinned_memory_resource?
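From the application's side, the setup might look something like the sketch below; the cudf::io setter is only the hypothetical name floated above, not an existing API, and the pool size is a placeholder:

```cpp
// Sketch of application-side setup, mirroring how the device resource is set
// with rmm::mr::set_current_device_resource today.
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

void configure_pinned_pool()
{
  static rmm::mr::pinned_host_memory_resource pinned_mr;
  static rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource> pinned_pool{
    &pinned_mr, std::size_t{1} << 30};  // 1 GiB, in line with the sizing discussion below

  // Hypothetical, per the suggestion above (does not exist yet):
  // cudf::io::set_current_pinned_memory_resource(&pinned_pool);
  // hostdevice_vector would then allocate its pinned buffers through this pool.
}
```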
The pinned mem allocator does not change the output for the user, it's purely an internal optimization
This is a good observation. My primary concern was about ensuring thread safety for the global/static approach (what happens if that allocator is changed while in use / before its previous allocations are freed?). If that's not a problem, then I'm okay with using a global/static pinned (pool) host memory resource. I thought the host_mr argument might give us more flexibility for our potential long-term needs, but I don't want to overengineer for that case given how little we rely on pinned memory today.
Just measured the pinned memory peak use in the Parquet reader. In the benchmarks, which create tables with 512MB of data, the largest peak pinned memory use I saw was about 3.8MB, with integer columns. Many cases use around 1MB. So we should be able to make great use of a <1GB pool even with many threads.
But isn't 512MB much smaller than real world use cases?
BTW, after @vuule and @nvdbaranec pointed it out, I also think a global config for an allocator used internally in cuIO is a much better place to start than plumbing it in everywhere for flexibility.
@vuule could spark-rapids still provide its own memory pool for this? We wouldn't want a default pinned allocator to be initialized and then torn down because it is going to be replaced by our own allocator.
But isn't 512MB much smaller than real world use cases?
Sure, but the measurements show that the pinned memory requirement is less than 1% of the device memory required to read a PQ file. Even when fully using a 50GB GPU we won't fill a 1GB pinned pool.
Oh, I guess I don't know how parquet reading works. :)
We are investigating using a pinned memory pool at the cuDF layer and replacing cudaFreeHost calls in pinned_host_vector, due to traces we have seen that indicate synchronization or a "lining up" of kernels during Parquet decode. Here's query88 from NDS at 3TB on our performance cluster running with an A100. In the nsys trace (pardon the number of streams), we can see Parquet nvcomp and decode kernels working in the first three quarters of the trace.
The bottom trace is cuDF without changes. The top trace is a modified cuDF where we replaced calls to cudaMallocHost and cudaFreeHost with allocate and deallocate against a modified RMM pool_memory_resource that isn't stream aware and has a single free list.
When we run with the modified cuDF, our NDS benchmark shows a 5% improvement at 3TB, and a 6% improvement between old cuDF and new cuDF if we allow all 16 Spark threads to submit work concurrently. In other words, we believe the cudaFreeHost calls specifically are preventing Parquet-heavy jobs from using more of the GPU, due to synchronization.
The proposal here is to allow a pinned memory pool to be passed to Parquet primarily, but there are probably other formats and areas in cuDF that might benefit from this.
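To make the swap concrete, here is a simplified sketch of routing pinned allocations through a pool instead of cudaMallocHost / cudaFreeHost. The prototype used a modified single-free-list pool_memory_resource; this sketch just uses the stock pool over pinned_host_memory_resource, and the function names are placeholders:

```cpp
// Sketch: replace per-buffer cudaMallocHost / cudaFreeHost with sub-allocation
// from a long-lived pinned pool, so freeing a staging buffer no longer hits the
// cudaFreeHost synchronization described above.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/pinned_host_memory_resource.hpp>

#include <cstddef>

using pinned_pool_t =
  rmm::mr::pool_memory_resource<rmm::mr::pinned_host_memory_resource>;

// Before: cudaMallocHost(&ptr, bytes);
void* pinned_allocate(pinned_pool_t& pool, std::size_t bytes, rmm::cuda_stream_view stream)
{
  return pool.allocate(bytes, stream);
}

// Before: cudaFreeHost(ptr);
void pinned_deallocate(pinned_pool_t& pool,
                       void* ptr,
                       std::size_t bytes,
                       rmm::cuda_stream_view stream)
{
  pool.deallocate(ptr, bytes, stream);
}
```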
Note that another experiment we wanted to attempt was to remove pinned memory altogether, which cuDF already has a flag for (LIBCUDF_IO_PREFER_PAGEABLE_TMP_MEMORY=1), but we ran into issues with Parquet only (https://github.com/rapidsai/cudf/issues/14311). Before I found this flag, I had tried replacing cudaMallocHost and cudaFreeHost with malloc and free and ran into the same issue, so I think the Parquet code depends on some sort of synchronization in the CUDA host pinned memory allocator.