rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] cuDF JNI does not set RMM_LOGGING_LEVEL #15417

Open abellina opened 6 months ago

abellina commented 6 months ago

In cuDF, we build object files that reference RMM headers to create memory resources, such as the pool_memory_resource.

One flag, RMM_LOGGING_LEVEL, is not being set in cuDF JNI, and this appears to cause extra logging when the new pinned memory pool is exhausted (that is, when we run out of pinned memory). In this case, RMM logs at error level: maximum pool size was exceeded.

We'd like to find a solution for 24.04. In the meantime, we are also looking at other ways of setting the flag from spark-rapids-jni.

abellina commented 6 months ago

Related bug on the C++ side: https://github.com/rapidsai/cudf/issues/15416

abellina commented 6 months ago

The problem with this flag not being set is that we get an rmm_log.txt with lines like this:

[ 61931][10:48:59:493968][error ] [A][Stream 0x0][Upstream 506799616B][FAILURE maximum pool size exceeded]

each time we run out of pinned memory. This is undesirable in production environments: our pool is meant to be opportunistic, and these failures just result in pageable memory being allocated instead.
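To make the "opportunistic" behavior concrete, here is a minimal sketch in Python of the pattern being described: try the pinned pool first, and treat exhaustion as a normal event that falls back to pageable memory. This is only a toy model, not the cuDF or RMM implementation; `PinnedPool`, `PoolExhausted`, and `allocate_host_buffer` are hypothetical names invented for illustration, and the choice of debug-level logging is an assumption about what a quieter setting might look like.

```python
import logging

logger = logging.getLogger("pinned_pool_sketch")


class PoolExhausted(Exception):
    """Raised when the hypothetical pinned pool cannot satisfy a request."""


class PinnedPool:
    """Toy fixed-capacity pool standing in for a pinned memory pool."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0

    def allocate(self, size_bytes):
        # Refuse the request if it would push the pool past its maximum size.
        if self.used_bytes + size_bytes > self.capacity_bytes:
            raise PoolExhausted("maximum pool size exceeded")
        self.used_bytes += size_bytes
        return ("pinned", size_bytes)


def allocate_host_buffer(pool, size_bytes):
    """Try the pinned pool first; fall back to pageable memory if it is full."""
    try:
        return pool.allocate(size_bytes)
    except PoolExhausted as exc:
        # Exhaustion is expected in normal operation, so it is logged at
        # debug level here (an assumption) rather than at error level.
        logger.debug("pinned pool exhausted (%s); using pageable memory", exc)
        return ("pageable", size_bytes)


pool = PinnedPool(capacity_bytes=1024)
first = allocate_host_buffer(pool, 800)   # fits in the pool
second = allocate_host_buffer(pool, 800)  # pool is full, so this falls back
```

In this model the second request still succeeds, just with pageable memory, which is why an error-level log entry for each such event is noise rather than a signal of a real failure.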