abellina opened this issue 6 months ago
Related bug but in cpp: https://github.com/rapidsai/cudf/issues/15416
The issue with this flag not being set is that we get an rmm_log.txt with lines like this:
[ 61931][10:48:59:493968][error ] [A][Stream 0x0][Upstream 506799616B][FAILURE maximum pool size exceeded]
each time we run out of pinned memory. This is not desirable for production environments: our pool is meant to be opportunistic, and these failures simply result in pageable memory being allocated instead.
In cuDF we build with object files that reference RMM headers to create memory resources, such as the pool_memory_resource.
There is a flag that is not being set in the cuDF JNI build, `RMM_LOGGING_LEVEL`, which appears to be causing extra logging when the new pinned memory pool is exhausted (we run out of pinned memory). In this case, RMM logs at `error` level: `maximum pool size exceeded`.

We'd like to find a solution for 24.04. In the meantime, we are also looking at other ways of setting the flag from spark-rapids-jni.
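As a sketch of the kind of fix being discussed: `RMM_LOGGING_LEVEL` is a compile-time definition, so it would need to be passed when the JNI native code is configured. The CMake invocation below is illustrative only; the build directory and how the cuDF JNI build scripts actually forward cache variables are assumptions, not confirmed details of the build.

```
# Hypothetical: raise RMM's compile-time logging threshold so that
# expected pool-exhaustion errors are compiled out of the JNI library.
# The source/build paths here are placeholders.
cmake -B build -DRMM_LOGGING_LEVEL=OFF
```

Setting the level to `OFF` (rather than, say, `CRITICAL`) removes all RMM logging from the binary, which matches the intent that opportunistic pool exhaustion is not an error worth recording.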