rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] cudf JAVA binding failed Maximum pool size exceeded in LargeTableTest #16199

Closed pxLi closed 2 days ago

pxLi commented 6 days ago

Describe the bug

ai.rapids.cudf.LargeTableTest failed in cudf_nightly-dev runs 1286 and 1287 (the two most recent runs), on both CUDA 11 and CUDA 12 (A30 and L4 GPUs with 24 GB of memory).

[2024-07-04T12:47:34.687Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.0:test (default-test) on project cudf: There are test failures.
[2024-07-04T12:47:34.687Z] [ERROR] 
[2024-07-04T12:47:34.687Z] [ERROR] Please refer to /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-1287-cuda11/java/target/surefire-reports for the individual test results.
[2024-07-04T12:47:34.687Z] [ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, [date].dumpstream and [date]-jvmRun[N].dumpstream.
[2024-07-04T12:47:34.687Z] [ERROR] There was an error in the forked process
[2024-07-04T12:47:34.687Z] [ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-1287-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:255: Maximum pool size exceeded
[2024-07-04T12:47:34.687Z] [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[2024-07-04T12:47:34.687Z] [ERROR] Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-1287-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:255: Maximum pool size exceeded
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:658)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:533)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:278)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:244)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1194)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1022)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:868)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[2024-07-04T12:47:34.687Z] [ERROR]  at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.apache.maven.cli.MavenCli.execute(MavenCli.java:954)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
[2024-07-04T12:47:34.688Z] [ERROR]  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[2024-07-04T12:47:34.688Z] [ERROR]  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[2024-07-04T12:47:34.688Z] [ERROR]  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2024-07-04T12:47:34.688Z] [ERROR]  at java.lang.reflect.Method.invoke(Method.java:498)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
[2024-07-04T12:47:34.688Z] [ERROR]  at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

Expected behavior

The unit test passes.

davidwendt commented 5 days ago

This looks very similar to the error I got when running CI while working on #16037: https://github.com/rapidsai/cudf/actions/runs/9649886400/job/26623988951#step:9:2361

Can you check that your libcudf CMake build sets -DCUDF_LARGE_STRINGS_DISABLED? Otherwise, you can disable large strings by setting the environment variable, as done here: https://github.com/rapidsai/cudf/blob/37defc6b943094921200146c5f6042a91e68c75a/ci/test_java.sh#L43

Or perhaps this is something else but the error looked very similar.
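
For anyone reproducing this locally, the environment-variable workaround could be sketched roughly as below. This is only an illustration, not what the CI script does: the variable name `LIBCUDF_LARGE_STRINGS_ENABLED` and the value `0` are assumptions (check the linked line of ci/test_java.sh for the actual setting), and `LargeStringsEnv` and `testCommand` are hypothetical names. The sketch only prepares the Maven command with the variable set; it relies on the forked surefire JVM inheriting the environment of the Maven process.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LargeStringsEnv {
    // Assumption: libcudf reads this variable at runtime and "0" turns
    // large strings off. Verify against ci/test_java.sh before relying on it.
    static final String LARGE_STRINGS_VAR = "LIBCUDF_LARGE_STRINGS_ENABLED";

    // Builds (but does not start) a process for the given command with the
    // large-strings variable set to "0" in the child's environment.
    static ProcessBuilder testCommand(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(new ArrayList<>(command));
        pb.environment().put(LARGE_STRINGS_VAR, "0");
        pb.inheritIO(); // show Maven output in the current console
        return pb;
    }

    public static void main(String[] args) {
        // -Dtest is the standard surefire filter for running a single test class.
        ProcessBuilder pb = testCommand(
            Arrays.asList("mvn", "test", "-Dtest=LargeTableTest"));
        System.out.println("Command: " + pb.command());
        System.out.println(LARGE_STRINGS_VAR + "="
            + pb.environment().get(LARGE_STRINGS_VAR));
        // To actually launch the run from the java/ directory:
        // int exit = pb.start().waitFor();
    }
}
```

If the test passes with the variable set and fails without it, that would point at large strings rather than some other allocation problem.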

cc @jlowe

jlowe commented 2 days ago

Thanks for taking a look, @davidwendt. In #16037 I missed that we need to add the CUDF_LARGE_STRINGS_DISABLED flag to the Java pom. I added it to spark-rapids-jni, which we use for the Spark plugin, but missed it for the cudf jar artifact. I'll post a PR shortly.