rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0
683 stars 181 forks

[BUG] Running wiki_all_88M on an NVIDIA A100 with raft-ann-bench crashes #2203

Open ftian1 opened 4 months ago

ftian1 commented 4 months ago

Describe the bug

Running the benchmark raises the following error on an NVIDIA A100 GPU:

raft_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/process_time/real_time ERROR OCCURRED: 'Failed to create an algo: std::bad_alloc: out_of_memory: RMM failure at:/sparse/miniconda3/envs/py310/include/rmm/mr/device/pool_memory_resource.hpp:313: Maximum pool size exceeded'
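
As a rough sanity check on the out-of-memory error above, the raw size of the dataset can be estimated with simple arithmetic. The sketch below assumes about 88 million float32 vectors with 768 dimensions each (the dimension and data type are assumptions, not stated in this thread) and compares that against 40 GiB and 80 GiB A100 variants. It does not model how raft-ann-bench or the RMM pool actually allocates memory, so it only indicates scale, not the cause of the failure.

```python
GIB = 1024 ** 3  # bytes per GiB


def raw_dataset_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a dense n_vectors x dim matrix, ignoring any index overhead."""
    return n_vectors * dim * bytes_per_value


# Assumed dataset shape: ~88M vectors, 768 dimensions, float32 (4 bytes).
N_VECTORS = 88_000_000
DIM = 768

footprint = raw_dataset_bytes(N_VECTORS, DIM)
footprint_gib = footprint / GIB

# Compare against the two common A100 memory sizes.
fits_on_gpu = {gpu_gib: footprint <= gpu_gib * GIB for gpu_gib in (40, 80)}

print(f"raw dataset size: {footprint_gib:.1f} GiB")
for gpu_gib, fits in fits_on_gpu.items():
    print(f"fits in {gpu_gib} GiB of device memory: {fits}")
```

Under these assumptions the raw vectors alone are several times larger than the memory of a single A100, so any code path that tries to hold the full dataset in one device memory pool would be expected to hit a limit like the one reported.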

Steps/Code to reproduce bug

python -m raft-ann-bench.run --dataset wiki_all_88M --dataset-path ./ --algorithms raft_cagra --build

Expected behavior

The benchmark run completes successfully.

Environment details

Bare-metal installation on Ubuntu. RAFT was installed with:

conda install -c rapidsai -c conda-forge raft-ann-bench-gpu

Slyne commented 3 months ago

I saw this error with the conda installation. When I switched to the Docker container (https://docs.rapids.ai/api/raft/stable/raft_ann_benchmarks/#docker), the issue disappeared.