rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Strange memory error in umap benchmarks #4607

Open cjnolet opened 2 years ago

cjnolet commented 2 years ago

While running the benchmarks notebook for 22.04, I encountered a strange failure in the umap-unsupervised benchmark:

Failed to run with 16384 samples, 1000 features: std::bad_alloc: CUDA error at: /home/cjnolet/miniconda3/envs/cuml_2204_022222_2/include/rmm/mr/device/cuda_memory_resource.hpp
Failed to run with 16384 samples, 10000 features: CURAND_STATUS_INITIALIZATION_FAILED
Failed to run with 32768 samples, 1000 features: CURAND_STATUS_INITIALIZATION_FAILED
Failed to run with 32768 samples, 10000 features: CURAND_STATUS_INITIALIZATION_FAILED
Failed to run with 65536 samples, 1000 features: CURAND_STATUS_INITIALIZATION_FAILED
Failed to run with 65536 samples, 10000 features: CURAND_STATUS_INITIALIZATION_FAILED

I was able to reproduce this error by running the same benchmarking code outside of Jupyter:

import pandas as pd
from cuml.benchmark.runners import SpeedupComparisonRunner
from cuml.benchmark.algorithms import algorithm_by_name

VERBOSE=True
RUN_CPU=False

def enrich_result(algorithm, runner, result):
    result["algo"] = algorithm
    result["dataset_name"] = runner.dataset_name
    result["input_type"] = runner.input_type
    return result

def execute_benchmark(algorithm, runner, verbose=VERBOSE, run_cpu=RUN_CPU, **kwargs):
    results = runner.run(algorithm_by_name(algorithm), verbose=verbose, run_cpu=run_cpu, **kwargs)
    results = [enrich_result(algorithm, runner, result) for result in results]
    benchmark_results.extend(results)

def run():

    import cuml
    import pandas as pd

    from cuml.benchmark.runners import SpeedupComparisonRunner
    from cuml.benchmark.algorithms import algorithm_by_name

    import warnings
    warnings.filterwarnings('ignore', 'Expected column ')

    print(cuml.__version__)

    N_REPS = 3  # Number of times each test is repeated

    DATA_NEIGHBORHOODS = "blobs"
    DATA_CLASSIFICATION = "classification"
    DATA_REGRESSION = "regression"

    INPUT_TYPE = "numpy"

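    # execute_benchmark() appends to this list, so it has to live at module scope.
    global benchmark_results
    benchmark_results = []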

    SMALL_ROW_SIZES = [2**x for x in range(14, 17)]
    LARGE_ROW_SIZES = [2**x for x in range(18, 24, 2)]

    SKINNY_FEATURES = [32, 256]
    WIDE_FEATURES = [1000, 10000]

    runner = cuml.benchmark.runners.SpeedupComparisonRunner(
        bench_rows=SMALL_ROW_SIZES,
        bench_dims=WIDE_FEATURES,
        dataset_name=DATA_NEIGHBORHOODS,
        input_type=INPUT_TYPE,
        n_reps=N_REPS
    )

    execute_benchmark("UMAP-Unsupervised", runner)

if __name__ == "__main__":
    run()
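
For context on the sizes involved, the sketch below enumerates the row/feature combinations implied by `SMALL_ROW_SIZES` and `WIDE_FEATURES` and estimates the size of the dense input matrix for each. It assumes 4-byte (float32) values and counts only the input itself, not the kNN graph or any other intermediate buffers, so it is a rough lower bound rather than a measurement. The helper names are made up for illustration.

```python
from itertools import product

SMALL_ROW_SIZES = [2**x for x in range(14, 17)]  # 16384, 32768, 65536
WIDE_FEATURES = [1000, 10000]
BYTES_PER_VALUE = 4  # assumption: float32 input


def dense_input_bytes(n_rows, n_cols, itemsize=BYTES_PER_VALUE):
    """Size in bytes of a dense n_rows x n_cols matrix."""
    return n_rows * n_cols * itemsize


def benchmark_grid(rows=SMALL_ROW_SIZES, dims=WIDE_FEATURES):
    """Return (n_rows, n_cols, size_in_MiB) for every row/feature combination."""
    return [
        (r, d, dense_input_bytes(r, d) / 2**20)
        for r, d in product(rows, dims)
    ]


if __name__ == "__main__":
    for n_rows, n_cols, mib in benchmark_grid():
        print(f"{n_rows:>6} samples x {n_cols:>5} features: {mib:8.1f} MiB")
```

Under that assumption the smallest failing case (16384 x 1000) is only 62.5 MiB of input and the largest (65536 x 10000) is 2500 MiB.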

This throws the following error in compute-sanitizer:

========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 4 bytes
=========     at 0x200 in void raft::sparse::linalg::detail::coo_symmetrize_kernel<(int)128, float, void UMAPAlgo::FuzzySimplSet::Naive::launcher<(int)256, long, float>(int, const T2 *, const T3 *, int, raft::sparse::detail::COO<T3, int> *, ML::UMAPParams *, CUstream_st *)::[lambda(int, int, float, float) (instance 1)]>(int *, int *, int *, T2 *, int *, int *, T2 *, int, int, T3)
=========     by thread (77,0,0) in block (25,0,0)
=========     Address 0x7f51ef623200 is out of bounds
=========     and is 12801 bytes after the nearest allocation at 0x7f51ef600000 of size 131072 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x21740c]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x1402c]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_022222_2/lib/python3.9/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x68808]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_022222_2/lib/python3.9/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:void raft::sparse::linalg::detail::coo_symmetrize<128, float, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(int, long const*, float const*, int, raft::sparse::detail::COO<float, int>*, ML::UMAPParams*, CUstream_st*), &(void UMAPAlgo::FuzzySimplSet::Naive::launcher<256, long, float>(int, long const*, float const*, int, raft::sparse::detail::COO<float, int>*, ML::UMAPParams*, CUstream_st*)), 1u>, float> >(raft::sparse::detail::COO<float, int>*, raft::sparse::detail::COO<float, int>*, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(int, long const*, float const*, int, raft::sparse::detail::COO<float, int>*, ML::UMAPParams*, CUstream_st*), &(void UMAPAlgo::FuzzySimplSet::Naive::launcher<256, long, float>(int, long const*, float const*, int, raft::sparse::detail::COO<float, int>*, ML::UMAPParams*, CUstream_st*)), 1u>, float>, CUstream_st*) [0xb94345]
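
The numbers in the report above can be checked with a little arithmetic. This is only a sketch: it assumes the faulting read indexes 4-byte elements starting from the base of the nearest allocation, which the report does not confirm, and `oob_summary` is a hypothetical helper.

```python
def oob_summary(fault_address, alloc_base, alloc_size, elem_size):
    """Relate an out-of-bounds address to its nearest allocation."""
    offset = fault_address - alloc_base
    return {
        "offset_from_base": offset,                  # bytes from start of allocation
        "bytes_past_end": offset - alloc_size,       # bytes beyond the allocation's end
        "element_index": offset // elem_size,        # index the thread tried to read
        "element_capacity": alloc_size // elem_size, # elements the allocation can hold
    }


if __name__ == "__main__":
    summary = oob_summary(
        fault_address=int("0x7f51ef623200", 16),
        alloc_base=int("0x7f51ef600000", 16),
        alloc_size=131072,
        elem_size=4,  # "read of size 4 bytes"
    )
    for key, value in summary.items():
        print(f"{key}: {value}")
```

With these inputs the read lands 143872 bytes from the base, 12800 bytes past the end of the 131072-byte allocation. The report says 12801, which would be consistent with the tool counting from the last valid byte, though that is a guess. In 4-byte elements that is index 35968 in a buffer with room for 32768.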

I am unable to reproduce the error outside of the benchmark harness. For example, this code doesn't have any errors in compute-sanitizer:

if __name__ == "__main__":
    from cuml.datasets import make_blobs
    from cuml.manifold import UMAP as cuUMAP

    n_samples = 16000
    n_features = 1000

    for i in range(1):
        data, labels = make_blobs(n_samples=n_samples, n_features=n_features,
                              centers=10, random_state=42)

        #    model = umap.UMAP(n_neighbors=10, min_dist=0.1)
        cuml_model = cuUMAP(n_neighbors=10, min_dist=0.01)

        #    embedding = model.fit_transform(data)
        cuml_embedding = cuml_model.fit_transform(data, convert_dtype=True)

The code above does trip the compute-sanitizer --tool=initcheck, however:

=========                in /home/cjnolet/miniconda3/envs/cuml_2204_022222_2/bin/python
========= 
========= Uninitialized __global__ memory read of size 8 bytes
=========     at 0x1e0 in void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_copy::functor<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> *, thrust::pointer<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::execute_on_stream, thrust::use_default, thrust::use_default>>, long>, thrust::cuda_cub::__uninitialized_copy::functor<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> *, thrust::pointer<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::execute_on_stream, thrust::use_default, thrust::use_default>>, long>(T2, T3)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f8a4e41f400
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x21740c]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x1402c]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_022222_2/lib/python3.9/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x68808]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_022222_2/lib/python3.9/site-packages/cupy_backends/cuda/api/../../../../../libcudart.so.11.0
=========     Host Frame:void thrust::cuda_cub::parallel_for<thrust::cuda_cub::execute_on_stream, thrust::cuda_cub::__uninitialized_copy::functor<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>*, thrust::pointer<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::execute_on_stream, thrust::use_default, thrust::use_default> >, long>(thrust::cuda_cub::execution_policy<thrust::cuda_cub::execute_on_stream>&, thrust::cuda_cub::__uninitialized_copy::functor<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>*, thrust::pointer<thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::execute_on_stream, thrust::use_default, thrust::use_default> >, long) [0xa05f9f]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_021922/lib/libcuml++.so
=========     Host Frame:thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>* thrust::cuda_cub::__copy::cross_system_copy_n<thrust::cuda_cub::execute_on_stream, thrust::system::cpp::detail::tag, thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>*, long, thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>*>(thrust::cuda_cub::execution_policy<thrust::cuda_cub::execute_on_stream>&, thrust::system::cpp::detail::execution_policy<thrust::system::cpp::detail::tag>&, thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>*, long, thrust::tuple<float, long, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>*, thrust::detail::integral_constant<bool, false>) [clone .constprop.0] [0xb79b4f]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_021922/lib/libcuml++.so
=========     Host Frame:thrust::device_ptr<float const> thrust::cuda_cub::max_element<thrust::cuda_cub::execute_on_stream, thrust::device_ptr<float const>, thrust::less<float> >(thrust::cuda_cub::execution_policy<thrust::cuda_cub::execute_on_stream>&, thrust::device_ptr<float const>, thrust::device_ptr<float const>, thrust::less<float>) [0xba3a10]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_021922/lib/libcuml++.so
=========     Host Frame:void UMAPAlgo::SimplSetEmbed::Algo::launcher<256, float>(int, int, raft::sparse::detail::COO<float, int>*, ML::UMAPParams*, float*, CUstream_st*) [0xbc18ad]
=========                in /home/cjnolet/miniconda3/envs/cuml_2204_021922/lib/libcuml++.so
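
For anyone re-running this, the two sanitizer invocations discussed above differ only in the `--tool` flag (with no flag, compute-sanitizer runs its default memcheck tool). A minimal sketch that builds both command lines follows; `repro.py` is a placeholder script name and `sanitizer_command` is a hypothetical helper, neither taken from the report.

```python
import shlex
import sys


def sanitizer_command(script, tool=None, python=sys.executable):
    """Build an argv list that runs `script` under compute-sanitizer."""
    cmd = ["compute-sanitizer"]
    if tool is not None:
        cmd.append(f"--tool={tool}")  # e.g. initcheck; omitted means memcheck
    cmd += [python, script]
    return cmd


if __name__ == "__main__":
    # "repro.py" is a placeholder for whichever script is being checked.
    for tool in (None, "initcheck"):
        print(shlex.join(sanitizer_command("repro.py", tool)))
```

The printed lines can be pasted into a shell, or the lists passed to `subprocess.run` on a machine with a GPU and the CUDA toolkit installed.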

We encountered a similar issue in TSNE with one of cub's max/min reduction functions. I'm not sure whether that is the problem here. This error is thrown in the SimplicialSetEmbed step, while the previous error from the benchmarks is thrown earlier, in the FuzzySimplicialSet step.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.