rapidsai/cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

Nightly memcheck failure caused by compute-sanitizer bug #15258

Open · davidwendt opened 4 months ago

davidwendt commented 4 months ago

Describe the issue

Nightly builds are failing due to memcheck errors in specific gtests. The errors appear to be caused by a compute-sanitizer tool bug, which has been opened as nvbug 4553815. This issue documents the problem while we work on possible workarounds until the bug is fixed.

The two errors appear as follows:

[ RUN      ] NumericValueIteratorTest/1.non_null_iterator
========= Invalid __shared__ read of size 16 bytes
=========     at 0x9670 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, thrust::minimum<void>>::Policy600, short *, short *, unsigned int, thrust::minimum<void>, short, short>(T2, T3, T4, T5, T6)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331d50]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x14fb4]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x70aae]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<short*, short*, thrust::minimum<void>, short, int>(void*, unsigned long&, short*, short*, int, thrust::minimum<void>, short, CUstream_st*) [clone .isra.0] [0x2fa199]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/ITERATOR_TEST
[ RUN      ] MinMaxReductionTest/0.MinMaxTypes
========= Invalid __shared__ read of size 16 bytes
=========     at 0x4310 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>>::Policy600, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short *, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, short>(T2, T3, T4, T5, T6)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x8 is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331d50]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x14fb4]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x70aae]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
=========     Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, int>(void*, unsigned long&, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, CUstream_st*) [clone .isra.0] [0x18950ae]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)::{lambda()#2}::operator()() const [0x18984c3]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:std::unique_ptr<cudf::scalar, std::default_delete<cudf::scalar> > cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1898a70]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduction::detail::min(cudf::column_view const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x187ea46]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:decltype(auto) cudf::detail::aggregation_dispatcher<cudf::reduction::detail::reduce_dispatch_functor, cudf::reduce_aggregation const&>(cudf::aggregation::Kind, cudf::reduction::detail::reduce_dispatch_functor&&, cudf::reduce_aggregation const&) [0x193431e]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduction::detail::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1934d71]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:cudf::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, rmm::mr::device_memory_resource*) [0x193583f]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
=========     Host Frame:std::pair<short, bool> ReductionTest<short>::reduction_test<short>(cudf::column_view const&, cudf::reduce_aggregation const&, std::optional<cudf::data_type>) [clone .constprop.0] [0x28ec47]
=========                in /opt/conda/envs/test/bin/gtests/libcudf/./REDUCTIONS_TEST

If these were real errors, they should also appear when running without compute-sanitizer. The nvbug report includes a small reproducer that shows the error without any libcudf-specific code.

Steps/Code to reproduce

compute-sanitizer --tool memcheck gtests/ITERATOR_TEST --gtest_filter=NumericValueIteratorTest/1.non_null_iterator --rmm_mode=cuda
compute-sanitizer --tool memcheck gtests/REDUCTIONS_TEST --gtest_filter=MinMaxReductionTest/0.MinMaxTypes --rmm_mode=cuda

Note that the failure occurs only for int16 (short) integer types when doing a min-reduction through CUB.
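The nvbug reproducer itself is not included in this issue. As a stand-in, here is a hypothetical minimal sketch of the failing pattern, a CUB min-reduction over int16 values using only public CUB/Thrust APIs; the sizes and fill values are illustrative, not taken from the nvbug report:

```cpp
// Hypothetical standalone sketch of the failing pattern: a CUB
// min-reduction over int16_t. Compile with nvcc and run under
// `compute-sanitizer --tool memcheck` to check for the spurious
// "Invalid __shared__ read" report seen with CUDA 12.x sanitizers.
#include <cub/cub.cuh>
#include <thrust/functional.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <limits>

int main()
{
  constexpr int num_items = 100;
  int16_t* d_in{};
  int16_t* d_out{};
  cudaMalloc(&d_in, num_items * sizeof(int16_t));
  cudaMalloc(&d_out, sizeof(int16_t));
  cudaMemset(d_in, 1, num_items * sizeof(int16_t));  // arbitrary fill bytes

  auto const init = std::numeric_limits<int16_t>::max();

  // First call only queries the required temporary storage size.
  void* d_temp   = nullptr;
  size_t n_bytes = 0;
  cub::DeviceReduce::Reduce(
    d_temp, n_bytes, d_in, d_out, num_items, thrust::minimum<int16_t>{}, init);
  cudaMalloc(&d_temp, n_bytes);

  // Second call launches the reduction kernel; this is the launch the
  // sanitizer flags with "Address 0x8 is misaligned".
  cub::DeviceReduce::Reduce(
    d_temp, n_bytes, d_in, d_out, num_items, thrust::minimum<int16_t>{}, init);
  cudaDeviceSynchronize();

  cudaFree(d_temp);
  cudaFree(d_out);
  cudaFree(d_in);
  return 0;
}
```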

Additional context

The error occurs as follows on various compute-sanitizer versions:

2022.3.0    ok
2022.4.0    ok
2022.4.1    fail
2023.1.1    fail
2023.2.2.0  fail
2023.3.1    fail

In general, it fails only with the compute-sanitizer versions shipped with CUDA 12.0 and above.

vyasr commented 1 month ago

As of #15259, we are skipping the problematic test to avoid CI failures, but we don't have a fix for the underlying issue yet.
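For reference, a common pattern for this kind of workaround is to skip the affected test only when it runs under the sanitizer, gated on an environment variable exported by the memcheck CI job. The sketch below is hypothetical and not necessarily what #15259 does; the variable name LIBCUDF_MEMCHECK_ENABLED is an assumption, and the real test is a typed test rather than a plain TEST:

```cpp
#include <gtest/gtest.h>
#include <cstdlib>

TEST(MinMaxReductionTest, MinMaxTypes)
{
  // Hypothetical guard: assume the memcheck CI job exports this variable.
  // Skip the affected min-reduction until nvbug 4553815 is fixed.
  if (std::getenv("LIBCUDF_MEMCHECK_ENABLED") != nullptr) {
    GTEST_SKIP() << "Skipped under compute-sanitizer (nvbug 4553815)";
  }
  // ... original test body still runs in the regular (non-memcheck) CI ...
}
```

The upside of an environment-variable guard over disabling the test outright is that the test keeps running in regular CI, so only the sanitizer job loses coverage.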