Describe the issue
Nightly builds are failing due to memcheck errors in specific gtests. The error appears to be a compute-sanitizer tool issue, which has been filed as nvbug 4553815.
This issue documents the problem while we work on possible workarounds until the bug is fixed.
The 2 errors appear as follows:
[ RUN ] NumericValueIteratorTest/1.non_null_iterator
========= Invalid __shared__ read of size 16 bytes
========= at 0x9670 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, thrust::minimum<void>>::Policy600, short *, short *, unsigned int, thrust::minimum<void>, short, short>(T2, T3, T4, T5, T6)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x8 is misaligned
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x331d50]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x14fb4]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaLaunchKernel [0x70aae]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<short*, short*, thrust::minimum<void>, short, int>(void*, unsigned long&, short*, short*, int, thrust::minimum<void>, short, CUstream_st*) [clone .isra.0] [0x2fa199]
========= in /opt/conda/envs/test/bin/gtests/libcudf/ITERATOR_TEST
[ RUN ] MinMaxReductionTest/0.MinMaxTypes
========= Invalid __shared__ read of size 16 bytes
========= at 0x4310 in void cub::CUB_200200_700_750_800_860_900_NS::DeviceReduceSingleTileKernel<cub::CUB_200200_700_750_800_860_900_NS::DeviceReducePolicy<short, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>>::Policy600, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short *, unsigned int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, short>(T2, T3, T4, T5, T6)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x8 is misaligned
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x331d50]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x14fb4]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaLaunchKernel [0x70aae]
========= in /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
========= Host Frame:cudaError cub::CUB_200200_700_750_800_860_900_NS::DeviceReduce::Reduce<thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, int>(void*, unsigned long&, thrust::transform_iterator<thrust::identity<short>, thrust::transform_iterator<cudf::detail::value_accessor<short>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, thrust::use_default, thrust::use_default>, short*, int, cudf::detail::cast_functor_fn<short, cudf::DeviceMin>, short, CUstream_st*) [clone .isra.0] [0x18950ae]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)::{lambda()#2}::operator()() const [0x18984c3]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:std::unique_ptr<cudf::scalar, std::default_delete<cudf::scalar> > cudf::reduction::simple::detail::simple_reduction<short, short, cudf::reduction::detail::op::min>(cudf::column_view const&, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1898a70]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduction::detail::min(cudf::column_view const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x187ea46]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:decltype(auto) cudf::detail::aggregation_dispatcher<cudf::reduction::detail::reduce_dispatch_functor, cudf::reduce_aggregation const&>(cudf::aggregation::Kind, cudf::reduction::detail::reduce_dispatch_functor&&, cudf::reduce_aggregation const&) [0x193431e]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduction::detail::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, std::optional<std::reference_wrapper<cudf::scalar const> >, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x1934d71]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:cudf::reduce(cudf::column_view const&, cudf::reduce_aggregation const&, cudf::data_type, rmm::mr::device_memory_resource*) [0x193583f]
========= in /opt/conda/envs/test/bin/gtests/libcudf/../../../lib/libcudf.so
========= Host Frame:std::pair<short, bool> ReductionTest<short>::reduction_test<short>(cudf::column_view const&, cudf::reduce_aggregation const&, std::optional<cudf::data_type>) [clone .constprop.0] [0x28ec47]
========= in /opt/conda/envs/test/bin/gtests/libcudf/./REDUCTIONS_TEST
If these were real errors, they should also appear when running without compute-sanitizer.
The nvbug report includes a small reproducer that shows the error without any libcudf-specific code.
Steps/Code to reproduce
Note the failure only occurs on int16 (short) integer types when doing a min-reduction through CUB.
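The condition described above can be sketched as a small standalone program. This is not the reproducer attached to the nvbug, and whether it triggers the same memcheck report is an assumption. It only mirrors the call seen in the first backtrace: `cub::DeviceReduce::Reduce` over `short` input with `thrust::minimum<void>`, on an input assumed to be small enough to go through the single-tile kernel. The `CHECK_CUDA` macro and the input values are hypothetical.

```cuda
// Hypothetical standalone sketch; NOT the reproducer attached to the nvbug.
#include <cub/device/device_reduce.cuh>
#include <thrust/functional.h>
#include <cuda_runtime.h>

#include <cstdio>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK_CUDA(call)                                         \
  do {                                                           \
    cudaError_t const status_ = (call);                          \
    if (status_ != cudaSuccess) {                                \
      std::fprintf(stderr, "%s failed: %s\n", #call,             \
                   cudaGetErrorString(status_));                 \
      std::exit(EXIT_FAILURE);                                   \
    }                                                            \
  } while (0)

int main()
{
  // Arbitrary int16 input; a small size is assumed to select the
  // single-tile reduction kernel named in the logs.
  std::vector<short> h_in{7, 3, 9, 1, 5};
  int const num_items = static_cast<int>(h_in.size());

  short* d_in  = nullptr;
  short* d_out = nullptr;
  CHECK_CUDA(cudaMalloc(&d_in, num_items * sizeof(short)));
  CHECK_CUDA(cudaMalloc(&d_out, sizeof(short)));
  CHECK_CUDA(cudaMemcpy(d_in, h_in.data(), num_items * sizeof(short),
                        cudaMemcpyHostToDevice));

  short const init = std::numeric_limits<short>::max();

  // First call only queries the required temporary storage size.
  void* d_temp           = nullptr;
  std::size_t temp_bytes = 0;
  CHECK_CUDA(cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in, d_out,
                                       num_items, thrust::minimum<>{}, init));
  CHECK_CUDA(cudaMalloc(&d_temp, temp_bytes));

  // Second call launches the min-reduction on the default stream.
  CHECK_CUDA(cub::DeviceReduce::Reduce(d_temp, temp_bytes, d_in, d_out,
                                       num_items, thrust::minimum<>{}, init));

  short h_out = 0;
  CHECK_CUDA(cudaMemcpy(&h_out, d_out, sizeof(short), cudaMemcpyDeviceToHost));
  std::printf("min = %d\n", static_cast<int>(h_out));

  CHECK_CUDA(cudaFree(d_temp));
  CHECK_CUDA(cudaFree(d_out));
  CHECK_CUDA(cudaFree(d_in));
  return 0;
}
```

Built with nvcc, the binary can be run once directly and once under `compute-sanitizer --tool memcheck` to compare the two behaviors. Swapping `short` for another integer type would be one way to check the int16-only observation.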
Additional context
The error occurs on various compute-sanitizer versions. In general, it fails only with 12.0 and above.