nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

[BUG] cuNumeric crashes with SIGABRT/SIGFPE when computing the convolution of certain shapes on GPU #1085

Open yimoj opened 9 months ago

yimoj commented 9 months ago

Software versions

Python : 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32)
Platform : Linux-5.15.0-79-generic-x86_64-with-glibc2.31
Legion : v23.11.00.dev-29-g57265c0a
Legate : 23.11.00.dev+29.g57265c0a
Cunumeric : 23.11.00.dev+20.g6fda4437
Numpy : 1.26.0
Scipy : 1.11.3
Numba : 0.58.1
CTK package : cuda-version-12.0-hffde075_2 (conda-forge)
GPU driver : 550.00
GPU devices :
  GPU 0: NVIDIA A2
  GPU 1: NVIDIA A10
  GPU 2: NVIDIA A10
  GPU 3: NVIDIA A40

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

For the two shapes below, convolve and print work fine with NumPy/SciPy:

import scipy.signal as sig
import numpy as np
arr1 = np.random.random((1024, 2, 1024))
arr2 = np.random.random((5, 1, 5))
print(sig.convolve(arr1, arr2, 'same').shape)

arr1 = np.random.random((1024, 2, 1024))
arr2 = np.random.random((128, 1, 128))
print(sig.convolve(arr1, arr2, 'same').shape)

Observed behavior

SIGABRT

import cunumeric as num
arr1 = num.random.random((1024, 2, 1024))
arr2 = num.random.random((5, 1, 5))
num.convolve(arr1, arr2, 'same')

SIGFPE

import cunumeric as num
arr1 = num.random.random((1024, 2, 1024))
arr2 = num.random.random((128, 1, 128))
num.convolve(arr1, arr2, 'same')

Example code or instructions

legate --gpus 1

And then the code above.

Stack traceback or browser console output

SIGABRT trace

legion_python: /opt/legate/cunumeric/src/cunumeric/convolution/convolve.cu:661: void cunumeric::launch_small_tile_kernel(legate::AccessorWO<T, DIM>, legate::AccessorRO<T, DIM>, legate::AccessorRO<T, DIM>, legate::Rect<DIM>&, legate::Rect<DIM>&, legate::Rect<DIM>&, const cudaDeviceProp&, const unsigned int*, const unsigned int*, legate::Point<DIM>&, unsigned int, size_t) [with VAL = double; int DIM = 3; legate::AccessorWO<T, DIM> = Legion::FieldAccessor<LEGION_WRITE_DISCARD, double, 3, long long int, Realm::AffineAccessor<double, 3, long long int>, false>; legate::AccessorRO<T, DIM> = Legion::FieldAccessor<LEGION_READ_PRIV, double, 3, long long int, Realm::AffineAccessor<double, 3, long long int>, false>; legate::Rect<DIM> = Realm::Rect<3, long long int>; legate::Point<DIM> = Realm::Point<3, long long int>; size_t = long unsigned int]: Assertion `(input_pitch * sizeof(VAL)) == smem_size' failed.
Signal 6 received by node 0, process 163 (thread 7fdc20b50000) - obtaining backtrace
Signal 6 received by process 163 (thread 7fdc20b50000) at: stack trace: 19 frames
  [0] = raise at unknown file:0 [00007fdc357cc00b]
  [1] = abort at unknown file:0 [00007fdc357ab858]
  [2] = unknown symbol at unknown file:0 [00007fdc357ab728]
  [3] = __assert_fail at unknown file:0 [00007fdc357bcfd5]
  [4] = void cunumeric::cufft_convolution<double, 3>(Legion::FieldAccessor<(legion_privilege_mode_t)268435463, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Realm::Rect<3, long long> const&, Realm::Rect<3, long long> const&, Realm::Rect<3, long long> const&) at unknown file:0 [00007fd98b7b7462]
  [5] = void cunumeric::ConvolveImpl<(cunumeric::VariantKind)2>::operator()<(legate::Type::Code)11, 3, (void*)0>(cunumeric::ConvolveArgs&) const at unknown file:0 [00007fd98b7b7965]
  [6] = void cunumeric::convolve_template<(cunumeric::VariantKind)2>(legate::TaskContext&) at unknown file:0 [00007fd98b756141]
  [7] = legate::detail::task_wrapper(void (*)(legate::TaskContext&), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fdc0c33e369]
  [8] = void legate::LegateTask<cunumeric::ConvolveTask>::legate_task_wrapper<&cunumeric::ConvolveTask::gpu_variant>(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fd989cb38bf]
  [9] = Realm::Cuda::GPUProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) at unknown file:0 [00007fdc3610a48e]
  [10] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fdc360f6982]
  [11] = Realm::KernelThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fdc360f6a09]
  [12] = Realm::Cuda::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task(Realm::Task*) at unknown file:0 [00007fdc3612f418]
  [13] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fdc360f4fd3]
  [14] = Realm::ThreadedTaskScheduler::scheduler_loop_wlock() at unknown file:0 [00007fdc360f5540]
  [15] = Realm::KernelThread::pthread_entry(void*) at unknown file:0 [00007fdc360f9615]
  [16] = start_thread at unknown file:0 [00007fdc319f7608]
  [17] = __clone at unknown file:0 [00007fdc358a8132]
  [18] = unknown symbol at unknown file:0 [ffffffffffffffff]

SIGFPE trace

Signal 8 received by node 0, process 194 (thread 7fe1d0ffd000) - obtaining backtrace
Signal 8 received by process 194 (thread 7fe1d0ffd000) at: stack trace: 15 frames
  [0] = void cunumeric::cufft_convolution<double, 3>(Legion::FieldAccessor<(legion_privilege_mode_t)268435463, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 3, long long, Realm::AffineAccessor<double, 3, long long>, false>, Realm::Rect<3, long long> const&, Realm::Rect<3, long long> const&, Realm::Rect<3, long long> const&) at unknown file:0 [00007fdf437b5af8]
  [1] = void cunumeric::ConvolveImpl<(cunumeric::VariantKind)2>::operator()<(legate::Type::Code)11, 3, (void*)0>(cunumeric::ConvolveArgs&) const at unknown file:0 [00007fdf437b7965]
  [2] = void cunumeric::convolve_template<(cunumeric::VariantKind)2>(legate::TaskContext&) at unknown file:0 [00007fdf43756141]
  [3] = legate::detail::task_wrapper(void (*)(legate::TaskContext&), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fe1bc5df369]
  [4] = void legate::LegateTask<cunumeric::ConvolveTask>::legate_task_wrapper<&cunumeric::ConvolveTask::gpu_variant>(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fdf41cb38bf]
  [5] = Realm::Cuda::GPUProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) at unknown file:0 [00007fe1ec73148e]
  [6] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fe1ec71d982]
  [7] = Realm::KernelThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fe1ec71da09]
  [8] = Realm::Cuda::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task(Realm::Task*) at unknown file:0 [00007fe1ec756418]
  [9] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fe1ec71bfd3]
  [10] = Realm::ThreadedTaskScheduler::scheduler_loop_wlock() at unknown file:0 [00007fe1ec71c540]
  [11] = Realm::KernelThread::pthread_entry(void*) at unknown file:0 [00007fe1ec720615]
  [12] = start_thread at unknown file:0 [00007fe1e801e608]
  [13] = __clone at unknown file:0 [00007fe1ebecf132]
  [14] = unknown symbol at unknown file:0 [ffffffffffffffff]

manopapad commented 9 months ago

@lightsighter It looks like the issue occurs when we pass centers[d] == 0 and tile[d] == 0 (for some d) to the launch_small_tile_kernel function https://github.com/nv-legate/cunumeric/blob/branch-24.01/src/cunumeric/convolution/convolve.cu#L608, which happens when one dimension of the filter array is 1. Any quick fix we could try?
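
To make the condition above concrete, here is a minimal Python sketch of the arithmetic. It assumes that each center is the filter extent divided by 2 with integer division (an assumption about convolve.cu, not something shown in this thread) and uses the pre-patch formula tile[d] = 2 * centers[d] quoted in the diff further down. The helper name centers_and_tiles is hypothetical.

```python
def centers_and_tiles(filter_shape):
    """Model the per-dimension center and initial tile size for a filter.

    Assumption: centers[d] = extents[d] // 2 (integer division).
    The tile formula is the pre-patch line: tile[d] = 2 * centers[d].
    """
    centers = [extent // 2 for extent in filter_shape]
    tiles = [2 * center for center in centers]
    return centers, tiles


# The two filter shapes from the report; both have a unit middle dimension.
for shape in [(5, 1, 5), (128, 1, 128)]:
    centers, tiles = centers_and_tiles(shape)
    zero_dims = [d for d, tile in enumerate(tiles) if tile == 0]
    print(f"filter {shape}: centers={centers} tiles={tiles} zero dims={zero_dims}")
```

Under that assumption, any filter dimension of extent 1 yields a center of 0 and therefore a tile extent of 0. A zero tile extent is consistent with both the failed assertion and the SIGFPE, although the thread does not spell out the exact mechanism for either.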

lightsighter commented 6 months ago

I can't believe you guys are still using this code. 🙄

@yimoj try again with this patch:

diff --git a/src/cunumeric/convolution/convolve.cu b/src/cunumeric/convolution/convolve.cu
index 7d185d6d..33add3cc 100644
--- a/src/cunumeric/convolution/convolve.cu
+++ b/src/cunumeric/convolution/convolve.cu
@@ -803,7 +803,8 @@ __host__ void direct_convolution(AccessorWO<VAL, DIM> out,
   for (int d = DIM - 1; d >= 0; d--) {
     // Make sure that each tile is at least double the size of the filter
     // so that we can get some savings in bandwidth needed
-    tile[d] = 2 * centers[d];
+    tile[d] = 2 * extents[d];
+    assert(tile[d] > 0);
     if (d == (DIM - 1)) {
       // In order to maximize bandwidth, we want to make sure we're loading at
       // least 128B of contiguous memory along the last axis (row-major) of input
@@ -1344,7 +1345,8 @@ __host__ static inline void cufft_convolution(AccessorWO<VAL, DIM> out,
   for (int d = DIM - 1; d >= 0; d--) {
     // Make sure that each tile is at least double the size of the filter
     // so that we can get some savings in bandwidth needed
-    tile[d] = 2 * centers[d];
+    tile[d] = 2 * extents[d];
+    assert(tile[d] > 0);
     if (d == (DIM - 1)) {
       // In order to maximize bandwidth, we want to make sure we're loading at
       // least 128B of contiguous memory along the last axis (row-major) of input

I'm pretty sure it should always have been like this; it was just a typo.
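
As a rough illustration of what the patch changes, the sketch below compares the initial tile size before and after it for the filter shapes in this report. It models only the first assignment in the loop, not the later adjustments the kernel makes, and it assumes centers[d] is extents[d] // 2, which is not shown in the thread. The function names are hypothetical.

```python
def tile_before_patch(extents):
    # Pre-patch line: tile[d] = 2 * centers[d]
    # Assumption: centers[d] = extents[d] // 2 (integer division).
    return [2 * (extent // 2) for extent in extents]


def tile_after_patch(extents):
    # Patched line: tile[d] = 2 * extents[d], followed by assert(tile[d] > 0)
    tiles = [2 * extent for extent in extents]
    assert all(tile > 0 for tile in tiles), "tile extents must be positive"
    return tiles


for extents in [(5, 1, 5), (128, 1, 128)]:
    print(extents, "before:", tile_before_patch(extents),
          "after:", tile_after_patch(extents))
```

With the patched formula, a filter dimension of extent 1 gives a tile extent of 2 rather than 0, so the new assertion holds for any non-empty filter.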

SIGABRT trace

I'm unable to reproduce this failure mode with the latest internal core and cunumeric.