ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

BF_STATUS_INVALID_ARGUMENT in TestMap and TransposeTest suite #164

Closed: league closed this 2 years ago

league commented 2 years ago

I'm going to try to summarize some test-suite issues I experienced, starting with TestMap because its symptoms seem the simplest to summarize. These runs were on a VM from qblocks.cloud with a GeForce RTX 3090. The pre-installed nvcc was CUDA 10.0, although the driver says it supports CUDA 11.2. From nvcc --version and --help:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
Supported architectures: 30 32 35 37 50 52 53 60 61 62 70 72 75

and from nvidia-smi:

| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|   0  GeForce RTX 3090    On   | 00000000:17:00.0 Off |                  N/A |

I ran the tests with several different architecture configurations, some of them with multiple trials in case of transient issues.

For the tests under discussion here, 14 methods in test_map.TestMap and 4 methods in test_transpose.TransposeTest, the results were consistent across all 5 trials.

All 14 tests in this class raised BF_STATUS_INVALID_ARGUMENT from libbifrost.py:122, which is in the _check function that fetches the status string. The caller was map.py:140, inside the map function, processing the result of bfMap. Here is a representative stack trace:

======================================================================
ERROR: test_broadcast (test_map.TestMap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/qblocks/bifrost/test/test_map.py", line 97, in test_broadcast
    bf.map("c = a*b", data={'a': a, 'b': b, 'c': c})
  File "/usr/local/lib/python3.6/dist-packages/bifrost/map.py", line 144, in map
    _array(block_shape), _array(block_axes)))
  File "/usr/local/lib/python3.6/dist-packages/bifrost/libbifrost.py", line 122, in _check
    raise RuntimeError(status_str)
RuntimeError: b'BF_STATUS_INVALID_ARGUMENT'
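
For context, the failure path is just bifrost's status-check wrapper turning a non-zero C status into a Python exception. A rough sketch of the pattern (not the actual libbifrost.py source; the status values here are illustrative):

BF_STATUS_SUCCESS = 0

def _status_string(status):
    # Hypothetical lookup; the real wrapper asks libbifrost itself for the name.
    names = {0: b'BF_STATUS_SUCCESS', 10: b'BF_STATUS_INVALID_ARGUMENT'}
    return names.get(status, b'BF_STATUS_UNKNOWN')

def _check(status):
    # Raise on any non-success status returned by a libbifrost C call.
    if status != BF_STATUS_SUCCESS:
        raise RuntimeError(_status_string(status))
    return status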

The TransposeTest class produces similar errors, from _check following bfTranspose. Here is one of those traces:

======================================================================
ERROR: test_2byte (test_transpose.TransposeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/qblocks/bifrost/test/test_transpose.py", line 58, in test_2byte
    self.run_simple_test_shmoo(np.uint16)
  File "/home/qblocks/bifrost/test/test_transpose.py", line 54, in run_simple_test_shmoo
    self.run_simple_test(perm, dtype, shape)
  File "/home/qblocks/bifrost/test/test_transpose.py", line 45, in run_simple_test
    bf.transpose.transpose(oarray, iarray, axes)
  File "/usr/local/lib/python3.6/dist-packages/bifrost/transpose.py", line 46, in transpose
    _check(_bf.bfTranspose(src_bf, dst_bf, axes_array))
  File "/usr/local/lib/python3.6/dist-packages/bifrost/libbifrost.py", line 122, in _check
    raise RuntimeError(status_str)
RuntimeError: b'BF_STATUS_INVALID_ARGUMENT'

Again, consistent across all 5 trials and across the TransposeTest methods... except for test_16byte, which instead fails with KeyError: 128 at DataType.py:189. On line 86 of that file I see if BF_FLOAT128_ENABLED, which I suppose it is not... so presumably test_16byte should be skipped.
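
Something along these lines would do it (a sketch only; whether BF_FLOAT128_ENABLED is importable like this, and from where, is an assumption on my part):

import unittest

# Sketch: skip the 16-byte transpose test when 128-bit float support was not
# compiled in. The import path for BF_FLOAT128_ENABLED is a guess based on
# the DataType.py reference above; adjust to wherever the flag actually lives.
try:
    from bifrost.DataType import BF_FLOAT128_ENABLED
except ImportError:
    BF_FLOAT128_ENABLED = False

class TransposeTestSketch(unittest.TestCase):
    @unittest.skipUnless(BF_FLOAT128_ENABLED, "bifrost built without float128 support")
    def test_16byte(self):
        pass  # the real test would call run_simple_test_shmoo with a 16-byte dtype

if __name__ == "__main__":
    unittest.main()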

I'll also be able to summarize, and perhaps troubleshoot, other errors from these VM runs affecting the TestLinAlg, PipelineTest, TestManagedMap, and TestFFT classes; the overall results were attached as a screenshot.

jaycedowell commented 2 years ago

Can you compile/run with debugging enabled? That should provide more verbose output from map when it fails.

league commented 2 years ago

I will try --enable-debug with all of these eventually, but on short notice I could partially reproduce the issue with TransposeTest.test_XXbyte on Colab. The tests pass with the auto-configured GPU arch of just "37" (although the CUDA 11 compiler complains that 35, 37, and 50 are deprecated).

But when I did --with-gpu-arch="70 72 75", as I had it on qblocks, they produced this:

.CUDA ERROR: no kernel image is available for execution on the device
transpose.cu:225 Condition failed: error == cudaSuccess
transpose.cu:225 error 99: BF_STATUS_INTERNAL_ERROR
E
======================================================================
ERROR: test_4byte (test_transpose.TransposeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/bifrost/test/test_transpose.py", line 60, in test_4byte
    self.run_simple_test_shmoo(np.uint32)
  File "/root/bifrost/test/test_transpose.py", line 54, in run_simple_test_shmoo
    self.run_simple_test(perm, dtype, shape)
  File "/root/bifrost/test/test_transpose.py", line 45, in run_simple_test
    bf.transpose.transpose(oarray, iarray, axes)
  File "/usr/local/lib/python3.7/dist-packages/bifrost/transpose.py", line 46, in transpose
    _check(_bf.bfTranspose(src_bf, dst_bf, axes_array))
  File "/usr/local/lib/python3.7/dist-packages/bifrost/libbifrost.py", line 122, in _check
    raise RuntimeError(status_str)
RuntimeError: b'BF_STATUS_INTERNAL_ERROR'

----------------------------------------------------------------------
Ran 2 tests in 0.883s

FAILED (errors=1)

The BF_STATUS_INTERNAL_ERROR comes about because of a debug-only BF_ASSERT in transpose.cu. I'm confident that without debugging it would propagate as BF_STATUS_INVALID_ARGUMENT.

I need to find some documentation on how these arch codes relate to each other and what they're meant to do. My mental model of what they mean has, let's say, severely limited explanatory power. :)

jaycedowell commented 2 years ago

I'll give qblocks a shot and see what I can find.

jaycedowell commented 2 years ago

I'm on an RTX 3090 system with CUDA 11.0. configure auto-detects 80 as the arch, and I also get lots of errors when I run the test_map suite. With --enable-debug I see things like:

  1 #include "Complex.hpp"
  2 #include "Vector.hpp"
  3 #include "ArrayIndexer.cuh"
  4 #include "ShapeIndexer.cuh"
  5 extern "C"
  6 __global__
  7 void map_kernel(float* a_ptr,
  8 float* b_ptr,
  9 float* c_ptr) {
 10   enum { NDIM = 2 };
 11   typedef StaticIndexArray<int,1,1> _Shape;
 12   typedef StaticShapeIndexer<_Shape> _ShapeIndexer;
 13   typedef StaticIndexArray<int,89> _Shape_a;
 14   typedef StaticIndexArray<int,4> _Strides_a;
 15   typedef StaticArrayIndexer<float,_Shape_a,_Strides_a> _ArrayIndexer_a;
 16   typedef StaticIndexArray<int,89,1> _Shape_b;
 17   typedef StaticIndexArray<int,4,0> _Strides_b;
 18   typedef StaticArrayIndexer<float,_Shape_b,_Strides_b> _ArrayIndexer_b;
 19   typedef StaticIndexArray<int,89,89> _Shape_c;
 20   typedef StaticIndexArray<int,356,4> _Strides_c;
 21   typedef StaticArrayIndexer<float,_Shape_c,_Strides_c> _ArrayIndexer_c;
 22   const int _shape[NDIM] = {89, 89}; (void)_shape[0];
 23   int _x0 = threadIdx.x + blockIdx.x*blockDim.x;
 24   int _y0 = threadIdx.y + blockIdx.y*blockDim.y;
 25   int _z0 = blockIdx.z;
 26   for( int _z=_z0; _z<_ShapeIndexer::SIZE; _z+=gridDim.z ) {
 27   for( int _y=_y0; _y<89; _y+=blockDim.y*gridDim.y ) {
 28   for( int _x=_x0; _x<89; _x+=blockDim.x*gridDim.x ) {
 29     auto _composite_index  = _ShapeIndexer::lift(_z);
 30     _composite_index[1] = _x;
 31     _composite_index[0] = _y;
 32     auto const& _  = _composite_index;
 33     _ArrayIndexer_a a(a_ptr, _);
 34     typedef float a_type;
 35     _ArrayIndexer_b b(b_ptr, _);
 36     typedef float b_type;
 37     _ArrayIndexer_c c(c_ptr, _);
 38     typedef float c_type;
 39     c = a*b;
 40   }
 41   }
 42   }
 43 }
---------------------------------------------------
--- JIT compile log for program bfMap ---
---------------------------------------------------
nvrtc: error: invalid value for --gpu-architecture (-arch)

---------------------------------------------------
map.cpp:487 error 10: BF_STATUS_INVALID_ARGUMENT

which is strange. I don't know why nvrtc would throw an "invalid value for --gpu-architecture" error, but I bet it is something down here: https://github.com/ledatelescope/bifrost/blob/autoconf/src/map.cpp#L332

jaycedowell commented 2 years ago

Ah, when I dig into get_cuda_device_cc() it's returning 86, but nvcc only supports up to 80.
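
The obvious fix is to clamp the device's compute capability to the newest arch the toolkit can actually target before handing it to the JIT. A minimal sketch of that idea (not necessarily how it ends up implemented):

def pick_jit_arch(device_cc, supported_archs):
    # device_cc: compute capability reported by the device, e.g. 86 for an RTX 3090.
    # supported_archs: archs the installed nvcc/nvrtc can target, e.g. up to 80 on CUDA 11.0.
    usable = [arch for arch in supported_archs if arch <= device_cc]
    return max(usable) if usable else min(supported_archs)

print(pick_jit_arch(86, [35, 50, 60, 70, 75, 80]))  # -> 80, avoiding the invalid -arch error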

jaycedowell commented 2 years ago

Ok, 1e47784 fixes test_map and test_transpose for me.

Although this issue doesn't explicitly list it, I'll also look at test_fft.

jaycedowell commented 2 years ago

test_fft gives me four failures, all are related to complex-to-real transforms:

======================================================================
FAIL: test_c2r_1D (test_fft.TestFFT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/qblocks/bifrost/test/test_fft.py", line 200, in test_c2r_1D
    self.run_test_c2r(self.shape1D, [0])
  File "/home/qblocks/bifrost/test/test_fft.py", line 143, in run_test_c2r
    self.run_test_c2r_impl(shape, axes)
  File "/home/qblocks/bifrost/test/test_fft.py", line 136, in run_test_c2r_impl
    compare(odata.copy('system'), known_result)
  File "/home/qblocks/bifrost/test/test_fft.py", line 51, in compare
    np.testing.assert_allclose(result, gold, rtol=RTOL, atol=MTOL * absmean)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.1, atol=0.00462357

Mismatched elements: 1 / 16777216 (5.96e-06%)
Max absolute difference: 0.01639435
Max relative difference: 1.93911746
 x: ndarray([  374.53613, -5428.0957 ,  -363.20837, ..., -3117.7966 ,
         -3706.6025 ,  2153.2935 ], dtype=float32)
 y: array([  374.536103, -5428.096355,  -363.208565, ..., -3117.800304,
       -3706.602722,  2153.296084])

======================================================================
FAIL: test_c2r_2D (test_fft.TestFFT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/qblocks/bifrost/test/test_fft.py", line 202, in test_c2r_2D
    self.run_test_c2r(self.shape2D, [0, 1])
  File "/home/qblocks/bifrost/test/test_fft.py", line 144, in run_test_c2r
    self.run_test_c2r_impl(shape, axes, fftshift=True)
  File "/home/qblocks/bifrost/test/test_fft.py", line 136, in run_test_c2r_impl
    compare(odata.copy('system'), known_result)
  File "/home/qblocks/bifrost/test/test_fft.py", line 51, in compare
    np.testing.assert_allclose(result, gold, rtol=RTOL, atol=MTOL * absmean)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.1, atol=0.00231149

Mismatched elements: 4186048 / 4194304 (99.8%)
Max absolute difference: 492620.22392237
Max relative difference: 39550830.05759069
 x: ndarray([[ 138665.97 , -126715.72 ,   51470.453, ...,   55498.96 ,
           -86675.89 ,  -28756.328],
         [ 138883.81 ,  -90019.63 ,   17722.49 , ...,   15536.704,...
 y: array([[ 2017.276442, -1581.97619 ,  4346.270132, ..., -3447.755696,
        -1915.185442,  3234.148534],
       [-5782.420535,  3139.004095,  2111.837459, ...,  -561.818599,...

======================================================================
FAIL: test_c2r_2D_in_4D_dims23 (test_fft.TestFFT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/qblocks/bifrost/test/test_fft.py", line 239, in test_c2r_2D_in_4D_dims23
    self.run_test_c2r(self.shape4D, [2, 3])
  File "/home/qblocks/bifrost/test/test_fft.py", line 144, in run_test_c2r
    self.run_test_c2r_impl(shape, axes, fftshift=True)
  File "/home/qblocks/bifrost/test/test_fft.py", line 136, in run_test_c2r_impl
    compare(odata.copy('system'), known_result)
  File "/home/qblocks/bifrost/test/test_fft.py", line 51, in compare
    np.testing.assert_allclose(result, gold, rtol=RTOL, atol=MTOL * absmean)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.1, atol=3.55664e-05

Mismatched elements: 898 / 1048576 (0.0856%)
Max absolute difference: 80.47305827
Max relative difference: 285.56652115
 x: ndarray([[[[ 2.902072e+01,  1.518092e+01,  2.862135e+01, ...,
            -7.182214e+01, -3.217509e+01,  3.074002e+01],
           [ 7.172186e+01,  2.634967e+01, -3.403542e+01, ...,...
 y: array([[[[ 2.902073e+01,  1.518093e+01,  2.862135e+01, ...,
          -7.182214e+01, -3.217510e+01,  3.074002e+01],
         [ 7.172186e+01,  2.634967e+01, -3.403541e+01, ...,...

======================================================================
FAIL: test_c2r_3D (test_fft.TestFFT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/qblocks/bifrost/test/test_fft.py", line 204, in test_c2r_3D
    self.run_test_c2r(self.shape3D, [0, 1, 2])
  File "/home/qblocks/bifrost/test/test_fft.py", line 144, in run_test_c2r
    self.run_test_c2r_impl(shape, axes, fftshift=True)
  File "/home/qblocks/bifrost/test/test_fft.py", line 136, in run_test_c2r_impl
    compare(odata.copy('system'), known_result)
  File "/home/qblocks/bifrost/test/test_fft.py", line 51, in compare
    np.testing.assert_allclose(result, gold, rtol=RTOL, atol=MTOL * absmean)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.1, atol=0.00163087

Mismatched elements: 2077666 / 2097152 (99.1%)
Max absolute difference: 77974.12088231
Max relative difference: 6462494.94533914
 x: ndarray([[[  9408.586  ,  -2370.62   ,    346.86743, ...,   5139.675  ,
            -4945.5244 ,   6813.3193 ],
          [ -3210.2842 ,   2198.6216 ,   6312.9434 , ...,  19694.162  ,...
 y: array([[[-4155.18988 , -1755.902762,  -279.50043 , ..., -2127.643956,
           772.828903,   563.93434 ],
        [ 1789.305557,  -574.083396,   525.70141 , ...,  1627.573228,...

----------------------------------------------------------------------
league commented 2 years ago

Yep, there were other failures whose logs I hadn't gotten around to analyzing more carefully or re-running with debug. I also had issues in test_linalg and test_pipeline.

[attached screenshot: 2022-03-14_13-09]

league commented 2 years ago

BTW, it might be an idea to store the value(s) used for --with-gpu-archs somewhere that's accessible to python -m bifrost.version. Or even to capture that for telemetry.

jaycedowell commented 2 years ago

I think that is already done since everything goes into config.h. That should be picked up by ctypesgen and make its way into libbifrost_generated.py.
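
A quick way to check from Python would be to poke at the generated module (assuming it installs as bifrost.libbifrost_generated; exactly which symbols appear depends on what configure writes into config.h, so this is only a sketch):

import bifrost.libbifrost_generated as _gen

# Print anything arch- or CUDA-related that ctypesgen carried over from config.h.
for name in sorted(dir(_gen)):
    if "ARCH" in name.upper() or "CUDA" in name.upper():
        print(name, "=", getattr(_gen, name))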

To answer the other question about putting the archs in telemetry: it would be interesting to know, but I'm not sure how critical it is for planning.

jaycedowell commented 2 years ago

Back to tests. test_linalg ran without errors for me. I get one error on test_pipeline related to FDMT. The error seems familiar, but I thought I had already fixed something related.

jaycedowell commented 2 years ago

5c766fc takes care of the test_pipeline error.

jaycedowell commented 2 years ago

Everything passes now except for test_fft. Drilling down into the test suite, the c2r tests have two parts: one with fftshift=False and another with fftshift=True. For fftshift=False I get only one failure:

======================================================================
FAIL: test_c2r_1D (test_fft.TestFFT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/qblocks/bifrost/test/test_fft.py", line 201, in test_c2r_1D
    self.run_test_c2r(self.shape1D, [0])
  File "/home/qblocks/bifrost/test/test_fft.py", line 144, in run_test_c2r
    self.run_test_c2r_impl(shape, axes)
  File "/home/qblocks/bifrost/test/test_fft.py", line 137, in run_test_c2r_impl
    compare(odata.copy('system'), known_result)
  File "/home/qblocks/bifrost/test/test_fft.py", line 52, in compare
    np.testing.assert_allclose(result, gold, rtol=RTOL, atol=MTOL * absmean)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1528, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.1, atol=0.00462357

Mismatched elements: 1 / 16777216 (5.96e-06%)
Max absolute difference: 0.01639435
Max relative difference: 1.93911746
 x: ndarray([  374.53613, -5428.0957 ,  -363.20837, ..., -3117.7966 ,
         -3706.6025 ,  2153.2935 ], dtype=float32)
 y: array([  374.536103, -5428.096355,  -363.208565, ..., -3117.800304,
       -3706.602722,  2153.296084])

which is "only" one element off. All of the other errors in the test suite come from the fftshift=True tests. This could be a cuFFT bug.

jaycedowell commented 2 years ago

If I switch my qblocks instance to CUDA 10.0, the fftshift=True errors go away and I am left with only one c2r failure (the single-element difference). I think this is related to https://github.com/arrayfire/arrayfire/issues/2518, which has similar symptoms. The input overwriting for c2r could be interacting with the FFT shift.

jaycedowell commented 2 years ago

Relatedly, under CUDA 10.0 with an RTX 3090 the configure arch auto-detection fails, since nvcc doesn't support 80+. We should probably try to catch this case as well.

jaycedowell commented 2 years ago

0bf33ef deals with the above comment. @league I think that is everything now if you want to try testing again.

league commented 2 years ago

Wow, nice work. Should be able to try tomorrow afternoon.

jaycedowell commented 2 years ago

We still need to come up with some kind of plan for what to do about CUDA 10.1-11.0+(?). Maybe skip those tests? Maybe issue a warning at configure time to warn people about the expected behavior? I'm not sure.

jaycedowell commented 2 years ago

083c370 adds a "cuFFT complex-to-real transform" warning if using CUDA 10.1, 10.2, or 11.0.

league commented 2 years ago

Right now I'm testing on Colab first, and everything looks good with test_map and test_transpose with these latest commits. I'm still seeing some issues with test_linalg, though, and maybe test_pipeline (not on FDMT). I'll take a minute to collect and verify those results. (New issue, maybe?)

For the cuFFT c2r issue, can we presume it starts working again with CUDA 11.1+? I think the qblocks VMs I've had are either 10.1 or 11.0 (maybe I can apt-pin some other version), but the Colab instance I have now is 11.1. If it's specific versions that are known not to work, then I think a precisely-targeted @unittest.skipIf is appropriate, to go along with that configure warning; see the sketch below.
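
Something like this, assuming the test can see the CUDA version however it ends up being exposed, and with the affected releases matching the configure warning above (10.1, 10.2, 11.0):

import unittest

CUDA_VERSION = (11, 0)  # placeholder; in practice this would come from bifrost's build config

# cuFFT c2r + fftshift is the combination reported as unreliable on these releases.
BROKEN_C2R_SHIFT = CUDA_VERSION in ((10, 1), (10, 2), (11, 0))

class TestFFTSketch(unittest.TestCase):
    @unittest.skipIf(BROKEN_C2R_SHIFT,
                     "cuFFT c2r with fftshift is unreliable on this CUDA release")
    def test_c2r_2D_shifted(self):
        pass  # the real test would run its c2r check with fftshift=True

if __name__ == "__main__":
    unittest.main()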

jaycedowell commented 2 years ago

Hard to say if 11.1+ is ok. I'm looking through the latest cuFFT release notes and I see the input overwriting listed as a known issue in 11.1 with no apparent resolution in later releases.

league commented 2 years ago

I'd say almost everything in this thread has been addressed by 1e47784f4a83ceb1dfe3696e1773cb934ccf9a5f, 5c766fcd91af23190252e5b4acb112ce17a2d3b4, 0bf33ef5cf6d314b7ecd770ed9c7c1f4f21e46a1, 083c3706f4df71c2021bce14cda3a6572f14ed58, and c69897399de1546a417b2002ec341ec071b2f1d5.

The test suite as a whole is much more reliable, on qblocks at least. (I'm still seeing other failures on Colab, which I'll post separately.) I just ran it on qblocks with a Tesla V100, CUDA 11.0, and auto-detected arch 70. The only outstanding issues are the cuFFT c2r failures.

I think marking those as skipped is the right thing. We'd like a test suite that passes as cleanly as possible, on as many configurations as possible… even if there need to be some "asterisks" in the form of skipped tests. For clarity, let's try to ensure that exactly the platforms that configure warns about are the ones where tests are skipped. Is it worth propagating something like @CUDA_VERSION@ into config.h, then displaying it with bifrost.version --config and also using it in unittest.skipIf? Or even setting @CUDA_FFT_C2R_IS_DODGY@ if we don't want to duplicate the list of failing versions between configure.ac and test_fft.py? :smile:

jaycedowell commented 2 years ago

My preference would be to have something like a @CUDA_VERSION@, since this isn't the first version-dependent problem to hit the library. At least in CUDA, nvcc defines a few things that we could just use to populate a BF_CUDA_VERSION in config.h. For the configure script I would probably stick to what is in cuda.m4 for now, unless there is an easy way to compile/execute something that prints the version. Maybe that could even be rolled into the existing "does CUDA work" check.
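
If executing something is acceptable, parsing the output of nvcc --version is one easy option (a sketch; the real plumbing would presumably bake the result into config.h as BF_CUDA_VERSION instead):

import re
import subprocess

def nvcc_release():
    # Parse e.g. "Cuda compilation tools, release 10.0, V10.0.130" from `nvcc --version`.
    out = subprocess.check_output(["nvcc", "--version"]).decode()
    match = re.search(r"release (\d+)\.(\d+)", out)
    return (int(match.group(1)), int(match.group(2))) if match else None

print(nvcc_release())  # e.g. (10, 0) on the qblocks VM described at the top of this issue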

jaycedowell commented 2 years ago

ed73627 should skip the fftshift=True portion of the c2r tests.

jaycedowell commented 2 years ago

Here's a confusing data point: under CUDA 10.1 with a GTX 980, the c2r + shift tests run without error. So maybe what works and doesn't work in the FFT module depends on a combination of both the CUDA version and the architectures being targeted. Maybe the expected-failure approach is the right one to take, since you also learn when things work where you don't expect them to; a sketch of how that reads follows.
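
For reference, this is how that reads with plain unittest.expectedFailure (a toy example, not the bifrost tests themselves): failing decorated tests show up under "expected failures", and passing ones under "unexpected successes".

import unittest

class ExpectedFailureSketch(unittest.TestCase):
    @unittest.expectedFailure
    def test_known_broken(self):
        self.assertEqual(1 + 1, 3)  # fails -> counted under "expected failures"

    @unittest.expectedFailure
    def test_actually_fine(self):
        self.assertEqual(1 + 1, 2)  # passes -> counted under "unexpected successes"

if __name__ == "__main__":
    unittest.main()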

league commented 2 years ago

I'm ready to close this. I didn't realize the "expected failure" and "unexpected success" counts would make it all the way into the summary; this is really great. Right now I'm seeing:

Ran 251 tests in 378.071s
FAILED (failures=11, errors=2, skipped=6, expected failures=1, unexpected successes=2)

Those failures are all in RomeinTest, and they seem different from anything described here.