ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

`test_romein` failures #172

Closed by jaycedowell 2 years ago

jaycedowell commented 2 years ago

On the test machine I'm getting a variety of failures on the test_romein suite. I haven't run into these before so I'm wondering if it has something to do with the GPU/version of CUDA that we are using on the test machine (RTX A4000; arch. 86; CUDA 11.2)?

======================================================================
FAIL: test_ntime2_nchan2_npol2_gridsize128_illumsize3_datasize256 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 275, in test_ntime2_nchan2_npol2_gridsize128_illumsize3_datasize256
    self.run_test(grid_size=128, illum_size=3, data_size=256, ntime=2, npol=2, nchan=2,polmajor=False)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 17030 / 131072 (13%)
Max absolute difference: 5.9955316
Max relative difference: 15.672092
 x: ndarray([[[[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
            [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
            [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],...
 y: array([[[[[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
          [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
          [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],...

======================================================================
FAIL: test_ntime2_nchan2_npol2_gridsize64_illumsize7_datasize256 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 271, in test_ntime2_nchan2_npol2_gridsize64_illumsize7_datasize256
    self.run_test(grid_size=64, illum_size=7, data_size=256, ntime=2, npol=2, nchan=2,polmajor=False)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 80 / 32768 (70.4%)
Max absolute difference: 11.11163
Max relative difference: 109.149864
 x: ndarray([[[[[0.      +0.j      , 0.      +0.j      ,
             0.      +0.j      , ..., 0.      +0.j      ,
             0.      +0.j      , 0.      +0.j      ],...
 y: array([[[[[0.      +0.j      , 0.      +0.j      , 0.      +0.j      ,
           ..., 0.      +0.j      , 0.      +0.j      ,
           0.      +0.j      ],...

======================================================================
FAIL: test_ntime8_nchan2_npol3_gridsize64_illumsize3_datasize256 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 267, in test_ntime8_nchan2_npol3_gridsize64_illumsize3_datasize256
    self.run_test(grid_size=64, illum_size=3, data_size=256, ntime=8, npol=3, nchan=2, polmajor=False)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 98497 / 196608 (50.1%)
Max absolute difference: 8.043615
Max relative difference: 268.03442
 x: ndarray([[[[[ 0.      +0.j      ,  0.      +0.j      ,
              0.      +0.j      , ...,  0.      +0.j      ,
              0.      +0.j      ,  0.      +0.j      ],...
 y: array([[[[[ 0.      +0.j      ,  0.      +0.j      ,
            0.      +0.j      , ...,  0.      +0.j      ,
            0.      +0.j      ,  0.      +0.j      ],...

======================================================================
FAIL: test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 279, in test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256
    self.run_test(grid_size=32, illum_size=3, data_size=256, ntime=32, npol=2, nchan=2,polmajor=False)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 91810 / 131072 (70%)
Max absolute difference: 12.514372
Max relative difference: 199.0844
 x: ndarray([[[[[ 0.      +0.j      ,  0.      +0.j      ,
              0.      +0.j      , ...,  0.      +0.j      ,
              0.      +0.j      ,  0.      +0.j      ],...
 y: array([[[[[ 0.      +0.j      ,  0.      +0.j      ,
            0.      +0.j      , ...,  0.      +0.j      ,
            0.      +0.j      ,  0.      +0.j      ],...

======================================================================
FAIL: test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci16 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 287, in test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci16
    self.run_test(grid_size=32, illum_size=3, data_size=256, ntime=32, npol=2, nchan=2,polmajor=False, dtype='ci16')
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 85886 / 131072 (65.5%)
Max absolute difference: 11.401754
Max relative difference: 8.
 x: ndarray([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,
              0.+0.j],
            [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,...
 y: array([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],...

======================================================================
FAIL: test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci32 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 289, in test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci32
    self.run_test(grid_size=32, illum_size=3, data_size=256, ntime=32, npol=2, nchan=2,polmajor=False, dtype='ci32')
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 85886 / 131072 (65.5%)
Max absolute difference: 11.401754
Max relative difference: 8.
 x: ndarray([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,
              0.+0.j],
            [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,...
 y: array([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],...

======================================================================
FAIL: test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci4 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 283, in test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci4
    self.run_test(grid_size=32, illum_size=3, data_size=256, ntime=32, npol=2, nchan=2,polmajor=False, dtype='ci4')
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 72834 / 131072 (55.6%)
Max absolute difference: 9.899495
Max relative difference: 6.4031243
 x: ndarray([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,
              0.+0.j],
            [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,...
 y: array([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],...

======================================================================
FAIL: test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci8 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 285, in test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_ci8
    self.run_test(grid_size=32, illum_size=3, data_size=256, ntime=32, npol=2, nchan=2,polmajor=False, dtype='ci8')
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 85886 / 131072 (65.5%)
Max absolute difference: 11.401754
Max relative difference: 8.
 x: ndarray([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,
              0.+0.j],
            [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,...
 y: array([[[[[ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],
          [ 0.+0.j,  0.+0.j,  0.+0.j, ...,  0.+0.j,  0.+0.j,  0.+0.j],...

======================================================================
FAIL: test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_out_cf64 (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 281, in test_ntime_32_nchan2_npol2_gridsize32_illumsize3_datasize256_out_cf64
    self.run_test(grid_size=32, illum_size=3, data_size=256, ntime=32, npol=2, nchan=2,polmajor=False,otype=numpy.complex128)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 167, in run_test
    numpy.testing.assert_allclose(grid,
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-10, atol=1e-11

Mismatched elements: 91810 / 131072 (70%)
Max absolute difference: 12.51437095
Max relative difference: 199.08439399
 x: ndarray([[[[[ 0.      +0.j      ,  0.      +0.j      ,
              0.      +0.j      , ...,  0.      +0.j      ,
              0.      +0.j      ,  0.      +0.j      ],...
 y: array([[[[[ 0.      +0.j      ,  0.      +0.j      ,
            0.      +0.j      , ...,  0.      +0.j      ,
            0.      +0.j      ,  0.      +0.j      ],...

======================================================================
FAIL: test_set_kernels (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 293, in test_set_kernels
    self.run_kernel_test(grid_size=64, illum_size=3, data_size=256, ntime=8, npol=3, nchan=2,polmajor=False)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 216, in run_kernel_test
    numpy.testing.assert_allclose(grid, gridnaive, 1e-4, 1e-5)
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 98497 / 196608 (50.1%)
Max absolute difference: 16.08723
Max relative difference: 268.03442
 x: ndarray([[[[[ 0.      +0.j      ,  0.      +0.j      ,
              0.      +0.j      , ...,  0.      +0.j      ,
              0.      +0.j      ,  0.      +0.j      ],...
 y: array([[[[[ 0.      +0.j      ,  0.      +0.j      ,
            0.      +0.j      , ...,  0.      +0.j      ,
            0.      +0.j      ,  0.      +0.j      ],...

======================================================================
FAIL: test_set_positions (test_romein.RomeinTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 297, in test_set_positions
    self.run_positions_test(grid_size=64, illum_size=3, data_size=256, ntime=8, npol=3, nchan=2,polmajor=False)
  File "/home/docker/actions-runner/_work/bifrost/bifrost/test/test_romein.py", line 262, in run_positions_test
    numpy.testing.assert_allclose(grid, gridnaive, 1e-4, 1e-5)
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/home/docker/actions-runner/_work/_tool/Python/3.8.12/x64/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.0001, atol=1e-05

Mismatched elements: 98497 / 196608 (50.1%)
Max absolute difference: 8.043615
Max relative difference: 268.03442
 x: ndarray([[[[[ 0.      +0.j      ,  0.      +0.j      ,
              0.      +0.j      , ...,  0.      +0.j      ,
              0.      +0.j      ,  0.      +0.j      ],...
 y: array([[[[[ 0.      +0.j      ,  0.      +0.j      ,
            0.      +0.j      , ...,  0.      +0.j      ,
            0.      +0.j      ,  0.      +0.j      ],...

jaycedowell commented 2 years ago

Yeah, 11 of 17 tests are failing. For what it's worth, the ones that fail are the polmajor=False ones.

league commented 2 years ago

I think I have seen these once before, probably on qblocks. I may have a log file somewhere with some details including architectures. But it didn't happen on most of their machines.

league commented 2 years ago

Update: I don't have the log file for when romein failed… just fond memories. 😏 Now I wonder if maybe it was on google colab.

jaycedowell commented 2 years ago

After some digging, it looks like the polmajor=False failures are related to this change to bifrost.ndarray that I made. I'm not sure why this is a problem.

jaycedowell commented 2 years ago

More digging shows that the problem is with bifrost.ndarray.copy(). Specifically, it does not check whether the array is C contiguous before it copies; it simply assumes that it is.
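
For reference, this is roughly what such a check looks like on the numpy side. It is only a sketch using plain numpy arrays, not Bifrost code, and `ensure_c_contiguous` is a hypothetical helper name made up for the illustration:

```python
import numpy as np

def ensure_c_contiguous(arr):
    """Hypothetical helper: return arr as-is if C contiguous, else a C-ordered copy."""
    if arr.flags['C_CONTIGUOUS']:
        return arr
    return np.ascontiguousarray(arr)

data = np.arange(12, dtype=np.complex64).reshape(3, 4)
view = data.T                      # transposed view: same buffer, swapped strides
fixed = ensure_c_contiguous(view)  # new buffer laid out in C order

print(view.flags['C_CONTIGUOUS'], fixed.flags['C_CONTIGUOUS'])
print(np.array_equal(view, fixed))  # values are unchanged, only the layout differs
```

A raw byte-for-byte copy of `view` would carry the transposed layout along, which is the situation a contiguity check is meant to catch.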

jaycedowell commented 2 years ago

Maybe it's not so clear cut. Adding strides=... to bifrost.ndarray does seem to be the right thing to do but I'm having trouble reproducing the full polmajor=False error with a simple numpy/Bifrost comparison.

jaycedowell commented 2 years ago

No, it is clear what is going on and there are two problems:

  1. There are several modules in Bifrost that make an assumption that the data are C-ordered with strides[i] > strides[i+1] and
  2. bifrost.ndarray.copy() doesn't work the same way as numpy.ndarray.copy(). numpy.ndarray.copy() will make things C-ordered if they are not already, whereas bifrost.ndarray.copy() keeps whatever memory layout was there.

(1) wouldn't be so much of an issue if (2) weren't also happening.
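
The difference between the two copy behaviours can be sketched in numpy alone. This does not call Bifrost; `order='K'` is used here only as a stand-in for a copy that keeps the existing layout, and `strides_decreasing` is a hypothetical helper that spells out the strides[i] > strides[i+1] assumption from (1):

```python
import numpy as np

def strides_decreasing(arr):
    """Hypothetical check for the assumption strides[i] > strides[i+1]."""
    return all(arr.strides[i] > arr.strides[i + 1] for i in range(arr.ndim - 1))

a = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
t = a.transpose(2, 0, 1)        # axis-permuted view, strides no longer decreasing

c_copy = t.copy()               # ndarray.copy() defaults to order='C'
k_copy = t.copy(order='K')      # keeps the permuted layout of the source

for name, arr in [("a", a), ("t", t), ("c_copy", c_copy), ("k_copy", k_copy)]:
    print(name, arr.strides, strides_decreasing(arr))
```

Both copies hold the same values, but only `c_copy` is safe to hand to code that walks the buffer assuming C order.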

jaycedowell commented 2 years ago

Closing with the merge of #174.