ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

bug: BF_STATUS_DEVICE_ERROR when using a slice #155

Closed: telegraphic closed this issue 1 year ago

telegraphic commented 2 years ago

When operating on sliced bifrost.ndarrays in CUDA space, we have been running into a BF_STATUS_DEVICE_ERROR exception (and BF_STATUS_MEM_OP_FAILED / BF_STATUS_INTERNAL_ERROR).

Here is a minimal example:

import bifrost as bf
import numpy as np
from bifrost.ndarray import copy_array
DEDISP_KERNEL = """
// All inputs have axes (beam, frequency, time)
// input i (the data) has shape (5, 512, 2048)
// time delay td (the frequency-dependent offset to the first time sample to select) has shape (1, 512, 1)
// Compute o = i shifted by td, with adjacent pairs of frequency channels averaged together
// The shape of the output o is (5, 256, 2048)
// the axis names used below are b (beam), f (output frequency channel) and ft (time)
o(b, f, ft) = (i(b, 2*f, (ft + td(1, 2*f, 1))) + i(b, 1 + 2*f, (ft + td(1, 1 + 2*f, 1))) ) / 2;
"""
x = np.random.normal(0, 1, (20, 512, 2048)).astype(np.float32)
test = bf.ndarray(x, space = 'cuda')
reduced = bf.ndarray(shape = (5, 256, 128), dtype = np.float32, space = 'cuda')
dedisp = bf.ndarray(shape = (5, 256, 2048), dtype = np.float32, space = 'cuda')
td = bf.ndarray(shape = (1, 512, 1), dtype = np.uint8, space = 'cuda')
for i in range(20):
  new_td = np.full((1, 512, 1), 5, dtype = np.uint8)
  # Copying new_td to td only on the first iteration works; copying it on every
  # iteration (i.e. removing this guard) triggers the error described below
  if i==0:
    copy_array(td, new_td)
  if i < 15:
    bf.map(DEDISP_KERNEL, data={'o': dedisp, 'i': test[i:i+5, :, :], 'td': td}, axis_names = ['b', 'f', 'ft'], shape = (5, 256, 2048))
    start = i
    stop = i + 512
    # dedisp[:, :, start:stop] is a strided (non-contiguous) view; reducing it from
    # (5, 256, 512) to (5, 256, 128) is a factor-4 mean along the time axis
    bf.reduce(dedisp[:, :, start:stop], reduced, op = 'mean')

A BF_STATUS_DEVICE_ERROR occurs when new_td is copied to td in CUDA space on every iteration (i.e. for all values of i, not just i == 0, in the copy_array call above) and the reduction factor in the bf.reduce call is lower than 8. With reduction factors of 8 and higher it works regardless of the copying.

If new_td is copied to td only once, everything works fine. Once new_td is copied more than once (i.e. the data in CUDA space is overwritten, even if it is overwritten with the same values), the exception is raised.
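
For concreteness, the failing variant is sketched below: it is the loop from the minimal example with the i == 0 guard removed, so td is overwritten on every iteration, while the reduction factor stays at 4:

for i in range(20):
  new_td = np.full((1, 512, 1), 5, dtype = np.uint8)
  copy_array(td, new_td)   # td overwritten on every iteration -> triggers the error
  if i < 15:
    bf.map(DEDISP_KERNEL, data={'o': dedisp, 'i': test[i:i+5, :, :], 'td': td}, axis_names = ['b', 'f', 'ft'], shape = (5, 256, 2048))
    # factor-4 reduction (512 -> 128) over the strided slice raises BF_STATUS_DEVICE_ERROR
    bf.reduce(dedisp[:, :, i:i+512], reduced, op = 'mean')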

Attempting to access dedisp afterwards gives:

In [3]: dedisp
Out[3]: ---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/mpy3/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/mpy3/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395
    396             return _default_pprint(obj, self, cycle)

~/mpy3/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

~/install/bifrost/python/bifrost/ndarray.py in __repr__(self)
    343             return self.copy(space='system')
    344     def __repr__(self):
--> 345         return super(ndarray, self._system_accessible_copy()).__repr__()
    346     def __str__(self):
    347         return super(ndarray, self._system_accessible_copy()).__str__()

~/install/bifrost/python/bifrost/ndarray.py in _system_accessible_copy(self)
    341             return self
    342         else:
--> 343             return self.copy(space='system')
    344     def __repr__(self):
    345         return super(ndarray, self._system_accessible_copy()).__repr__()

~/install/bifrost/python/bifrost/ndarray.py in copy(self, space, order)
    361             space = self.bf.space
    362         # Note: This makes an actual copy as long as space is not None
--> 363         return ndarray(self, space=space)
    364     def _key_returns_scalar(self, key):
    365         # Returns True if self[key] would return a scalar (i.e., not a view)

~/install/bifrost/python/bifrost/ndarray.py in __new__(cls, base, space, shape, dtype, buffer, offset, strides, native, conjugated)
    197                                       native=base.bf.native,
    198                                       conjugated=conjugated)
--> 199                 copy_array(obj, base)
    200         else:
    201             # Create new array

~/install/bifrost/python/bifrost/ndarray.py in copy_array(dst, src)
    109     else:
    110         _check(_bf.bfArrayCopy(dst_bf.as_BFarray(),
--> 111                                src_bf.as_BFarray()))
    112         if dst_bf.bf.space != src_bf.bf.space:
    113             # TODO: Decide where/when these need to be called

~/install/bifrost/python/bifrost/libbifrost.py in _check(status)
    116             else:
    117                 status_str = _bf.bfGetStatusString(status)
--> 118                 raise RuntimeError(status_str)
    119     else:
    120         if status == _bf.BF_STATUS_END_OF_DATA:

RuntimeError: b'BF_STATUS_MEM_OP_FAILED'
jaycedowell commented 2 years ago

After some digging I think I know what is going on. It looks like the problem with the dedisp slice in the bf.reduce call is that the memory isn't contiguous along the reduction axis. However, bf.reduce treats it as if it were and launches a vectorized reduction kernel that ends up failing.
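
To make the layout issue concrete, here is a small host-side illustration; a NumPy array of the same shape stands in for dedisp, and its strides are only an analogy for the device-side layout:

import numpy as np

a = np.zeros((5, 256, 2048), dtype=np.float32)   # same shape as dedisp above
view = a[:, :, 0:512]                            # the kind of slice handed to bf.reduce
print(view.shape)                    # (5, 256, 512)
print(view.flags['C_CONTIGUOUS'])    # False: rows of the view are still 2048 elements apart
print(view.strides)                  # (2097152, 8192, 4) bytes, not the packed (524288, 2048, 4)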

The quick fix is to set all of the use_vec#_kernel flags in reduce.cu to false when the input array is not contiguous, forcing the non-vectorized loop kernel. That will have some performance impact on reductions over non-contiguous arrays, but it should be robust. This fix might also be a little heavy-handed, since it really seems to be only the layout along the reduction axis that matters.
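
Until a fix is merged, a possible user-level workaround is to materialize the slice into its own array before reducing. This is only a sketch: it assumes that ndarray.copy (visible in the traceback above) returns a compact copy of the strided view that the vectorized reduce path can handle:

chunk = dedisp[:, :, start:stop].copy(space='cuda')   # assumed to yield a contiguous device copy
bf.reduce(chunk, reduced, op = 'mean')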

jaycedowell commented 2 years ago

@telegraphic Does slice-with-reduce solve this for you?