Closed by telegraphic 1 year ago
After some digging I think I know what is going on. The problem with the `dedisp` slice in the `bf.reduce` call is that the memory isn't contiguous along the reduction axis. However, `bf.reduce` treats it as if it were and launches a vectorized reduction kernel that ends up failing.

The quick fix is to set all of the `use_vec#_kernel` flags in `reduce.cu` to `false` if the input array is not contiguous, forcing the non-vectorized loop kernel. That will have some performance impact on reductions over non-contiguous inputs, but it should be robust. This fix might also be a little heavy-handed, since it really seems to be only the layout along the reduction axis that matters.
@telegraphic Does slice-with-reduce solve this for you?
When operating on sliced `bifrost.ndarray`s in CUDA space, we have been running into a `BF_STATUS_DEVICE_ERROR` exception (and also `BF_STATUS_MEM_OP_FAILED` / `BF_STATUS_INTERNAL_ERROR`).
Here is a minimal example:
A `BF_STATUS_DEVICE_ERROR` occurs when `new_td` is copied to `td` in CUDA space for all `i` values (line 24) and the reduction factor (line 29) is lower than 8. For reduction factors of 8 and higher it works regardless of the copying.
If `new_td` is copied to `td` only once, everything works fine. Once `new_td` is copied more than once (i.e., the data in CUDA space is replaced, even if it is replaced by the same numbers), the exception is raised.
(attempting to access `dedisp` gives):