torrance opened this issue 1 year ago (status: Open)
I'll try to find some time this week to look into this.
I spent some time looking into this and I think I know what is happening for this particular case. I think the underlying problem is how Bifrost computes array sizes. By default it uses the product of the first array stride with the size of the first array dimension to get the overall array size. In Python that would be:

```python
from bifrost import ndarray

a = ndarray(shape=(1, 1, 6), dtype='ci8')
print(a.size * a.itemsize, 'vs', a.strides[0] * a.shape[0])
```
which looks ok. When `misalign` is non-zero we end up with:
```python
b = a[..., 2:]
print(b.size * b.itemsize, 'vs', b.strides[0] * b.shape[0])
```
which doesn't match. The `misalign=2` causes the start address for the underlying memory and the shape to be updated, but not the strides.
```python
print(a.ctypes.data, 'vs', b.ctypes.data)
print(a.shape, 'vs', b.shape)
print(a.strides, 'vs', b.strides)
```
In this case the correct array size would come from something like `numpy.prod(b.shape) * b.itemsize`, since both of those should be available in the `BFarray` class/struct.
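To see the same mismatch without needing Bifrost installed, here is a minimal sketch of the arithmetic above using plain NumPy, with `int16` standing in for the 2-byte `ci8` dtype (the dtype substitution is only for illustration):

```python
import numpy as np

# Stand-in for a (1, 1, 6) ci8 array: int16 also has itemsize 2.
a = np.zeros((1, 1, 6), dtype=np.int16)
b = a[..., 2:]  # misalign = 2: the start address moves and the shape shrinks

# For the contiguous array the stride-based size matches the true size.
print(a.size * a.itemsize, 'vs', a.strides[0] * a.shape[0])  # 12 vs 12

# For the sliced view the stride-based estimate overshoots ...
print(b.size * b.itemsize, 'vs', b.strides[0] * b.shape[0])  # 8 vs 12

# ... while the shape-based product gives the correct byte count.
print(int(np.prod(b.shape)) * b.itemsize)  # 8
```

The slice changes `b.shape` but leaves `b.strides` identical to `a.strides`, which is exactly why a stride-times-first-dimension size over-counts.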
I'm less sure what to do about it. In this particular case, changing the calculation of `A_nelement_total` to use `shape` (somehow) would seem like the thing to do. However, I doubt that is a general solution since `A_nelement_total` may only be relevant for the batched size of the operation. Thinking about what is happening here, we are trying to do c = α a.aᴴ + β c. Here `a` is a matrix with shape[^1] `(nbatch, N, K)` and `c` has shape `(nbatch, N, N)`. That would make `nbatch` the number of independent pieces of a.aᴴ to run, `N` sets the number of elements in the last two dimensions of `c`, and `K` is the number of elements to sum over for each element of `c` in the matrix multiplication. I think that makes `A_nelement_total` the maximum of "the distance between rows of `a`" and "the distance between batched sections of `a`", added to "the offset between the start of `a` and the texture alignment"?
[^1]: The shape is technically `(nbatch, K, N)` because the original `LinAlg.matmul()` call puts `a` in the location of `b`, which says that what we are passing in is the Hermitian conjugate of `a`.
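As a rough sketch, the sizing rule guessed at above might look like the following. Every name and the semantics here are assumptions following that guess, not Bifrost's actual implementation:

```python
# Hypothetical sketch of the guessed sizing rule; all names and semantics
# are assumptions, not what linalg_kernels.cu actually does.
def a_texture_elements(A_offset, A_stride, A_batchstride):
    row_span = A_stride          # distance between consecutive rows of `a`
    batch_span = A_batchstride   # distance between batched sections of `a`
    # offset from the start of `a` to the texture alignment, plus the
    # larger of the two spans
    return A_offset + max(row_span, batch_span)

print(a_texture_elements(A_offset=1, A_stride=2, A_batchstride=6))  # 7
```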
Thank you for looking at this, @jaycedowell. I'll re-read your comment a few times and attempt to have another look at it next week.
Hi @jaycedowell, I think the following gives the correct value for `A_nelement_total`:

```cpp
size_t A_nelement_total =
    (A_offset + N)                  // the elements in the first row of the first batch
    + (K - 1) * A_stride            // the elements in the rest of the first batch
    + (nbatch - 1) * A_batchstride; // the elements for the remaining batches
```
The tests pass and I no longer get memory violations in ROCm.
Do you think the logic works now for all use cases?
I'm not sure that `A_nelement_total` should include elements from other batches (that last term in the sum). For the batched execution you are only really concerned with those elements that are used for that particular batch. Each batch should handle its own texture binding.
I think it needs to be sized to store all the batches. The texture isn't created once per batch, it's shared amongst all of them, from my reading of the code.
In any case, if I delete the last term in the sum the tests will fail again.
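For reference, the proposed C expression can be transcribed into a small Python helper that makes the three terms explicit. The variable meanings in the comments are the commenter's reading of the kernel, so treat them as assumptions:

```python
# Python transcription of the proposed A_nelement_total expression.
# A_offset, N, K, A_stride, A_batchstride, nbatch follow the names used in
# linalg_kernels.cu; their exact semantics here are assumptions.
def a_nelement_total(A_offset, N, K, A_stride, A_batchstride, nbatch):
    first_row = A_offset + N                      # first row of the first batch
    rest_of_batch = (K - 1) * A_stride            # remaining rows of the first batch
    other_batches = (nbatch - 1) * A_batchstride  # all remaining batches
    return first_row + rest_of_batch + other_batches

# With a single batch the last term vanishes, so dropping it can only
# change the computed size (and hence the texture size) when nbatch > 1.
print(a_nelement_total(A_offset=1, N=2, K=1, A_stride=2, A_batchstride=4, nbatch=1))  # 3
print(a_nelement_total(A_offset=1, N=2, K=1, A_stride=2, A_batchstride=4, nbatch=2))  # 7
```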
The `test_linalg/test_matmul_ab_beamformer_kernel_small` test adds a `misalign` parameter to test `linalg_kernels.cu`'s handling of memory not aligned to 512-byte boundaries: https://github.com/ledatelescope/bifrost/blob/30fd2682ec22bc26d9dd8cd9bc424aefc4e7075e/test/test_linalg.py#L224-L231
For example, on the third innermost iteration, this test passes in the values `(nchan = 1, ntime = 1, nstand = 3, misalign = 2)`. This produces an array with size `(1, 1, 6)`, which is then truncated (using the misalign value) down to `(1, 1, 4)`.

After calling `linalg.matmul`, this eventually reaches the function `bf_cherk_N()`, where the total number of elements in the array needs to be calculated: https://github.com/ledatelescope/bifrost/blob/30fd2682ec22bc26d9dd8cd9bc424aefc4e7075e/src/linalg_kernels.cu#L539-L540
For this particular test, `n_element_total` should be `2 (size of truncated array) + 1 (offset)`; however, it calculates it as `3 + 1 = 4`.

When this incorrect, larger size is later used to calculate the `sizeInBytes` of the texture, the texture believes it has access to more memory than it really does. On CUDA backends this isn't actually validated, but on ROCm backends the texture creation fails since it actually checks the backing memory.

I've spent some time with this but haven't been able to fix it, as I'm not clear what all of `N`, `K`, `A_stride`, `nbatch`, etc. are, or where they are meant to be coming from.
of the texture, this results in the texture believe it has access to more memory than it really does. On CUDA backends this isn't actually validated, but on ROCM backends, the texture creation fails since it actually checks the backing memory.I've spent some time with this but haven't been able to fix it, as I'm not clear what all of N, K, A_stride, nbatch, etc. are, or where they are meant to be coming from.