Thanks for opening the issue. Here's my explanation for what happens. It looks like multiple RAFT utility functions make use of the matrixVectorOp operation. matrixVectorOp will, under most conditions, opt for operations that run on 16-byte-aligned addresses. However, it does not check memory alignment, because RMM allocations are supposed to be aligned by default. Instead it checks other conditions:
1) That the type has at most 16 bytes -> float64 has 8 bytes
2) That the number of bytes in a stride is a multiple of 16 -> (10 * 8) % 16 == 0 but (11 * 8) % 16 != 0 (here the stride is the number of rows)
If the conditions are satisfied, it will launch the operation on 16 bytes even though the memory may not be correctly aligned for it, causing the error. If the conditions are unmet, it will launch the operation on 8 bytes, which should always work fine, as memory should always be aligned on 8 bytes for float64 data.
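Here's a rough, hypothetical illustration of those two conditions (plain numpy on the host as a stand-in for device memory; the helper name is made up and this is not RAFT code):

```python
import numpy as np

def picks_16_byte_path(dtype, stride_elems):
    """Mimics the two conditions described above; note that no pointer check is done."""
    itemsize = np.dtype(dtype).itemsize
    cond1 = itemsize <= 16                       # 1) the element type has at most 16 bytes
    cond2 = (stride_elems * itemsize) % 16 == 0  # 2) the stride in bytes is a multiple of 16
    return cond1 and cond2

print(picks_16_byte_path(np.float64, 10))  # True : (10 * 8) % 16 == 0 -> 16-byte loads chosen
print(picks_16_byte_path(np.float64, 11))  # False: (11 * 8) % 16 != 0 -> safe 8-byte fallback
```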
To follow up on this and add to what @viclafargue is describing: the issue arises specifically because matrixVectorOp will promote types to lower the number of reads (e.g. reading a single 16-byte element instead of 2x 8-byte elements). However, when it does this promotion, it was not properly checking that the given pointer was aligned to the size of the promoted type, hence the alignment error. I've reopened and reviewed the original PR so that we can work it into 21.10 (pending the addition of better documentation outlining the justification for the promotion and the need to check the alignment).
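As a rough sketch of what the fix amounts to (a hypothetical helper, not the actual PR code): the promoted load width should only be used when the pointer itself is aligned to it; otherwise the code should fall back to the narrower load.

```python
import numpy as np

def choose_load_bytes(arr, stride_elems):
    """Pick the widest load (16 or 8 bytes) that both the stride and the pointer allow."""
    for width in (16, 8):
        stride_ok  = (stride_elems * arr.itemsize) % width == 0
        pointer_ok = arr.ctypes.data % width == 0   # the check that was missing
        if stride_ok and pointer_ok:
            return width
    return arr.itemsize

raw = np.zeros(34, dtype=np.float64)
base = raw if raw.ctypes.data % 16 == 0 else raw[1:]   # force a 16-byte-aligned "allocation"

print(choose_load_bytes(base,     10))  # 16: aligned pointer, (10 * 8) % 16 == 0
print(choose_load_bytes(base[1:], 10))  # 8 : view offset by one float64 -> pointer % 16 == 8
print(choose_load_bytes(base,     11))  # 8 : (11 * 8) % 16 != 0, even with an aligned pointer
```

With a check like that, an offset view simply takes the 8-byte path instead of faulting.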
As @viclafargue pointed out to me, this was always working on a fresh, newly allocated pointer (e.g. using series.copy()) because RMM's alignment to 256 bytes is already a multiple of both 8 and 16 bytes.
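Quick back-of-the-envelope arithmetic for why that is (the address below is just an example 256-byte-aligned value, not a real allocation):

```python
base = 0x7F1AC0000000            # hypothetical RMM allocation base, 256-byte aligned
print(base % 16)                 # 0 -> a fresh copy() can use the 16-byte path safely
print((base + 1 * 8) % 16)       # 8 -> a view offset by one float64 (start=1) is misaligned
print((base + 2 * 8) % 16)       # 0 -> an even offset (start=2) lands back on 16-byte alignment
```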
@Nanthini10, since you've already verified that Victor's PR does in fact fix the errors we were encountering, this issue can be closed now.
Describe the bug
Related to https://github.com/rapidsai/cuml/issues/4199. It looks like when a view of a cuDF Series is passed through cuML's LinearRegression fit call, the memory seems to be misaligned. Code to reproduce it is below.
But there are certain cases where using the view works. It does not error when:
- lr.fit(X, y.copy()) is used
- y is passed as a DataFrame
- start=2 (or any even number, in this case) and end=start+10
- start=1 and end=12
- fit_intercept=False

In these cases the code works, which leads us to believe the error is coming from raft/linalg/matrix_vector_op.cuh, but it's still not clear why it happens only with a pointer offset and not in all cases. From the debugging so far, the pointers seem to be aligned properly in the Python layer.

cc: @cjnolet @dantegd @viclafargue @beckernick for visibility.
Steps/Code to reproduce bug
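The original repro snippet is not captured here; below is a hypothetical sketch of the kind of call described in the report (cuDF/cuML 21.10, float64 data, a Series view starting at an odd offset, assuming the slice is a zero-copy view into the parent buffer as described). Column names and sizes are made up.

```python
import cudf
import numpy as np
from cuml.linear_model import LinearRegression

n = 100
df = cudf.DataFrame({
    "feat": np.random.rand(n),    # float64 -> 8 bytes per element
    "target": np.random.rand(n),
})

start, end = 1, 11                      # odd element offset, 10-row view
X = df[["feat"]].iloc[start:end]
y = df["target"].iloc[start:end]        # view whose pointer is offset by 1 * 8 bytes

lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)                            # reported to fail with a misaligned-address error
# lr.fit(X, y.copy())                   # reported to work: copy() gets a fresh, aligned buffer
```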
Expected behavior
There shouldn't be a memory misalignment.
Environment overview (please complete the following information)
docker pull rapidsai/rapidsai-core-dev-nightly:21.10-cuda11.0-devel-ubuntu18.04-py3.8
Environment details
Additional context
Finally, https://github.com/rapidsai/raft/pull/325 fixes it, but there might be an underlying issue this could be masking, so it might warrant further investigation.