Closed by neil-lindquist 1 year ago
CI has an error in test_lq for CUDA on leconte:
./test_lq
corrupted double-linked list
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
[ 0] /lib64/libc.so.6(+0x54df0)[0x7f60ffa06df0]
[ 1] /lib64/libc.so.6(+0xa154c)[0x7f60ffa5354c]
[ 2] /lib64/libc.so.6(raise+0x16)[0x7f60ffa06d46]
[ 3] /lib64/libc.so.6(abort+0xd3)[0x7f60ff9da7f3]
[ 4] /lib64/libc.so.6(+0x29130)[0x7f60ff9db130]
[ 5] /lib64/libc.so.6(+0xab617)[0x7f60ffa5d617]
[ 6] /lib64/libc.so.6(+0xac16c)[0x7f60ffa5e16c]
[ 7] /lib64/libc.so.6(+0xad1cb)[0x7f60ffa5f1cb]
[ 8] /lib64/libc.so.6(free+0x55)[0x7f60ffa61955]
[ 9] ./test_lq(_ZN9__gnu_cxx13new_allocatorISt7complexIfEE10deallocateEPS2_m+0x2f)[0x4d8aa9]
[10] ./test_lq(_ZNSt16allocator_traitsISaISt7complexIfEEE10deallocateERS2_PS1_m+0x2b)[0x4d2eb0]
[11] ./test_lq(_ZNSt12_Vector_baseISt7complexIfESaIS1_EE13_M_deallocateEPS1_m+0x32)[0x4cc402]
[12] ./test_lq(_ZNSt12_Vector_baseISt7complexIfESaIS1_EED2Ev+0x3e)[0x4c532c]
[13] ./test_lq(_ZNSt6vectorISt7complexIfESaIS1_EED1Ev+0x41)[0x4bd235]
[14] ./test_lq[0x4b8f1e]
[15] ./test_lq[0x4ae88b]
[16] ./test_lq[0x4abc07]
[17] ./test_lq[0x50e5c9]
[18] ./test_lq[0x4ac174]
[19] ./test_lq[0x50eced]
[20] ./test_lq[0x4ac3b5]
[21] /lib64/libc.so.6(+0x3feb0)[0x7f60ff9f1eb0]
[22] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f60ff9f1f60]
[23] ./test_lq[0x4aabd5]
FAILED : exit code -6
Any ideas?
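I think that's referring to the linked list that glibc uses to manage memory allocations. So I'm guessing something wrote past the end of its valid memory and clobbered the allocator's bookkeeping, but I don't know what. The abort only shows up later, when the std::vector<std::complex<float>> destructor calls free, so the backtrace points at the victim rather than the bad write. I'll run valgrind and see what shows up.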
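For reference, one way to narrow down this kind of overrun is to temporarily route suspect writes through a bounds-checked accessor. The helper below is only a sketch: `checked_store` is a hypothetical name, not something in SLATE or the tester. It relies on `std::vector::at`, which throws `std::out_of_range` instead of silently writing past the buffer.

```cpp
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical debugging helper (not part of SLATE): store a value with
// bounds checking. Returns true on success, false if idx is out of range.
inline bool checked_store(std::vector<std::complex<float>>& buf,
                          std::size_t idx,
                          std::complex<float> value)
{
    try {
        // at() throws std::out_of_range for idx >= buf.size(),
        // so an overrun is reported here instead of corrupting the heap.
        buf.at(idx) = value;
        return true;
    }
    catch (std::out_of_range const&) {
        return false;
    }
}
```

With a buffer of 4 elements, a store at index 3 succeeds and a store at index 4 is rejected, whereas `buf[4] = value` would be undefined behavior that may only surface later in `free`.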
The tests and unit tests seem to be working fine for both 1 node and 4 nodes. The CI failure is QR, which is known to sporadically fail. I'm working on checking a few routines with big matrices on Frontier.
But, if there's nothing else, could you mark this PR as approved @mgates3?
This PR works to start cleaning up MOSI and the organization of BaseMatrix and MatrixStorage.
- BaseMatrix::tileState(int64_t i, int64_t j, int device, MOSI mosi), since its only purpose appears to be putting MOSI in an illegal state (3f1afb)
- BaseMatrix<scalar_t>::releaseLocalWorkspaceTile checked things that BaseMatrix::tileRelease already checks. (704972)
- tileGetFor* routines taking sets of tiles duplicated the same preallocation code (350db3)
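To illustrate what an "illegal state" can mean for a coherency protocol like MOSI, here is a minimal sketch of an invariant check across the copies of one tile. It is an assumption-laden simplification: the `State` enum and `is_legal` are hypothetical, the OnHold flag is left out, and this is not SLATE's actual enum or checking code. The invariant shown is the usual one for such protocols: at most one copy may be Modified, and a Modified copy cannot coexist with Shared copies.

```cpp
#include <vector>

// Hypothetical, simplified coherency states; not SLATE's actual MOSI enum.
enum class State { Modified, Shared, Invalid };

// Returns true if the per-device states of one tile are mutually consistent:
// either no copy is Modified, or exactly one is Modified and none are Shared.
inline bool is_legal(std::vector<State> const& copies)
{
    int modified = 0;
    int shared = 0;
    for (State s : copies) {
        if (s == State::Modified)
            ++modified;
        else if (s == State::Shared)
            ++shared;
    }
    return modified == 0 || (modified == 1 && shared == 0);
}
```

A setter that overwrites the state of a single copy without updating the others can break this invariant, for example by leaving one copy Modified while another is still Shared.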