icl-utk-edu / slate

SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Energy Exascale Computing Project (ECP).
https://icl.utk.edu/slate/
BSD 3-Clause "New" or "Revised" License

MOSI cleanup #113

Closed by neil-lindquist 1 year ago

neil-lindquist commented 1 year ago

This PR starts cleaning up MOSI and the organization of BaseMatrix and MatrixStorage.

mgates3 commented 1 year ago

CI has an error in test_lq for CUDA on leconte:

 ./test_lq
corrupted double-linked list
 *** Process received signal ***
 Signal: Aborted (6)
 Signal code:  (-6)
 [ 0] /lib64/libc.so.6(+0x54df0)[0x7f60ffa06df0]
 [ 1] /lib64/libc.so.6(+0xa154c)[0x7f60ffa5354c]
 [ 2] /lib64/libc.so.6(raise+0x16)[0x7f60ffa06d46]
 [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f60ff9da7f3]
 [ 4] /lib64/libc.so.6(+0x29130)[0x7f60ff9db130]
 [ 5] /lib64/libc.so.6(+0xab617)[0x7f60ffa5d617]
 [ 6] /lib64/libc.so.6(+0xac16c)[0x7f60ffa5e16c]
 [ 7] /lib64/libc.so.6(+0xad1cb)[0x7f60ffa5f1cb]
 [ 8] /lib64/libc.so.6(free+0x55)[0x7f60ffa61955]
 [ 9] ./test_lq(_ZN9__gnu_cxx13new_allocatorISt7complexIfEE10deallocateEPS2_m+0x2f)[0x4d8aa9]
 [10] ./test_lq(_ZNSt16allocator_traitsISaISt7complexIfEEE10deallocateERS2_PS1_m+0x2b)[0x4d2eb0]
 [11] ./test_lq(_ZNSt12_Vector_baseISt7complexIfESaIS1_EE13_M_deallocateEPS1_m+0x32)[0x4cc402]
 [12] ./test_lq(_ZNSt12_Vector_baseISt7complexIfESaIS1_EED2Ev+0x3e)[0x4c532c]
 [13] ./test_lq(_ZNSt6vectorISt7complexIfESaIS1_EED1Ev+0x41)[0x4bd235]
 [14] ./test_lq[0x4b8f1e]
 [15] ./test_lq[0x4ae88b]
 [16] ./test_lq[0x4abc07]
 [17] ./test_lq[0x50e5c9]
 [18] ./test_lq[0x4ac174]
 [19] ./test_lq[0x50eced]
 [20] ./test_lq[0x4ac3b5]
 [21] /lib64/libc.so.6(+0x3feb0)[0x7f60ff9f1eb0]
 [22] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f60ff9f1f60]
 [23] ./test_lq[0x4aabd5]
FAILED : exit code -6

Any ideas?

neil-lindquist commented 1 year ago

I think that refers to the linked list the allocator uses to manage memory allocations. So I'm guessing something wrote past the end of its valid memory, but I don't know what. I'll run valgrind and see what shows up.
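To illustrate the kind of bug being guessed at here, the following is a minimal sketch of a guard-element ("canary") check that detects a write past the end of a buffer. It is an assumption-laden illustration, not SLATE code: `GuardedBuffer` and `fill_ones` are hypothetical names, and the element type `std::complex<float>` is chosen only because it matches the vector type in the backtrace. The overrun in the sketch lands inside deliberately allocated guard space, so the example itself stays well defined; a real overrun would instead hit the allocator's bookkeeping and could produce an abort like the one in the log.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of SLATE): a buffer of n usable elements
// followed by n_guard guard elements holding a known sentinel value.
class GuardedBuffer {
public:
    using value_type = std::complex<float>;

    explicit GuardedBuffer(std::size_t n, std::size_t n_guard = 4)
        : n_(n), data_(n + n_guard, canary())
    {
        // Zero the usable region; the guard elements keep the sentinel.
        for (std::size_t i = 0; i < n_; ++i)
            data_[i] = value_type(0.0f, 0.0f);
    }

    value_type* data() { return data_.data(); }
    std::size_t size() const { return n_; }

    // Returns true if no guard element has been overwritten.
    bool guards_intact() const
    {
        for (std::size_t i = n_; i < data_.size(); ++i)
            if (data_[i] != canary())
                return false;
        return true;
    }

private:
    static value_type canary() { return value_type(-12345.0f, 54321.0f); }

    std::size_t n_;
    std::vector<value_type> data_;
};

// Stand-in for a kernel with a possibly wrong length argument:
// writes `count` elements starting at `ptr`.
void fill_ones(std::complex<float>* ptr, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        ptr[i] = std::complex<float>(1.0f, 0.0f);
}
```

Calling `fill_ones(buf.data(), buf.size())` leaves the guards intact, while an off-by-one call such as `fill_ones(buf.data(), buf.size() + 1)` overwrites the first guard element and makes `guards_intact()` return false. Tools such as valgrind perform a far more general version of this check without modifying the code.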

neil-lindquist commented 1 year ago

The tests and unit tests seem to be working fine on both 1 node and 4 nodes. The CI failure is in QR, which is known to fail sporadically. I'm now checking a few routines with big matrices on Frontier.

But if there's nothing else, could you mark this PR as approved, @mgates3?