melisgl / mgl

Common Lisp machine learning library.
MIT License

Assorted errors during asdf:test-op on SBCL/Linux #8

Closed (lukego closed 2 years ago)

lukego commented 2 years ago

I'm seeing assorted errors when running (ASDF:OOS 'ASDF:TEST-OP :MGL) with SBCL 2.1.2 on a Linux machine without CUDA (a Thinkpad).

Captured a few samples below. I'm not sure if these errors are local to the actual problem or if they are indirectly caused by earlier heap corruption.

Anyone have a suggestion how to troubleshoot this?

Unhandled memory fault at #x330179C9608.
   [Condition of type SB-SYS:MEMORY-FAULT-ERROR]

Restarts:
 0: [RETRY] Retry #<TEST-OP > on #<SYSTEM "mgl/test">.
 1: [ACCEPT] Continue, treating #<TEST-OP > on #<SYSTEM "mgl/test"> as having been successful.
 2: [RETRY] Retry ASDF operation.
 3: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration.
 4: [RETRY] Retry ASDF operation.
 5: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration.
 --more--

Backtrace:
  0: (SB-SYS:MEMORY-FAULT-ERROR #<unused argument> #.(SB-SYS:INT-SAP #X330179C9608))
  1: ("foreign function: call_into_lisp")
  2: ("foreign function: funcall2")
  3: ("foreign function: handle_memory_fault_emulation_trap")
  4: ("foreign function: #x417459")
  5: ((LAMBDA (MGL-CUBE:FACET) :IN MGL-CUBE:CALL-WITH-FACET*) #(15.549831909387535d0 233.24747864081303d0 77.74915954693768d0 295.44680627836317d0 8.0d0 186.59798291265042d0 ...))
      Locals:
        MGL-CUBE:FACET = #(15.549831909387535d0 233.24747864081303d0 77.74915954693768d0 295.44680627836317d0 8.0d0 186.59798291265042d0 ...)
        MGL-CUBE:FACET-NAME = MGL-MAT:FOREIGN-ARRAY
        MGL-MAT::FN = #<FUNCTION (LAMBDA (#:C24) :IN MGL-MAT::BLAS-DGEMM) {10179CB71B}>
        MGL-MAT:MAT = #<MGL-MAT:MAT 7x5 ABF #2A((15.549831909387535d0 233.24747864081303d0 77.74915954693768d0 295.44680627836317d0 8.0d0) ..)>
  6: ((:METHOD MGL-CUBE:CALL-WITH-FACET* (MGL-CUBE:CUBE T T T)) #<MGL-MAT::VEC 35 L {10179CA383}> MGL-MAT::LISP-VECTOR :IO #<FUNCTION (LAMBDA (MGL-CUBE:FACET) :IN MGL-CUBE:CALL-WITH-FACET*) {10179CBB2B}>) ..
  7: ((LAMBDA (MGL-CUBE:FACET) :IN MGL-CUBE:CALL-WITH-FACET*) #<unused argument>)
  8: ((:METHOD MGL-CUBE:CALL-WITH-FACET* (MGL-CUBE:CUBE T T T)) #<MGL-MAT:MAT 7x5 ABF #2A((15.549831909387535d0 233.24747864081303d0 77.74915954693768d0 295.44680627836317d0 8.0d0) (186.59798291265042d0 10..
  9: ((FLET CALL-NEXT-METHOD :IN "/home/luke/git/mgl-mat/src/mat.lisp") #<MGL-MAT:MAT 7x5 ABF #2A((15.549831909387535d0 233.24747864081303d0 77.74915954693768d0 295.44680627836317d0 8.0d0) (186.59798291265..
 10: ((:METHOD MGL-CUBE:CALL-WITH-FACET* (MGL-MAT:MAT (EQL (QUOTE MGL-MAT:FOREIGN-ARRAY)) T T)) #<MGL-MAT:MAT 7x5 ABF #2A((15.549831909387535d0 233.24747864081303d0 77.74915954693768d0 295.44680627836317d0..
 11: ((SB-PCL::SDFUN-METHOD MGL-CUBE:CALL-WITH-FACET*) #<unused argument> #<unused argument> #<MGL-MAT:MAT 7x5 ABF #2A((15.549831909387535d0 233.24747864081303d0 77.74915954693768d0 295.44680627836317d0 8...
 12: ((LAMBDA (#:B21) :IN MGL-MAT::BLAS-DGEMM) #<MGL-MAT:FOREIGN-ARRAY {10179CB673}>)
 13: ((LAMBDA (MGL-CUBE:FACET) :IN MGL-CUBE:CALL-WITH-FACET*) #(62.0d0 90.0d0 49.0d0 22.0d0 58.0d0 14.0d0 ...))
 14: ((:METHOD MGL-CUBE:CALL-WITH-FACET* (MGL-CUBE:CUBE T T T)) #<MGL-MAT::VEC 28 L {10179C9173}> MGL-MAT::LISP-VECTOR :INPUT #<FUNCTION (LAMBDA (MGL-CUBE:FACET) :IN MGL-CUBE:CALL-WITH-FACET*) {10179CB60B}..
 15: ((LAMBDA (MGL-CUBE:FACET) :IN MGL-CUBE:CALL-WITH-FACET*) #<unused argument>)
 16: ((:METHOD MGL-CUBE:CALL-WITH-FACET* (MGL-CUBE:CUBE T T T)) #<MGL-MAT:MAT 4x7 ABF #2A((62.0d0 90.0d0 49.0d0 22.0d0 58.0d0 14.0d0 ...) (91.0d0 61.0d0 4.0d0 37.0d0 4.0d0 62.0d0 ...) (93.0d0 75.0d0 44.0d0..
 17: ((FLET CALL-NEXT-METHOD :IN "/home/luke/git/mgl-mat/src/mat.lisp") #<MGL-MAT:MAT 4x7 ABF #2A((62.0d0 90.0d0 49.0d0 22.0d0 58.0d0 14.0d0 ...) (91.0d0 61.0d0 4.0d0 37.0d0 4.0d0 62.0d0 ...) (93.0d0 75.0d..
 18: ((:METHOD MGL-CUBE:CALL-WITH-FACET* (MGL-MAT:MAT (EQL (QUOTE MGL-MAT:FOREIGN-ARRAY)) T T)) #<MGL-MAT:MAT 4x7 ABF #2A((62.0d0 90.0d0 49.0d0 22.0d0 58.0d0 14.0d0 ...) (91.0d0 61.0d0 4.0d0 37.0d0 4.0d0 6..
 19: ((SB-PCL::SDFUN-METHOD MGL-CUBE:CALL-WITH-FACET*) #<unused argument> #<unused argument> #<MGL-MAT:MAT 4x7 ABF #2A((62.0d0 90.0d0 49.0d0 22.0d0 58.0d0 14.0d0 ...) (91.0d0 61.0d0 4.0d0 37.0d0 4.0d0 62.0..
 --more--

The assertion
(MGL-MAT::~= #1=(MGL-MAT:COERCE-TO-CTYPE 3)
             #2=(MGL-MAT:ASUM MGL-MAT::X))
failed with #1# = 3.0, #2# = 1.0.
   [Condition of type SIMPLE-ERROR]

Restarts:
 0: [CONTINUE] Retry assertion.
 1: [RETRY] Retry #<TEST-OP > on #<SYSTEM "mgl/test">.
 2: [ACCEPT] Continue, treating #<TEST-OP > on #<SYSTEM "mgl/test"> as having been successful.
 3: [RETRY] Retry ASDF operation.
 4: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration.
 5: [RETRY] Retry ASDF operation.
 --more--

Backtrace:
  0: (SB-KERNEL:ASSERT-ERROR (MGL-MAT::~= (MGL-MAT:COERCE-TO-CTYPE 3) (MGL-MAT:ASUM MGL-MAT::X)) 2 (MGL-MAT:COERCE-TO-CTYPE 3) 3.0 (MGL-MAT:ASUM MGL-MAT::X) 1.0)
  1: ((LAMBDA NIL :IN MGL-MAT::TEST-ASUM))
  2: (MGL-MAT:CALL-WITH-CUDA #<FUNCTION (LAMBDA NIL :IN MGL-MAT::TEST-ASUM) {53A27D6B}> :ENABLED NIL :DEVICE-ID 0 :RANDOM-SEED 1234 :N-RANDOM-STATES 4096 :OVERRIDE-ARCH-P T :N-POOL-BYTES NIL)
  3: (MGL-MAT::TEST-ASUM)
  4: (MGL-MAT::TEST)
  5: (MGL-TEST:TEST)
  6: (UIOP/PACKAGE:SYMBOL-CALL #:MGL-TEST #:TEST)
  7: ((:METHOD ASDF/ACTION:PERFORM (ASDF/LISP-ACTION:TEST-OP (EQL #<ASDF/SYSTEM:SYSTEM "mgl/test">))) #<unused argument> #<unused argument>) [fast-method]
  8: ((SB-PCL::EMF ASDF/ACTION:PERFORM) #<unused argument> #<unused argument> #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl/test">)
  9: ((LAMBDA NIL :IN ASDF/ACTION:CALL-WHILE-VISITING-ACTION))
 10: ((SB-PCL::SDFUN-METHOD ASDF/ACTION:PERFORM) #<unused argument> #<unused argument> #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl/test">)
 11: ((:METHOD ASDF/ACTION:PERFORM-WITH-RESTARTS :AROUND (T T)) #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl/test">) [fast-method]
 12: ((:METHOD ASDF/PLAN:PERFORM-PLAN (T)) #<ASDF/PLAN:SEQUENTIAL-PLAN {101658D813}>) [fast-method]
 13: ((FLET SB-C::WITH-IT :IN SB-C::%WITH-COMPILATION-UNIT))
 14: ((:METHOD ASDF/PLAN:PERFORM-PLAN :AROUND (T)) #<ASDF/PLAN:SEQUENTIAL-PLAN {101658D813}>) [fast-method]
 15: ((LAMBDA (SB-PCL::.ARG0. SB-INT:&MORE SB-PCL::.MORE-CONTEXT. SB-PCL::.MORE-COUNT.) :IN "/home/luke/quicklisp/setup.lisp") #<ASDF/PLAN:SEQUENTIAL-PLAN {101658D813}>)
 16: ((:METHOD ASDF/OPERATE:OPERATE (ASDF/OPERATION:OPERATION ASDF/COMPONENT:COMPONENT)) #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl"> :PLAN-CLASS NIL :PLAN-OPTIONS NIL) [fast-method]
 17: ((SB-PCL::EMF ASDF/OPERATE:OPERATE) #<unused argument> #<unused argument> #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl">)
 18: ((LAMBDA NIL :IN ASDF/OPERATE:OPERATE))
 19: ((:METHOD ASDF/OPERATE:OPERATE :AROUND (T T)) #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl">) [fast-method]
 --more--

The assertion
(CL-NUM-UTILS.NUM=:NUM=
 #1=(APPLY #'MGL-MAT::SUBMATRIX
           (MGL-MAT:MAT-TO-ARRAY MGL-MAT::MAT-C)
           (ARRAY-DIMENSIONS MGL-MAT::C))
 MGL-MAT::RESULT)
failed with #1# = #2A((218377.63042364857d0)), MGL-MAT::RESULT =
#2A((2.546922938428926d0)).
   [Condition of type SIMPLE-ERROR]

Restarts:
 0: [CONTINUE] Retry assertion.
 1: [RETRY] Retry #<TEST-OP > on #<SYSTEM "mgl/test">.
 2: [ACCEPT] Continue, treating #<TEST-OP > on #<SYSTEM "mgl/test"> as having been successful.
 3: [RETRY] Retry ASDF operation.
 4: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration.
 5: [RETRY] Retry ASDF operation.
 --more--

Backtrace:
  0: (SB-KERNEL:ASSERT-ERROR (CL-NUM-UTILS.NUM=:NUM= (APPLY (FUNCTION MGL-MAT::SUBMATRIX) (MGL-MAT:MAT-TO-ARRAY MGL-MAT::MAT-C) (ARRAY-DIMENSIONS MGL-MAT::C)) MGL-MAT::RESULT) 2 (APPLY (FUNCTION MGL-MAT::S..
  1: ((LAMBDA NIL :IN MGL-MAT::TEST-GEMM!))
  2: (MGL-MAT:CALL-WITH-CUDA #<FUNCTION (LAMBDA NIL :IN MGL-MAT::TEST-GEMM!) {53A2BCEB}> :ENABLED NIL :DEVICE-ID 0 :RANDOM-SEED 1234 :N-RANDOM-STATES 4096 :OVERRIDE-ARCH-P T :N-POOL-BYTES NIL)
  3: (MGL-MAT::TEST-GEMM!)
  4: (MGL-MAT::TEST)
  5: (MGL-TEST:TEST)
  6: (UIOP/PACKAGE:SYMBOL-CALL #:MGL-TEST #:TEST)
  7: ((:METHOD ASDF/ACTION:PERFORM (ASDF/LISP-ACTION:TEST-OP (EQL #<ASDF/SYSTEM:SYSTEM "mgl/test">))) #<unused argument> #<unused argument>) [fast-method]
  8: ((SB-PCL::EMF ASDF/ACTION:PERFORM) #<unused argument> #<unused argument> #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl/test">)
  9: ((LAMBDA NIL :IN ASDF/ACTION:CALL-WHILE-VISITING-ACTION))
 10: ((SB-PCL::SDFUN-METHOD ASDF/ACTION:PERFORM) #<unused argument> #<unused argument> #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl/test">)
 11: ((:METHOD ASDF/ACTION:PERFORM-WITH-RESTARTS :AROUND (T T)) #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl/test">) [fast-method]
 12: ((:METHOD ASDF/PLAN:PERFORM-PLAN (T)) #<ASDF/PLAN:SEQUENTIAL-PLAN {101658D813}>) [fast-method]
 13: ((FLET SB-C::WITH-IT :IN SB-C::%WITH-COMPILATION-UNIT))
 14: ((:METHOD ASDF/PLAN:PERFORM-PLAN :AROUND (T)) #<ASDF/PLAN:SEQUENTIAL-PLAN {101658D813}>) [fast-method]
 15: ((LAMBDA (SB-PCL::.ARG0. SB-INT:&MORE SB-PCL::.MORE-CONTEXT. SB-PCL::.MORE-COUNT.) :IN "/home/luke/quicklisp/setup.lisp") #<ASDF/PLAN:SEQUENTIAL-PLAN {101658D813}>)
 16: ((:METHOD ASDF/OPERATE:OPERATE (ASDF/OPERATION:OPERATION ASDF/COMPONENT:COMPONENT)) #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl"> :PLAN-CLASS NIL :PLAN-OPTIONS NIL) [fast-method]
 17: ((SB-PCL::EMF ASDF/OPERATE:OPERATE) #<unused argument> #<unused argument> #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl">)
 18: ((LAMBDA NIL :IN ASDF/OPERATE:OPERATE))
 19: ((:METHOD ASDF/OPERATE:OPERATE :AROUND (T T)) #<ASDF/LISP-ACTION:TEST-OP > #<ASDF/SYSTEM:SYSTEM "mgl">) [fast-method]
 --more--

melisgl commented 2 years ago

I can't reproduce this on a Thinkpad without CUDA on SBCL 2.1.2 ... of course. The problem could be with the foreign array interfacing of MGL-MAT.

The do-configurations calls in the test cases print the various foreign array strategies being tested. With what ctype and foreign array strategy do the errors occur? Also check that "cuda enabled" is always NIL in the output.

lukego commented 2 years ago

Here's a gist containing complete contents of *inferior-lisp*, *slime-repl*, and *sldb*: https://gist.github.com/lukego/8430f3cc7962e18c482c87e702333e32

I checked and cuda enabled is NIL in every instance. Here's the last output before the error:

f: 200 (9,888)
* testing MGL-MAT:GEMM!
** ctype: :FLOAT
*** cuda enabled: NIL
**** foreign array strategy: :PINNED

If I run the tests again in the same Lisp image, I get all kinds of different errors like the ones above, so I'm guessing the heap is corrupted.

melisgl commented 2 years ago

Thanks. One possibility is that lla::with-pinned-array (implemented in terms of SB-SYS:WITH-PINNED-OBJECTS) is acting up. Try changing do-foreign-array-strategies in test-mat.lisp in mgl-mat to:

(defmacro do-foreign-array-strategies (() &body body)
  `(dolist (*foreign-array-strategy* '(:static))
     ,@body))

... then recompile and test mgl-mat.
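
For reference, the recompile-and-test step can be sketched with plain ASDF calls. RECOMPILE-AND-TEST is a hypothetical helper name, not part of mgl-mat, and the sketch assumes the mgl-mat system is visible to ASDF and defines a test operation. ASDF normally notices the edited test-mat.lisp on its own, so :FORCE T is only a precaution against stale fasls.

```lisp
(require :asdf)

;; Hypothetical helper, not part of mgl-mat or ASDF.
(defun recompile-and-test (system)
  "Force a rebuild of SYSTEM, then run its ASDF test operation."
  ;; :FORCE T recompiles SYSTEM itself even if its fasls look up to date.
  (asdf:load-system system :force t)
  (asdf:test-system system))

;; Usage, assuming mgl-mat can be found by ASDF:
;; (recompile-and-test :mgl-mat)
```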

lukego commented 2 years ago

Tried that (with (:static) quoted) and still see the same behavior.

melisgl commented 2 years ago

Do LLA and CFFI tests pass reliably?

melisgl commented 2 years ago

Does this work?

;; Bind the C library's memcpy through CFFI.
(cffi:defcfun memcpy :void
  (dest :pointer)
  (src :pointer)
  (n :unsigned-long))

;; Copy X into Y through pinned pointers 1000 times and check that every
;; element arrived. Assumes 8-byte fixnum elements (64-bit SBCL).
(loop repeat 1000
      do (let ((size (random 32000)))
           (let ((x (make-array size :initial-element 1 :element-type 'fixnum))
                 (y (make-array size :initial-element 0 :element-type 'fixnum)))
             (lla::with-pinned-array (xp x)
               (lla::with-pinned-array (yp y)
                 (memcpy yp xp (* 8 size))))
             (assert (= size (loop for e across y sum e))))))

lukego commented 2 years ago

Does this work?

Yes, that works.

LLA test suite passes consistently.

The CFFI test suite errors out, though not obviously for a related reason. I can look into why tomorrow:

Unable to load any of the alternatives:
   ("libffi.so.7" "libffi32.so.7" "libffi.so.6" "libffi32.so.6"
    "libffi.so.5" "libffi32.so.5")
   [Condition of type CFFI:LOAD-FOREIGN-LIBRARY-ERROR]

Restarts:
 0: [RETRY] Try loading the foreign library again.
 1: [USE-VALUE] Use another library instead.
 2: [TRY-RECOMPILING] Recompile libffi and try loading it again
 3: [RETRY] Retry loading FASL for #<CL-SOURCE-FILE "cffi-libffi" "libffi" "libffi">.
 4: [ACCEPT] Continue, treating loading FASL for #<CL-SOURCE-FILE "cffi-libffi" "libffi" "libffi"> as having been successful.
 5: [RETRY] Retry ASDF operation.
 --more--

lukego commented 2 years ago

Interesting. I had to fight a little to get the CFFI test suite to run, including making it accept libffi.so.8 instead of libffi.so.7 and doing some hack-and-slash to make the pkgconfig file available on NixOS. With that setup the test suite fails:

4 out of 332 total tests failed: FSBV.WFO, FSBV.MAKEPAIR.1, FSBV.MAKEPAIR.2, 
   TEST-STATIC-PROGRAM.
1 unexpected failures: TEST-STATIC-PROGRAM.; 

I'm not sure whether this is relevant, but it might be something to follow up on with CFFI. Full log at https://gist.github.com/lukego/fca6cdff0c507b6c844bee5b9a09502c.

melisgl commented 2 years ago

There is a similar open issue in mgl-mat (https://github.com/melisgl/mgl-mat/issues/3).

melisgl commented 2 years ago

Assuming some mgl-mat tests cause corruption (?) and others do not, maybe try commenting out some from the TEST function in mgl-mat/test/test-mat.lisp. Perhaps there is a minimal test case to be had or some observation about what kind of tests cause problems.
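
As an alternative to commenting tests out by hand, a small driver along these lines could run each test function in isolation and record which ones signal. This is only a sketch: RUN-TESTS-INDIVIDUALLY is a hypothetical helper, not part of mgl-mat, and the test names in the usage comment are taken from the backtraces above. Once the heap is corrupted, later results in the same image are unreliable, so it is probably best to start a fresh Lisp for each candidate.

```lisp
;; Hypothetical helper, not part of mgl-mat: run each test function on its
;; own and record whether it returned normally or signalled an error.
(defun run-tests-individually (tests)
  "TESTS is a list of function designators. Return an alist of
(TEST . STATUS) where STATUS is :OK or the condition that was signalled."
  (loop for test in tests
        collect (cons test
                      (handler-case (progn (funcall test) :ok)
                        ;; Failed assertions are SIMPLE-ERRORs; on SBCL a
                        ;; memory fault is also signalled as an ERROR.
                        (error (condition) condition)))))

;; Usage, assuming mgl-mat/test is loaded:
;; (run-tests-individually '(mgl-mat::test-asum mgl-mat::test-gemm!))
```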

lukego commented 2 years ago

This turns out to happen only with OpenBLAS. It works fine now that I have switched to an alternative libblas.