Closed lukego closed 2 years ago
I can't reproduce this on a Thinkpad without Cuda on SBCL 2.1.2 ... of course. The problem could be with the foreign array interfacing of MGL-MAT.
The do-configurations calls in the test cases print the various foreign array strategies being tested.
With what ctype and foreign array strategy do the errors occur?
Check that cuda enabled is always NIL in the output.
Here's a gist containing the complete contents of *inferior-lisp*, *slime-repl*, and *sldb*: https://gist.github.com/lukego/8430f3cc7962e18c482c87e702333e32
I checked and cuda enabled is NIL in every instance. Here's the last output before the error:
f: 200 (9,888)
* testing MGL-MAT:GEMM!
** ctype: :FLOAT
*** cuda enabled: NIL
**** foreign array strategy: :PINNED
If I run the tests again in the same Lisp image, I get all kinds of different errors like the ones above. My guess is that the heap is corrupted.
Thanks. One possibility is that lla::with-pinned-array (implemented in terms of SB-SYS:WITH-PINNED-OBJECTS) is acting up. Try changing do-foreign-array-strategies in test-mat.lisp in mgl-mat to:
(defmacro do-foreign-array-strategies (() &body body)
  `(dolist (*foreign-array-strategy* '(:static))
     ,@body))
... then recompile and test mgl-mat.
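To make the effect of that change concrete, here is a minimal sketch that runs on its own. It assumes a locally defined stand-in for *foreign-array-strategy* (the real variable lives in mgl-mat, and the default value used here is a guess) and a hypothetical collect-strategies helper. It only demonstrates that the body runs once, with the variable dynamically bound to :STATIC.

```lisp
;; Stand-in for mgl-mat's special variable, defined here only so the
;; sketch is self-contained. The default value is an assumption.
(defvar *foreign-array-strategy* :pinned)

;; Restricted version of the test macro: iterate over :STATIC only.
;; The list must be quoted; an unquoted (:static) would be evaluated
;; as a function call.
(defmacro do-foreign-array-strategies (() &body body)
  `(dolist (*foreign-array-strategy* '(:static))
     ,@body))

;; Hypothetical helper: record which strategies the body ran with.
(defun collect-strategies ()
  (let ((seen '()))
    (do-foreign-array-strategies ()
      (push *foreign-array-strategy* seen))
    (nreverse seen)))
```

Calling (collect-strategies) should return a one-element list containing :STATIC, and the global value of the stand-in variable is left unchanged afterwards.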
Tried that (with (:static) quoted) and still see the same behavior.
Do LLA and CFFI tests pass reliably?
Does this work?
(cffi:defcfun memcpy :void
  (dest :pointer)
  (src :pointer)
  (n :unsigned-long))

(loop repeat 1000
      do (let ((size (random 32000)))
           (let ((x (make-array size :initial-element 1 :element-type 'fixnum))
                 (y (make-array size :initial-element 0 :element-type 'fixnum)))
             (lla::with-pinned-array (xp x)
               (lla::with-pinned-array (yp y)
                 (memcpy yp xp (* 8 size))))
             (assert (= size (loop for e across y sum e))))))
Does this work?
Yes, that works.
LLA test suite passes consistently.
The CFFI test suite errors out, though not obviously for a related reason. I can look into why tomorrow:
Unable to load any of the alternatives:
("libffi.so.7" "libffi32.so.7" "libffi.so.6" "libffi32.so.6"
"libffi.so.5" "libffi32.so.5")
[Condition of type CFFI:LOAD-FOREIGN-LIBRARY-ERROR]
Restarts:
0: [RETRY] Try loading the foreign library again.
1: [USE-VALUE] Use another library instead.
2: [TRY-RECOMPILING] Recompile libffi and try loading it again
3: [RETRY] Retry loading FASL for #<CL-SOURCE-FILE "cffi-libffi" "libffi" "libffi">.
4: [ACCEPT] Continue, treating loading FASL for #<CL-SOURCE-FILE "cffi-libffi" "libffi" "libffi"> as having been successful.
5: [RETRY] Retry ASDF operation.
--more--
Interesting. I had to fight a little to get the CFFI test suite to run, including making it accept libffi.so.8 instead of libffi.so.7 and doing some hack-and-slash to make the pkgconfig file available on NixOS. In this situation the test suite fails:
4 out of 332 total tests failed: FSBV.WFO, FSBV.MAKEPAIR.1, FSBV.MAKEPAIR.2,
TEST-STATIC-PROGRAM.
1 unexpected failures: TEST-STATIC-PROGRAM.;
I'm not sure if this is relevant or not, but it might be something to follow up on with CFFI. Full log at https://gist.github.com/lukego/fca6cdff0c507b6c844bee5b9a09502c.
There is a similar open issue in mgl-mat (https://github.com/melisgl/mgl-mat/issues/3).
Assuming some mgl-mat tests cause corruption (?) and others do not, maybe try commenting out some from the TEST function in mgl-mat/test/test-mat.lisp. Perhaps there is a minimal test case to be had or some observation about what kind of tests cause problems.
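One way to approach that narrowing-down is to run the candidate tests one at a time and note which ones signal an error. The sketch below is an assumption-laden illustration, not part of mgl-mat: run-tests-individually is a hypothetical helper, and the caller supplies the list of (name . function) pairs. Memory corruption can surface in a later test than the one that caused it, so running each candidate in a fresh Lisp image is more reliable than trusting a single pass.

```lisp
;; Hypothetical helper: call each test function separately, catch any
;; ERROR it signals, and return an alist of (name . :ok) or
;; (name . :failed). TESTS is a list of (name . function) pairs chosen
;; by the caller; no names here are taken from mgl-mat.
(defun run-tests-individually (tests)
  (loop for (name . fn) in tests
        collect (cons name
                      (handler-case (progn (funcall fn) :ok)
                        (error (e)
                          (format t "~&~A failed: ~A~%" name e)
                          :failed)))))
```

For example, passing one thunk that returns normally and one that calls ERROR yields :OK for the first and :FAILED for the second. The same list can then be halved repeatedly to look for a minimal failing subset.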
This turns out to only happen with OpenBLAS. It works fine now that I tried an alternative libblas.
I'm seeing assorted errors when running
(ASDF:OOS 'ASDF:TEST-OP :MGL)
on SBCL 2.1.2 on Linux on a machine without CUDA (a Thinkpad). I captured a few samples below. I'm not sure if these errors are local to the actual problem or if they are indirectly caused by earlier heap corruption.
Anyone have a suggestion how to troubleshoot this?