mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

segfault with pca without -d:danger or -d:release #440

Open brentp opened 4 years ago

brentp commented 4 years ago

I am getting a segfault with pca, but only when built without release and without danger.

with gdb, I see:

Thread 16 "somalier" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff9c7b1700 (LWP 4913)]
nimFrame (s=0x7fff9c7b0690) at /home/brentp/.cache/nim/somalier_d/@m..@s..@s..@s.nimble@spkgs@sarraymancer-@hhead@stensor@soperators_blas_l1.nim.c:359
359         (*s).calldepth = (NI16)((*framePtr__HRfVMH3jYeBJz6Q6X9b6Ptw).calldepth + ((NI16) 1));
(gdb) bt
#0  nimFrame (s=0x7fff9c7b0690) at /home/brentp/.cache/nim/somalier_d/@m..@s..@s..@s.nimble@spkgs@sarraymancer-@hhead@stensor@soperators_blas_l1.nim.c:359
#1  check_size__A1o8pjA8sSUzNxmn3BamlAp_checks (a=a@entry=0x7ffffffdf2c0, b=b@entry=0x7fff9c7b0cd0)
    at /home/brentp/.cache/nim/somalier_d/@m..@s..@s..@s.nimble@spkgs@sarraymancer-@hhead@stensor@soperators_blas_l1.nim.c:1717
#2  0x00005555556a2045 in pluseq___bwzvgAiJVLEdRKerBiTtXA._omp_fn.0 ()
    at /home/brentp/.cache/nim/somalier_d/@m..@s..@s..@s.nimble@spkgs@sarraymancer-@hhead@stensor@soperators_blas_l1.nim.c:1893
#3  0x00007ffff7e32e96 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
#4  0x00005555556a8f7b in pluseq___bwzvgAiJVLEdRKerBiTtXA (a=0x7ffffffdf2c0, b=b@entry=0x7fff9c7b0cd0)
    at /home/brentp/.cache/nim/somalier_d/@m..@s..@s..@s.nimble@spkgs@sarraymancer-@hhead@stensor@soperators_blas_l1.nim.c:1848
#5  0x00005555556de762 in sum__Y49asUKPCVhBx9cvcl0dU9blA._omp_fn.0 () at /home/brentp/.cache/nim/somalier_d/@m..@s..@s..@s.nimble@spkgs@sarraymancer-@hhead@stensor@saggregate.nim.c:1184
#6  0x00007ffff7e3c31e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
#7  0x00007ffff7e07669 in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007ffff7d2f323 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If that's not helpful, I can try to put together a way to reproduce it (from somalier).

mratsim commented 4 years ago

Looking at the stacktrace we have this

(*s).calldepth = (NI16)((*framePtr__HRfVMH3jYeBJz6Q6X9b6Ptw).calldepth + ((NI16) 1));

NI16 is int16 and can only hold integers up to 32767. It is probably crashing on an overflow error. How big was the tensor? I'm surprised that the calldepth could reach that much; there shouldn't be any recursion in the sum functions mentioned. https://github.com/mratsim/Arraymancer/blob/fe896870f8a67f961a930f832af72354f32c3da2/src/tensor/aggregate.nim#L27-L35
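For reference, the int16 range can be checked directly in Nim. This is only a small sketch of the limit under discussion; the name maxDepth is made up for illustration and does not come from the Nim runtime or Arraymancer.

```nim
# NI16 in the generated C corresponds to Nim's int16.
# `maxDepth` is a hypothetical name used only for this illustration.
let maxDepth = high(int16)   # largest value an int16 can hold
echo maxDepth
echo sizeof(int16)           # width of the type in bytes
```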

nimFrame calls are not inserted in release mode, hence the crash doesn't appear there. It should also disappear with --stackTrace:off (with push/pop pragmas probably being the proper fix) and maybe with --overflowChecks:off.
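A minimal sketch of what the push/pop approach could look like, assuming the affected procs can be wrapped at their definition site. The proc addInPlace is a hypothetical stand-in, not Arraymancer's actual in-place addition, and this is not necessarily how the library would apply the fix.

```nim
# Hypothetical helper, not part of Arraymancer: it only shows where the
# pragmas would go around a proc that may run inside an OpenMP thread.
{.push stackTrace: off.}   # procs defined below get no stack-frame bookkeeping
proc addInPlace(dst: var seq[float32], src: seq[float32]) =
  doAssert dst.len == src.len
  for i in 0 ..< dst.len:
    dst[i] += src[i]
{.pop.}                    # restore the previous stackTrace setting

var a = @[1'f32, 2'f32, 3'f32]
addInPlace(a, @[10'f32, 20'f32, 30'f32])
echo a
```

Compared with passing --stackTrace:off on the command line, the pragmas limit the change to the wrapped procs, so the rest of a debug build keeps its stack traces.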

brentp commented 4 years ago

with --stackTrace:off I get:

Error: unhandled exception: /home/brentp/.nimble/pkgs/arraymancer-0.6.0/tensor/selectors.nim(218, 26) `dstSlice`gensym34935440[axis].a == size`gensym34935437`  [AssertionDefect]

brentp commented 4 years ago

that's occurring in the code that's using the new fancy indexing, so I assume that's corrupting memory and then the error is appearing later (?).

brentp commented 4 years ago

that assertion error is reproducible with:

import arraymancer

var T = randomTensor(2504, 17384, 0.5'f32)
var sel = randomTensor(T.shape[1], 1'f32).asType(bool)  # boolean column mask
sel[100..200] = false
T = T[_, sel]  # reassigning the selection to T triggers the assertion

mratsim commented 4 years ago

It seems like the issue is with reassigning a tensor to itself; this doesn't trigger the assertion:

var T = randomTensor(2504, 17384, 0.5'f32)
var sel = randomTensor(T.shape[1], 1'f32).asType(bool)
sel[100..200] = false
let U = T[_, sel]

It might even solve your original bug. I'm not sure how to prevent that, though.