torch / torch7

http://torch.ch

Some of Torch7's native libraries segfault when compiled with musl libc #549

Open vifino opened 8 years ago

vifino commented 8 years ago

Hello.

Torch seems to have some incompatibilities with musl. One case where it segfaults is when nn is required.

(gdb) run
Starting program: /torch/bin/luajit 
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/

 _____              _     
|_   _|            | |    
  | | ___  _ __ ___| |__  
  | |/ _ \| '__/ __| '_ \ 
  | | (_) | | | (__| | | |
  \_/\___/|_|  \___|_| |_|

JIT: ON SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
th> nn=require("nn")
[New LWP 1403]
[New LWP 1404]
[New LWP 1405]
[New LWP 1406]
[New LWP 1407]
[New LWP 1408]
[New LWP 1409]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7d9a113 in malloc_usable_size () from /lib/ld-musl-x86_64.so.1

I am not sure if this is due to a bug in musl or something else. I am sure, however, that there must be some sort of incompatibility that needs fixing.

My testing environment is Alpine Linux v3.3.

If you have Docker installed, you can run docker run --rm -it vifino/torch, which will put you right in the th repl. If you want to debug with gdb, run docker run --rm -v "$(pwd)":/pwd -it vifino/torch sh to get a shell, then apk update && apk add gdb to update the repository lists and install gdb.

Thanks, Adrian "vifino" Pistol

soumith commented 8 years ago

@vifino I don't have musl or Alpine Linux, and I am unlikely to investigate further. The stack points to malloc_usable_size, and you can compile torch with -DHAVE_MALLOC_USABLE_SIZE=0 to see if that fixes things (maybe libmusl has an incorrect implementation of that function). You'll have to:

git clone https://github.com/torch/torch7
cd torch7
# modify this line to always trigger :  https://github.com/torch/torch7/blob/master/lib/TH/CMakeLists.txt#L90-L92
luarocks make rocks/torch-scm-1.rockspec

Hope this helps.
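For readers unfamiliar with this kind of switch, here is a minimal sketch of how a compile-time guard around malloc_usable_size can work. It is not the actual TH source: the names alloc_size_or_zero, tracked_alloc, tracked_free and heap_bytes are hypothetical, and the `#if` form of the check is an assumption. If the real code tests the macro with `#ifdef` or `defined(...)` alone, passing -DHAVE_MALLOC_USABLE_SIZE=0 would still leave the call enabled, which may be why editing the CMake check is suggested above.

```c
/* Sketch only: all function and variable names here are hypothetical,
   not taken from Torch7's TH library. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#if defined(HAVE_MALLOC_USABLE_SIZE) && HAVE_MALLOC_USABLE_SIZE
#include <malloc.h>   /* declares malloc_usable_size on glibc and musl */
#endif

/* Running total of bytes the allocator reports for live blocks. */
long heap_bytes = 0;

/* Returns the allocator-reported size of ptr, or 0 when the
   feature is compiled out (macro undefined or defined as 0). */
size_t alloc_size_or_zero(void *ptr)
{
    if (ptr == NULL)
        return 0;
#if defined(HAVE_MALLOC_USABLE_SIZE) && HAVE_MALLOC_USABLE_SIZE
    return malloc_usable_size(ptr);
#else
    return 0;
#endif
}

void *tracked_alloc(size_t n)
{
    void *p = malloc(n);
    heap_bytes += (long)alloc_size_or_zero(p);
    return p;
}

void tracked_free(void *p)
{
    /* Query the size before freeing; the pointer is invalid afterwards. */
    heap_bytes -= (long)alloc_size_or_zero(p);
    assert(heap_bytes >= 0);
    free(p);
}
```

Built without the macro (or with it set to 0), the size query is never made and heap_bytes simply stays at 0, so a crash that disappears in that configuration would point at the size query or at the pointer handed to it, rather than at malloc and free themselves.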

korkinof commented 8 years ago

Hi guys, I may have a similar issue. I started getting segfaults, but I am not sure whether it is a bug on my side. It shouldn't be, as I am not using any proprietary C or CUDA code. Anyway, I am posting my backtrace here, although I am not sure whether it is relevant to this issue, or whether the problem is in torch7 or elsewhere.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe6f0a700 (LWP 6914)]
musable (mem=0x7bffbbb41ba0) at malloc.c:4567
4567    malloc.c: No such file or directory.
(gdb) bt
#0  musable (mem=0x7bffbbb41ba0) at malloc.c:4567
#1  __malloc_usable_size (m=0x7bffbbb41ba0) at malloc.c:4581
#2  0x00007ffff5647bb9 in THFree () from /usr/local/lib/libTH.so
#3  0x00007ffff56713f6 in THFloatTensor_free () from /usr/local/lib/libTH.so
#4  0x00007ffff5cfd05d in torch_FloatTensor_free () from /usr/local/lib/lua/5.1/libtorch.so
#5  0x0000000000475bc9 in lj_BC_FUNCC ()
#6  0x0000000000416e19 in gc_call_finalizer ()
#7  0x0000000000443c56 in gc_finalize ()
#8  0x0000000000443dd3 in gc_onestep ()
#9  0x0000000000444304 in lj_gc_step ()
#10 0x000000000045cea4 in lua_newuserdata ()
#11 0x00007ffff5aa02ac in luaT_pushudata () from /usr/local/lib/libluaT.so
#12 0x00007ffff5cfde83 in torch_FloatTensor___index__ () from /usr/local/lib/lua/5.1/libtorch.so
#13 0x0000000000475bc9 in lj_BC_FUNCC ()
#14 0x00007ffff5a9ea86 in luaT_mt__index () from /usr/local/lib/libluaT.so
#15 0x0000000000475bc9 in lj_BC_FUNCC ()
#16 0x00000000004627c0 in lua_pcall ()
#17 0x00007fffc4d0e607 in newthread () from /usr/local/lib/lua/5.1/libthreads.so
#18 0x00007fffc4d100eb in thread_closure () from /usr/local/lib/lua/5.1/libthreads.so
#19 0x00007ffff7170182 in start_thread (arg=0x7fffe6f0a700) at pthread_create.c:312
#20 0x00007ffff6c8747d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

soumith commented 8 years ago

@korkinof Can you see if pre-loading a different allocator like libjemalloc will help, even when linking against musl libc?

korkinof commented 8 years ago

So I have jemalloc installed from the Ubuntu repos, and I ran this beforehand: export LD_PRELOAD=$LD_PRELOAD:/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 Would that do the trick? Because it didn't help...

korkinof commented 8 years ago

I am now running single-threaded data loading. It still crashes, though more rarely, I think. The backtrace is similar.

Program received signal SIGSEGV, Segmentation fault.
_int_malloc (av=0x7ffff6f4b760, bytes=16) at malloc.c:3489
3489    malloc.c: No such file or directory.
(gdb) bt
#0  _int_malloc (av=0x7ffff6f4b760, bytes=16) at malloc.c:3489
#1  0x00007ffff6c0f7b0 in __GI___libc_malloc (bytes=16) at malloc.c:2891
#2  0x00007ffff5647aca in THAlloc () from /usr/local/lib/libTH.so
#3  0x00007ffff565248d in THByteTensor_rawResize () from /usr/local/lib/libTH.so
#4  0x00007ffff56620fb in THByteTensor_newWithTensor () from /usr/local/lib/libTH.so
#5  0x00007ffff5ce4bcf in torch_ByteTensor_index () from /usr/local/lib/lua/5.1/libtorch.so
#6  0x0000000000475bc9 in lj_BC_FUNCC ()
#7  0x00007ffff5a9ea86 in luaT_mt__index () from /usr/local/lib/libluaT.so
#8  0x0000000000475bc9 in lj_BC_FUNCC ()
#9  0x0000000000416e19 in gc_call_finalizer ()
#10 0x0000000000443ce0 in gc_finalize ()
#11 0x0000000000443dd3 in gc_onestep ()
#12 0x0000000000462ba8 in lua_gc ()
#13 0x0000000000462cae in lj_cf_collectgarbage ()
#14 0x0000000000475bc9 in lj_BC_FUNCC ()
#15 0x00000000004635d2 in lj_cf_dofile ()
#16 0x0000000000475bc9 in lj_BC_FUNCC ()
#17 0x00000000004627c0 in lua_pcall ()
#18 0x0000000000405d3b in dotty ()
#19 0x0000000000406dbe in pmain ()
#20 0x0000000000475bc9 in lj_BC_FUNCC ()
#21 0x0000000000462842 in lua_cpcall ()
#22 0x0000000000404c48 in main ()