q66 / cffi-lua

A portable C FFI for Lua 5.1+
MIT License

segfaults in ffi::newctype #27

Closed: niess closed this issue 3 years ago

niess commented 3 years ago

Hello,

thank you for the previous patches.

I am currently stuck on the following issue. When running a "complex" program, I eventually get segfaults in ffi::newctype. Below is a typical backtrace obtained with gdb:

#0  0x0000555555564385 in sweeplist ()
#1  0x0000555555564475 in sweepstep ()
#2  0x00005555555658b4 in singlestep ()
#3  0x0000555555566020 in luaC_step ()
#4  0x000055555555e137 in lua_newuserdatauv ()
#5  0x00007ffff73fb4db in operator new (n=40, L=0x5555557942a8) at ../src/lua.hh:158
#6  0x00007ffff73fa83e in ffi::newctype<ast::c_type>(lua_State *, <unknown type in lib/lua/5.4/cffi.so, CU 0xca9, DIE 0x51d2>) (
    L=0x5555557942a8, args#0=<unknown type in lib/lua/5.4/cffi.so, CU 0xca9, DIE 0x51d2>) at ../src/ffi.hh:282
...

Unfortunately, I could not reproduce this issue with a minimal example. The segfaults happen in several of my use cases and are always triggered by the sequence shown above: ffi::newctype followed by sweeplist. It does not happen on the first call to ffi::newctype, but only after roughly O(100) calls.

Sorry, this is not very helpful, but I don't know what to check at this point. Please let me know if there are extra values that would be meaningful to print out, e.g. using gdb.

When using LuaJIT/ffi I have no segfaults.

q66 commented 3 years ago

this doesn't even look like our bug, but possibly a bug in Lua itself (in its garbage collector)

you should try different versions as well, and make sure you have the latest version of 5.4 (currently 5.4.2)

niess commented 3 years ago

Thanks for the hints. I tried with Lua 5.3.5, and the segfaults now seem to be caught earlier, resulting in aborts. For example, I get the following error messages:

free(): invalid next size (normal)
Aborted

or

malloc(): invalid size (unsorted)
Aborted

Could these messages be generated by cffi? Or by Lua itself? It would be helpful to see where in the Lua code this happens, e.g. with an error trace. But maybe that's not possible from an abort?

I have not yet been able to reduce the problem to a minimal example. Maybe it is related to my application mixing Lua allocations with direct C allocations and frees? In principle, I only free memory that was allocated with malloc (assuming there is no bug in my app). But, for example, I have cases where I do ptr = ffi.new('void *[1]') and then pass ptr to a C library that allocates (and later frees) memory in ptr[0] using malloc (free). I use ffi.gc to ensure that the memory is released when ptr is garbage collected.

q66 commented 3 years ago

well, might be our bug but without a testcase there isn't really anything i can do

these messages are generated by glibc's memory allocator

niess commented 3 years ago

@q66 I finally found the reason for the memory errors in my application. It looks like structures containing arrays with more than one dimension have the wrong size. For example, the following currently fails with cffi-lua but works with LuaJIT/ffi:

local ffi = jit and require('ffi') or require('cffi')

ffi.cdef([[
struct transform {
    double matrix[3][3];
};
]])

assert(ffi.sizeof('struct transform') == 3 * 3 * ffi.sizeof('double'))

The structure has a size of 3*8 = 24 instead of 3*3*8 = 72. I think that it really has the wrong size, i.e. ffi.sizeof likely reports the size that is actually allocated, but that allocated size is wrong (too small). I think so because I get corrupted memory from overwriting the heap when using such constructs.

Note, however, that ffi.sizeof('double [3][3]'), i.e. the same array outside of a structure, seems to be correct.

q66 commented 3 years ago

I see, that would explain it...

q66 commented 3 years ago

okay, that should be fixed now... thanks for reporting

niess commented 3 years ago

Thanks for the patch. It works fine now :)