wahern / cqueues

Continuation Queues: Embeddable asynchronous networking, threading, and notification framework for Lua on Unix.
http://25thandclement.com/~william/projects/cqueues.html
MIT License
248 stars 37 forks source link

cqueues unusable on aarch64 under LuaJIT #241

Open pspacek opened 4 years ago

pspacek commented 4 years ago

cqueues on Debian on aarch64 explodes with error:

/usr/bin/luajit -l cqueues -e os.exit(0)

Full build log: https://buildd.debian.org/status/fetch.php?pkg=knot-resolver&arch=arm64&ver=5.1.2-1&stamp=1596037546&raw=0

This is probably related to some LuaJIT limitations on aarch64 but it effectivelly makes cqueues unusable for applications where performance matters.

pspacek commented 4 years ago

Oh, I forgot to copy&paste the error itself!

/usr/bin/luajit: bad light userdata pointer
stack traceback:
    [C]: at 0xffffb6342ad0
    [C]: in function 'require'
    /usr/share/lua/5.1/cqueues.lua:2: in function </usr/share/lua/5.1/cqueues.lua:1>
    [C]: at 0xaaaae1757d08
    [C]: at 0xaaaae170a4c0
daurnimator commented 4 years ago

Can you confirm that CQS_USE_47BIT_LIGHTUSERDATA_HACK is being set?

Could you provide a C stack trace?

vcunat commented 4 years ago

Debian doesn't provide -dbgsym for this platform (no idea why). When I build it myself from master on the same machine, the error disappears :-/

daurnimator commented 4 years ago

Looks like your debian build could be using an old release:

Get:110 https://deb.debian.org/debian unstable/main arm64 lua-cqueues arm64 20190813-1 [179 kB]

vcunat commented 4 years ago

No, that cqueues version had this fixed already. And my testing was with this anyway:

# apt show lua-cqueues
Package: lua-cqueues
Version: 20200726-1
[...]
rhenwood-arm commented 4 years ago

I'm not close to this project, but have some familiarity with the LuaJIT story on AArch64. MoonJIT has an active AArch64 port - and may be an alternative?

vcunat commented 4 years ago

This is the same with moonjit 2.1.2.

I guess this isn't enough "proof"?

(gdb) bt
#0  lj_err_msg (L=0x18255adee378, em=em@entry=LJ_ERR_BADLU) at lj_err.c:645
#1  0x0000aaaaaab00180 in lua_pushlightuserdata (L=<optimized out>, p=<optimized out>) at lj_state.c:121
#2  0x0000fffff7d29378 in luaopen.cqueues () from /usr/local/lib/lua/5.1/_cqueues.so
daurnimator commented 4 years ago

@vcunat recompile cqueues with -O0; I want to see the exact line in _cqueues.so that is causing the failure.

vcunat commented 4 years ago

As I wrote, I can't reproduce the failure when recompiling by hand (so far). Maybe the address layout is sensitive to something that I don't know about.

vcunat commented 4 years ago

Line isn't surprising: https://salsa.debian.org/lua-team/lua-cqueues/-/blob/2eb74cf0/src/cqueues.c#L2940

vcunat commented 4 years ago

It's the defined(LUA_JITLIBNAME) in the condition. Debian naturally assumes that packages built against vanilla lua 5.1 will be usable with luajit, moonjit, etc.

vcunat commented 4 years ago

For reference, luaossl suffers from the very same (except its version is ancient on Debian anyway).

daurnimator commented 4 years ago

Debian naturally assumes that packages built against vanilla lua 5.1 will be usable with luajit, moonjit, etc.

sadly that isn't true for 64bit platforms. if they want to share the same build for both lua 5.1 and LuaJIT then they'll need to compile with CQS_USE_47BIT_LIGHTUSERDATA_HACK defined.

pspacek commented 4 years ago

@yac Please have a look once you get to Debian packaging.

pspacek commented 4 years ago

@daurnimator Do you see any reason why not enable CQS_USE_47BIT_LIGHTUSERDATA_HACK on all platforms? Or at least on all 64 bit platforms? Does it have any adverse effects?

vcunat commented 4 years ago

Then it might be worth fixing debian/rules in this repo as well, though I don't know if anyone uses it.

daurnimator commented 4 years ago

@daurnimator Do you see any reason why not enable CQS_USE_47BIT_LIGHTUSERDATA_HACK on all platforms? Or at least on all 64 bit platforms? Does it have any adverse effects?

Yes: it could result in mysterious errors/failures/undefined behaviour if you manage to map memory with the same trailing 47 bits.

pspacek commented 4 years ago

Hmm, that sounds like a good reason not to enable it everywhere :-(

vcunat commented 4 years ago

If the collision risk is notable and assuming we're OK with focusing on ==48 bits, I'd suggest using right shift by one bit instead of that mask. In most use cases you know for certain that you won't use the hack on two addresses one byte apart.

daurnimator commented 4 years ago

If the collision risk is notable and assuming we're OK with focusing on ==48 bits, I'd suggest using right shift by one bit instead of that mask. In most use cases you know for certain that you won't use the hack on two addresses one byte apart.

One idea I had was to make the contents of the pointer aligned to e.g. 32bytes, so that we can take the pointer of the end of it, and shift right by 5 (and then use the same 47-bit mask). Does e.g. aarch64 on debian ever set the top 12 bits of pointers?

vcunat commented 4 years ago

Well... to get more confidence, I'd look at ARMv8 manual – page 2083 describes what the virtual addresses can look like. If I understand it right, "negative" values are allowed, and thus the total possible space is at most 2x 48 bits or 2x 52 bits.

Say we want to focus on 2x 48 only (for now). Right shift by two bits should give us unique values even after masking to 47 bits (assuming no pointers get less than four bytes apart). It's even possible to check whether the passed pointer fits into the 2x48 range (say, in assert mode at least).

I'm not sure if 64-byte alignment would be realistic in cqueues case, to allow using the same approach even for 2x52. As for current practice, I don't expect the 52 variant is common yet. I haven't seen a "negative" pointer yet but I haven't really looked so far (maybe they're not useful in user-space).