Closed: deliciouslytyped closed this issue 3 years ago
An attempt at debugging that I forgot to mention, also the most fruitful so far, was trying to use strace. A short excerpt is attached here:
....
read(3, ",placeholder"..., 498) = 498
sendto(6, "placeholder"..., 545, 0, {sa_family=AF_INET, sin_port=htons(1676), sin_addr=inet_addr("[redacted]")}, 16) = -1 EMSGSIZE (Message too long)
read(3, 0x7ffdbe246372, 66033) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe246371, 66034) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe246370, 66035) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636f, 66036) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636e, 66037) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636d, 66038) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636c, 66039) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636b, 66040) = -1 EFAULT (Bad address)
....
There's a bunch of sendto()s and read()s whose arguments shift by one each call, and then a bunch of those EFAULT failures. I'm assuming this is what trashes the stack.
TODO: get a better strace and don't axe so much info
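As a sanity check on what EFAULT means here: read(2) fails that way when the destination buffer isn't a valid writable address range, which fits the idea that the pointers above are already trash. A minimal illustration (my own snippet, nothing to do with tinc's code):

```c
/* Minimal illustration (not tinc code): read(2) fails with EFAULT when the
 * destination buffer is not a valid writable address range. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return 1;
    /* Deliberately pass a bogus destination pointer; on a typical Linux setup
     * the low pages are unmapped, so the kernel can't copy anything out. */
    ssize_t n = read(fd, (void *)0x1000, 66033);
    printf("read returned %zd (%s)\n", n, n < 0 ? strerror(errno) : "ok");
    close(fd);
    return 0;
}
```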
TL;DR on the rr issue is that rr uses some of its own syscalls (or something like that) for internal communication, and systemd-nspawn didn't have them whitelisted, because my whitelist was based on the man page's enumeration of the kernel's built-in syscall names. Furthermore, rr (IIRC) didn't warn or abort about this in version 5.3.0, which was the original version I was using.
rr master currently contains checks and notifies the user if this is an issue, as well as suggesting the -n flag which avoids it.
I was able to successfully create an rr recording of tinc in the container with `rr record -n ...`; this should allow (me) to reproducibly debug the issue, even when I don't have access to the specific network that causes the crash.
My current reading is the following:
https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net_packet.c#L1338 is passing 18+65535 as the length argument while https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net_packet.c#L1085 allocates a vpn packet with a buffer size of 1673 for some reason, per:
gef➤ ptype vpn_packet_t
type = struct vpn_packet_t {
length_t len;
length_t offset;
int priority;
uint8_t data[1673];
}
...namely https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net.h#L36 used in https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net.h#L91-L96. I don't know why this isn't causing weird crashes all over the place on other people's machines. So basically the stack gets smashed with 64k of presumably random data by a read() call in randomize()? Once that overwrites the stack, the stack smashing check at the end of the probe function triggers, causing the program to abort.
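To make the mechanism concrete, here's a minimal sketch of the pattern I think is at play (my own toy code, not tinc's): a fixed 1673-byte packet buffer on the stack gets filled with far more data than it can hold, and the stack protector aborts on return.

```c
/* Toy reproduction of the suspected pattern, not tinc's code.
 * Build with: gcc -O0 -fstack-protector-strong smash.c */
#include <string.h>

typedef unsigned short length_t;

/* Mimics the shape of vpn_packet_t above: a small fixed data buffer. */
typedef struct {
    length_t len;
    length_t offset;
    int priority;
    unsigned char data[1673];
} fake_packet_t;

static void probe(size_t bogus_len) {
    fake_packet_t packet;
    /* Stand-in for the read()/randomize() fill: anything well past 1673 bytes
     * overwrites the stack canary, so the function aborts with
     * "stack smashing detected" on return. */
    memset(packet.data, 0xAA, bogus_len);
}

int main(void) {
    probe(4096); /* the real case apparently used something closer to 64k */
    return 0;
}
```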
What I still don't understand is why this crash triggers reliably, in the same piece of tinc code, on a single specific network, every (? - or at least almost every) time I try to start tincd.
Another way to put the question is: why is the offset so big? Or is it underflowing, or -1, or something?
gef➤ ptype length_t
type = unsigned short
I.e., a two-byte value.
The assembly code between the powf call and the probe call, including the addition, is here:
0x0000560de58b5d39 <+345>: call 0x560de58ab0a0 <powf@plt>
0x0000560de58b5d3e <+350>: mov rdi,r12
0x0000560de58b5d41 <+353>: movzx r13d,WORD PTR [r12+0x23e]
0x0000560de58b5d4a <+362>: cvttss2si esi,xmm0
0x0000560de58b5d4e <+366>: movzx esi,si // changes esi from 0xffffffff to 0xffff
0x0000560de58b5d51 <+369>: add esi,r14d
0x0000560de58b5d54 <+372>: call 0x560de58b5930 <send_udp_probe_packet>
I can't just inspect the value of offset trivially because it's optimized out.
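Reading that assembly back into C, my understanding is roughly the following (a runnable sketch; base, exponent and other_term are made-up placeholders, not the actual tinc values):

```c
/* Rough rendering of the instruction sequence above.
 * Build with: gcc -O0 sketch.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    float base = -1.0f, exponent = 1.0f;
    int other_term = 512;

    int converted = (int)powf(base, exponent); /* cvttss2si esi,xmm0 : -1.0f -> -1 (0xffffffff) */
    int clipped   = converted & 0xffff;        /* movzx esi,si       : -1 clipped to 0xffff == 65535 */
    int length    = clipped + other_term;      /* add esi,r14d       : a huge value becomes the length */

    printf("converted=%d clipped=%d length=%d\n", converted, clipped, length);
    return 0;
}
```

With these placeholder inputs it prints converted=-1 clipped=65535 length=66047, which is at least in the same ballpark as the huge offsets I'm seeing.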
The series of offset values decreases one by one through 1328..512, and then in the next cycle it's 66407 (or something like that, I might've mixed things up). I haven't figured out what exactly is going on yet.
You might also be interested in compiling it with -g -fsanitize=address,undefined to use ASAN+UBSAN; they are supported by both GCC and Clang now. Sanitizers detect some problems better than valgrind (although there are still problems which are better detected by valgrind than by ASAN/UBSAN/MSAN), have less runtime overhead, and (IMHO) give slightly better diagnostic output.
Upd: `const length_t offset = powf(...)` - oh, this is definitely a bad construction, one that could be detected by the -Wnarrowing or -Wconversion compiler options.
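For illustration, a stripped-down version of that construction (not the actual tinc line; compute_offset and its parameters are names I made up):

```c
#include <math.h>
#include <stdio.h>

typedef unsigned short length_t;

static length_t compute_offset(float base, float exponent) {
    /* gcc -Wconversion flags this implicit float -> unsigned short conversion;
     * if powf() ever returns a negative value, the conversion is undefined
     * behaviour, and on x86 it shows up as 0xffff, as in the assembly above. */
    const length_t offset = powf(base, exponent);
    return offset;
}

int main(void) {
    printf("%d\n", compute_offset(-1.0f, 1.0f)); /* typically prints 65535 */
    return 0;
}
```

Compiling this with `gcc -Wconversion -c ... -lm` flags the float to unsigned short conversion, which is exactly the spot where a negative powf() result silently turns into 65535.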
From what little I figured out, I don't think the cast was what was actually responsible for the value? At least not directly.
I imagine the problem is why `cvttss2si esi,xmm0` (the instruction at <+362> above) loads 0xffffffff (or however long it was) into esi in the first place? That's -1, right? Or some floating-point special value?
I'm probably completely misinterpreting what's going on here.
Oh - or is this a signed-unsigned conversion issue and it is just misinterpreting a -1?
Ok, I did what I should have done anyway and checked what that value is as a float (e.g. https://stackoverflow.com/questions/13673045/what-will-be-the-value-in-float-if-i-have-a-binary-number-as-1111111111111111-an), and it is in fact -nan on my machine.
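For reference, the check itself is just reinterpreting the bit pattern as a float (my own snippet, same idea as the StackOverflow answer):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t bits = 0xffffffffu;
    float f;
    memcpy(&f, &bits, sizeof f); /* reinterpret the bits as an IEEE-754 float */
    printf("%f\n", f);           /* prints -nan here: all-ones is a negative quiet NaN */
    return 0;
}
```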
So the question is which case in https://en.cppreference.com/w/c/numeric/math/pow is triggered. I still haven't managed to read out the float args properly in the case of the failing iteration, I think, so that's probably the next task, unless someone can tell what's wrong by staring at the code.
If I'm doing this right, powf() is getting called with powf(-1, 1);
gef➤ print $xmm1
$69 = {
v4_float = {1, 0, 0, 0},
v2_double = {5.2635442471208903e-315, 0},
v16_int8 = {0x0, 0x0, 0x80, 0x3f, 0x0 <repeats 12 times>},
v8_int16 = {0x0, 0x3f80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
v4_int32 = {0x3f800000, 0x0, 0x0, 0x0},
v2_int64 = {0x3f800000, 0x0},
uint128 = 0x3f800000
}
gef➤ print $xmm0
$70 = {
v4_float = {-1, 0, 0, 0},
v2_double = {1.5873523201947252e-314, 0},
v16_int8 = {0x0, 0x0, 0x80, 0xbf, 0x0 <repeats 12 times>},
v8_int16 = {0x0, 0xbf80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
v4_int32 = {0xbf800000, 0x0, 0x0, 0x0},
v2_int64 = {0xbf800000, 0x0},
uint128 = 0xbf800000
}
(-1)^1 = -1, which would explain why the return value is -1. That doesn't make sense though, because interval should be 512?
I am just labelling / cleaning up the issue list. This is the most detailed investigation I have come across so far. Thanks for all the effort you put in.
pow can be underflowed if n->maxmtu is less than 512, due to https://github.com/gsliepen/tinc/blob/1.1/src/net_packet.c#L1328. Or in other words, minmtu is not necessarily less than maxmtu. Although the MTU normally isn't that small, I'm guessing it can be erroneously lowered? Anyway, PR to follow shortly.
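In miniature, the failure mode and one possible guard (just a sketch of the clamping idea, not necessarily what the PR will do):

```c
/* Sketch only; clamp_maxmtu is a made-up helper, not tinc code. */
#include <stdio.h>

typedef unsigned short length_t;

/* Ensure the probe-length computation never sees maxmtu below minmtu,
 * which is what made the powf() base negative in the trace above. */
static length_t clamp_maxmtu(length_t minmtu, length_t maxmtu) {
    return maxmtu < minmtu ? minmtu : maxmtu;
}

int main(void) {
    /* e.g. a node whose maxmtu was erroneously lowered below the usual floor */
    printf("%d\n", clamp_maxmtu(512, 300)); /* prints 512 */
    return 0;
}
```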
Oh, and thanks @deliciouslytyped for all your troubleshooting. That powf(-1, ...) was all the hint I needed.
Moving from a specific physical wifi network to another (and removing my laptop from the dock (TODO: rule this out)) causes the tincd I have running in a container (TODO: rule this out) to segfault reproducibly, but I haven't been able to isolate the bug.

It consistently gets up to the `Sending 70 bytes of metadata to [redacted] ([redacted] port 20)` portion of the attached log, and then (obviously at some point after this) segfaults. Starting tincd with gdb results in the stack being incomprehensible (to me). Starting it with `rr record` results in a stack smashing notification, which probably explains the previous point. However, due to an `rr replay` issue I haven't been able to yield any good info from that either. Maybe valgrind is worth a shot?

Here is a log:
One may note (what I assume is) a race condition at `Cannot open config file /etc/tinc/gate/hosts/[redacted]: No such file or directory` (probably caused by `--chaos`), which succeeds a few lines later.

Notes on debugging (mainly as a note to myself):

systemd: This is running in a container on NixOS, and the main service configs are on a read-only file system. This makes changing the service file to run a tincd with debug symbols, or otherwise, impossible. It turns out to be a bit obscure, but `systemd-analyze unit-paths` lists the paths systemd searches for service files. However, I'm told this "isn't accurate" on NixOS. Before that I found a list at https://www.freedesktop.org/software/systemd/man/systemd.unit.html (via https://askubuntu.com/questions/876733/where-are-the-systemd-units-services-located-in-ubuntu); however, I don't know how that list is derived, nor do I know how to change or add entries. Furthermore, some of the directories listed seem to be special purpose (randomly putting the .service in some of them didn't work and caused systemd to either not see the service or complain). (TODO: look into this)

This was needed for two debugging variants, gdb and rr.
However, in the end it wasn't actually necessary to start tincd via the service files; the issues were reproducible just fine by calling the executable on the command line. I was just trying to keep reproduction in as similar an environment as the original. Part of my confusion was caused by tincd behaving differently if started by gdb than if gdb was just attached to it. This ended up being an issue with ASLR, and was fixed with `set disable-randomization off`. I don't know what could cause this, but the symptoms were that tinc would seemingly hang instead of crashing.

gdb variant: I ended up copying the container's existing tinc network service file to /run/systemd/system and editing the service files there. I used `nix-store --add` to add the Exec script to the nix store, to work around an issue that I didn't look into further, which came up when I tried to put the script in the same directory as the service file and refer to it by absolute path in the service definition ("/run/systemd/system/gate-init" or such). That may have been a problem of me mixing up the outer system and the container, but I'm not sure. It might have been systemd sandboxing related. The gdb command ended up being (I lost it in scroll, but it's similar to the later rr command) something like `gdb path/to/tincd -ex "run -tinc -parameters -here"`. The tincd path used was the nix store path corresponding to the output of `nix-build -I nixpkgs=channel:nixos-unstable -E "with import <nixpkgs> {}; tinc_pre.overrideAttrs (old: { dontStrip = true; })" -v --no-out-link`.
rr variant: The first issue while trying to run rr in the container was `[FATAL /build/source/src/PerfCounters.cc:317:start_counter() errno: EPERM] Failed to initialize counter` and some backtrace that I'm missing. I'd recently heard that rr does stuff with CPU performance counter registers, so I figured this was a container permissions issue (eventually discovering https://github.com/mozilla/rr/wiki/Docker). So it turns out, in the wonderful feat of engineering that systemd is, `systemd-nspawn` doesn't currently allow disabling its seccomp protections, per https://lists.freedesktop.org/archives/systemd-devel/2020-June/044756.html. I ended up getting a list of syscalls from https://man7.org/linux/man-pages/man2/syscalls.2.html, cleaning it up a bit, and running `cat syscalls | cut -d " " -f 8 | cut -d "(" -f 1 | xargs -I{} echo --system-call-filter="{}" | xargs echo` to generate the syscall whitelist. Using the systemd service methods listed above I constructed a `nixos-container`-derived configuration manually (I really should have just generated a new container with the parameters set...). For anyone needing to do such a hack sometime, you can copy it from appendix 2. I also added `--capability="CAP_SYS_PTRACE"` and `--capability="CAP_SYS_ADMIN"` in the hopes that they would be enough, before generating the whitelist, but I don't know if these did anything. Which is to say, I don't know which of all the things I did at the same time actually did anything. The rr command ended up being `/nix/store/yairbna66xfrib2snjdq9z5gxqxlk4bw-rr-5.3.0/bin/rr record /nix/store/3218jrqqrh8cbbnhijhh2vxw2m9vm85m-tinc-1.1pre17/bin/tincd -D -U gate -n gate --pidfile /run/gate.pid -d 4`.

There was an issue at some point where trying to run the tincd in rr complained about some sort of dynamic linking error. I don't know the cause of this, but I probably used the wrong tincd path (NixOS lets you link different versions of stuff against each other on the same system); I just got the one from the above "gdb variant" section and it worked. After that the `rr record` command worked, but now we are at the stage where `rr replay` is broken with the appendix 1 message.
Appendix 1: The assertion "Assertion `!syscall_bp_vm' failed to hold." is triggered at https://github.com/mozilla/rr/blob/153d46c1717e31bf3357abaaf0d42cbe35f6f672/src/ReplaySession.cc#L594

Appendix 2: (TODO) Syscall whitelist for systemd-nspawn derived from https://man7.org/linux/man-pages/man2. https://github.com/mozilla/rr/wiki/Docker lists some information about what would probably be needed, but it was easier to just script it (`cat syscalls | cut -d " " -f 8 | cut -d "(" -f 1 | xargs -I{} echo --system-call-filter="{}" | xargs echo`, where syscalls is a hand-edited text file) and just do everything (whitelist all of them). Note this only contains IA64 syscalls; ARM and other such syscalls have (perhaps unnecessarily, maybe it doesn't complain about missing ones) been removed. If you look at the `--system-call-filter` entry in the `systemd-nspawn` man page, it says something about wildcards; maybe it's sufficient to pass `--system-call-filter=~*`, but I didn't read carefully enough to tell whether that would blacklist or whitelist everything, and I didn't test it after the man page section was brought to my attention. Also, you can pass a list to the argument, so that would make it a lot nicer as well.

Self TODO: cleaner systemd interactions and figure out what's going on with that stuff, get rr working, clean up the system call filter stuff.