Closed: deliciouslytyped closed this issue 3 years ago
An attempt at debugging that I forgot to mention, also the most fruitful so far, was trying to use strace. A short excerpt is attached here:
....
read(3, ",placeholder"..., 498) = 498
sendto(6, "placeholder"..., 545, 0, {sa_family=AF_INET, sin_port=htons(1676), sin_addr=inet_addr("[redacted]")}, 16) = -1 EMSGSIZE (Message too long)
read(3, 0x7ffdbe246372, 66033) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe246371, 66034) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe246370, 66035) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636f, 66036) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636e, 66037) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636d, 66038) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636c, 66039) = -1 EFAULT (Bad address)
read(3, 0x7ffdbe24636b, 66040) = -1 EFAULT (Bad address)
....
There's a bunch of sendto()s and read()s whose arguments shift by one each call, and then a bunch of those EFAULT failures. I'm assuming this is what trashes the stack.
TODO: get a better strace and don't axe so much info
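As a sanity check on what EFAULT means here: read(2) fails that way when the destination buffer isn't a valid writable address range, which fits the idea that the pointers above are already trash. A minimal illustration (my own snippet, nothing to do with tinc's code):

```c
/* Minimal illustration (not tinc code): read(2) fails with EFAULT when the
 * destination buffer is not a valid writable address range. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return 1;
    /* Deliberately pass a bogus destination pointer; on a typical Linux setup
     * the low pages are unmapped, so the kernel can't copy anything out. */
    ssize_t n = read(fd, (void *)0x1000, 66033);
    printf("read returned %zd (%s)\n", n, n < 0 ? strerror(errno) : "ok");
    close(fd);
    return 0;
}
```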
TL;DR on the rr issue is that rr uses some of its own syscalls (or something like that) for internal communication, and systemd-nspawn didn't have them whitelisted, because my whitelist was based on the man page's enumeration of the kernel's built-in syscall names. Furthermore, rr (IIRC) didn't warn or abort about this in version 5.3.0, which was the original version I was using.
rr master currently contains checks and notifies the user if this is an issue, as well as suggesting the -n flag which avoids it.
I was able to successfully create an rr recording of tinc in the container with `rr record -n ...`; this should allow (me) to reproducibly debug the issue, even when I don't have access to the specific network that causes the crash.
My current reading is the following:
https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net_packet.c#L1338 is passing 18+65535 as the length argument while https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net_packet.c#L1085 allocates a vpn packet with a buffer size of 1673 for some reason, per:
gef➤ ptype vpn_packet_t
type = struct vpn_packet_t {
length_t len;
length_t offset;
int priority;
uint8_t data[1673];
}
...namely https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net.h#L36 used in https://github.com/gsliepen/tinc/blob/2b74e1b01af2d56d6e7ebc135143fbe81f6ca455/src/net.h#L91-L96. I don't know why this isn't causing weird crashes all over the place on other people's machines. So basically the stack gets smashed with 64k of presumably random data by a read() call in randomize()? Once that overwrites the stack, the stack smashing check at the end of the probe function triggers, causing the program to abort.
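To make the mechanism concrete, here's a minimal sketch of the pattern I think is at play (my own toy code, not tinc's): a fixed 1673-byte packet buffer on the stack gets filled with far more data than it can hold, and the stack protector aborts on return.

```c
/* Toy reproduction of the suspected pattern, not tinc's code.
 * Build with: gcc -O0 -fstack-protector-strong smash.c */
#include <string.h>

typedef unsigned short length_t;

/* Mimics the shape of vpn_packet_t above: a small fixed data buffer. */
typedef struct {
    length_t len;
    length_t offset;
    int priority;
    unsigned char data[1673];
} fake_packet_t;

static void probe(size_t bogus_len) {
    fake_packet_t packet;
    /* Stand-in for the read()/randomize() fill: anything well past 1673 bytes
     * overwrites the stack canary, so the function aborts with
     * "stack smashing detected" on return. */
    memset(packet.data, 0xAA, bogus_len);
}

int main(void) {
    probe(4096); /* the real case apparently used something closer to 64k */
    return 0;
}
```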
What I still don't understand is why this crash triggers reliably, in the same piece of tinc code, on a single specific network, every (? - or at least almost every) time I try to start tincd.
Another way to put the question is: why is the offset so big? Or is it underflowing, or -1, or something?
gef➤ ptype length_t
type = unsigned short
I.e., a two-byte value.
The assembly code between the powf call and the probe call, including the addition, is here:
0x0000560de58b5d39 <+345>: call 0x560de58ab0a0 <powf@plt>
0x0000560de58b5d3e <+350>: mov rdi,r12
0x0000560de58b5d41 <+353>: movzx r13d,WORD PTR [r12+0x23e]
0x0000560de58b5d4a <+362>: cvttss2si esi,xmm0
0x0000560de58b5d4e <+366>: movzx esi,si // changes esi from 0xffffffff to 0xffff
0x0000560de58b5d51 <+369>: add esi,r14d
0x0000560de58b5d54 <+372>: call 0x560de58b5930 <send_udp_probe_packet>
I can't just inspect the value of offset trivially because it's optimized out.
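Reading that assembly back into C, my understanding is roughly the following (a runnable sketch; base, exponent and other_term are made-up placeholders, not the actual tinc values):

```c
/* Rough rendering of the instruction sequence above.
 * Build with: gcc -O0 sketch.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    float base = -1.0f, exponent = 1.0f;
    int other_term = 512;

    int converted = (int)powf(base, exponent); /* cvttss2si esi,xmm0 : -1.0f -> -1 (0xffffffff) */
    int clipped   = converted & 0xffff;        /* movzx esi,si       : -1 clipped to 0xffff == 65535 */
    int length    = clipped + other_term;      /* add esi,r14d       : a huge value becomes the length */

    printf("converted=%d clipped=%d length=%d\n", converted, clipped, length);
    return 0;
}
```

With these placeholder inputs it prints converted=-1 clipped=65535 length=66047, which is at least in the same ballpark as the huge offsets I'm seeing.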
The series of offset values decreases one by one through 1328..512, and then in the next cycle it's 66407 (or something like that, I might've mixed things up). I haven't figured out what exactly is going on yet.
You might also be interested in compiling it with -g -fsanitize=address,undefined to use ASAN+UBSAN; they are supported by both GCC and Clang now. Sanitizers detect some problems better than valgrind (although there are still problems which are better detected by valgrind than by ASAN/UBSAN/MSAN), have less runtime overhead, and (IMHO) give slightly better diagnostic output.
Upd: `const length_t offset = powf(...)` - oh, this is definitely a bad construction, one that could be detected by the -Wnarrowing or -Wconversion compiler options.
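For illustration, a stripped-down version of that construction (not the actual tinc line; compute_offset and its parameters are names I made up):

```c
#include <math.h>
#include <stdio.h>

typedef unsigned short length_t;

static length_t compute_offset(float base, float exponent) {
    /* gcc -Wconversion flags this implicit float -> unsigned short conversion;
     * if powf() ever returns a negative value, the conversion is undefined
     * behaviour, and on x86 it shows up as 0xffff, as in the assembly above. */
    const length_t offset = powf(base, exponent);
    return offset;
}

int main(void) {
    printf("%d\n", compute_offset(-1.0f, 1.0f)); /* typically prints 65535 */
    return 0;
}
```

Compiling this with `gcc -Wconversion -c ... -lm` flags the float to unsigned short conversion, which is exactly the spot where a negative powf() result silently turns into 65535.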
From what little I figured out, I don't think the cast was what was actually responsible for the value? At least not directly.
I imagine the problem is why `cvttss2si esi,xmm0` (the instruction at <+362> above) loads 0xffffffff (or however long it was) into esi in the first place? That's -1, right? Or some floating-point special value?
I'm probably completely misinterpreting what's going on here.
Oh - or is this a signed-unsigned conversion issue and it is just misinterpreting a -1?
Ok, I did what I should have done anyway and checked what that value is as a float (e.g. https://stackoverflow.com/questions/13673045/what-will-be-the-value-in-float-if-i-have-a-binary-number-as-1111111111111111-an), and it is in fact -nan on my machine.
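For reference, the check itself is just reinterpreting the bit pattern as a float (my own snippet, same idea as the StackOverflow answer):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t bits = 0xffffffffu;
    float f;
    memcpy(&f, &bits, sizeof f); /* reinterpret the bits as an IEEE-754 float */
    printf("%f\n", f);           /* prints -nan here: all-ones is a negative quiet NaN */
    return 0;
}
```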
So the question is which case in https://en.cppreference.com/w/c/numeric/math/pow is triggered. I still haven't managed to read out the float args properly in the case of the failing iteration, I think, so that's probably the next task, unless someone can tell what's wrong by staring at the code.
If I'm doing this right, powf() is getting called with powf(-1, 1);
gef➤ print $xmm1
$69 = {
v4_float = {1, 0, 0, 0},
v2_double = {5.2635442471208903e-315, 0},
v16_int8 = {0x0, 0x0, 0x80, 0x3f, 0x0 <repeats 12 times>},
v8_int16 = {0x0, 0x3f80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
v4_int32 = {0x3f800000, 0x0, 0x0, 0x0},
v2_int64 = {0x3f800000, 0x0},
uint128 = 0x3f800000
}
gef➤ print $xmm0
$70 = {
v4_float = {-1, 0, 0, 0},
v2_double = {1.5873523201947252e-314, 0},
v16_int8 = {0x0, 0x0, 0x80, 0xbf, 0x0 <repeats 12 times>},
v8_int16 = {0x0, 0xbf80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
v4_int32 = {0xbf800000, 0x0, 0x0, 0x0},
v2_int64 = {0xbf800000, 0x0},
uint128 = 0xbf800000
}
(-1)^1 = -1, which would explain why the return value is -1. That doesn't make sense though, because interval should be 512?
I am just labelling / cleaning up the issue list. This is the most detailed investigation I have come across so far. Thanks for all the effort you put in.
pow can be underflowed if n->maxmtu is less than 512, due to https://github.com/gsliepen/tinc/blob/1.1/src/net_packet.c#L1328. Or in other words, minmtu is not necessarily less than maxmtu. Although the MTU normally isn't that small, I'm guessing it can be erroneously lowered? Anyway, PR to follow shortly.
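In miniature, the failure mode and one possible guard (just a sketch of the clamping idea, not necessarily what the PR will do):

```c
/* Sketch only; clamp_maxmtu is a made-up helper, not tinc code. */
#include <stdio.h>

typedef unsigned short length_t;

/* Ensure the probe-length computation never sees maxmtu below minmtu,
 * which is what made the powf() base negative in the trace above. */
static length_t clamp_maxmtu(length_t minmtu, length_t maxmtu) {
    return maxmtu < minmtu ? minmtu : maxmtu;
}

int main(void) {
    /* e.g. a node whose maxmtu was erroneously lowered below the usual floor */
    printf("%d\n", clamp_maxmtu(512, 300)); /* prints 512 */
    return 0;
}
```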
Oh, and thanks @deliciouslytyped for all your troubleshooting. That powf(-1, ...) was all the hint I needed.
Moving from a specific physical wifi network to another (and removing my laptop from the dock (TODO: rule this out)) causes the tincd I have running in a container (TODO: rule this out) to segfault reproducibly, but I haven't been able to isolate the bug.

It consistently gets up to the `Sending 70 bytes of metadata to [redacted] ([redacted] port 20)` portion of the attached log, and then (obviously at some point after this) segfaults. Starting tincd with gdb results in the stack being incomprehensible (to me). Starting it with `rr record` results in a stack smashing notification, which probably explains the previous point. However, due to an `rr replay` issue I haven't been able to yield any good info from that either. Maybe valgrind is worth a shot?

Here is a log:
One may note (what I assume is) a race condition at `Cannot open config file /etc/tinc/gate/hosts/[redacted]: No such file or directory` (probably caused by `--chaos`), which succeeds a few lines later.

Notes on debugging (mainly as a note to myself):

systemd: This is running in a container on NixOS, and the main service configs are on a read-only file system. This makes changing the service file to run a tincd with debug symbols, or otherwise, impossible. It turns out to be a bit obscure, but `systemd-analyze unit-paths` lists the paths systemd searches for service files. However, I'm told this "isn't accurate" on NixOS. Before that I found a list at https://www.freedesktop.org/software/systemd/man/systemd.unit.html (via https://askubuntu.com/questions/876733/where-are-the-systemd-units-services-located-in-ubuntu); however, I don't know how that list is derived, nor do I know how to change or add entries. Furthermore, some of the directories listed seem to be special purpose (randomly putting the .service in some of them didn't work and caused systemd to either not see the service or complain). (TODO: look into this)

This was needed for two debugging variants, gdb and rr.
However, in the end it wasn't actually necessary to start tincd via the service files; the issues were reproducible just fine by calling the executable on the command line. I was just trying to keep reproduction in as similar an environment as the original. Part of my confusion was caused by tincd behaving differently if started by gdb than if gdb was just attached to it. This ended up being an issue with ASLR, and was fixed with `set disable-randomization off`. I don't know what could cause this, but the symptoms were that tinc would seemingly hang instead of crashing.

gdb variant: I ended up copying the container's existing tinc network service file to /run/systemd/system and editing the service files there. I used `nix-store --add` to add the Exec script to the nix store, to work around an issue that I didn't look into further, which came up when I tried to put the script in the same directory as the service file and refer to it by absolute path in the service definition ("/run/systemd/system/gate-init" or such). That may have been a problem of me mixing up the outer system and the container, but I'm not sure. It might have been systemd sandboxing related. The gdb command ended up being (I lost it in scroll, but it's similar to the later rr command) something like `gdb path/to/tincd -ex "run -tinc -parameters -here"`. The tincd path used was the nix store path corresponding to the output of `nix-build -I nixpkgs=channel:nixos-unstable -E "with import <nixpkgs> {}; tinc_pre.overrideAttrs (old: { dontStrip = true; })" -v --no-out-link`.
rr variant: The first issue while trying to run rr in the container was `[FATAL /build/source/src/PerfCounters.cc:317:start_counter() errno: EPERM] Failed to initialize counter` and some backtrace that I'm missing. I'd recently heard that rr does stuff with CPU performance counter registers, so I figured this was a container permissions issue (eventually discovering https://github.com/mozilla/rr/wiki/Docker). So it turns out, in the wonderful feat of engineering that systemd is, `systemd-nspawn` doesn't currently allow disabling its seccomp protections, per https://lists.freedesktop.org/archives/systemd-devel/2020-June/044756.html. I ended up getting a list of syscalls from https://man7.org/linux/man-pages/man2/syscalls.2.html, cleaning it up a bit, and running `cat syscalls | cut -d " " -f 8 | cut -d "(" -f 1 | xargs -I{} echo --system-call-filter="{}" | xargs echo` to generate the syscall whitelist. Using the systemd service methods listed above I constructed a `nixos-container`-derived configuration manually (I really should have just generated a new container with the parameters set...). For anyone needing to do such a hack sometime, you can copy it from appendix 2. I also added `--capability="CAP_SYS_PTRACE"` and `--capability="CAP_SYS_ADMIN"` in the hopes that they would be enough, before generating the whitelist, but I don't know if these did anything. Which is to say, I don't know which of all the things I did at the same time actually did anything. The rr command ended up being `/nix/store/yairbna66xfrib2snjdq9z5gxqxlk4bw-rr-5.3.0/bin/rr record /nix/store/3218jrqqrh8cbbnhijhh2vxw2m9vm85m-tinc-1.1pre17/bin/tincd -D -U gate -n gate --pidfile /run/gate.pid -d 4`.

There was an issue at some point where trying to run the tincd in rr complained about some sort of dynamic linking error. I don't know the cause of this, but I probably used the wrong tincd path (NixOS lets you link different versions of stuff against each other on the same system); I just got the one from the above "gdb variant" section and it worked. After that the `rr record` command worked, but now we are at the stage where `rr replay` is broken with the appendix 1 message.
Appendix 1: The assertion "Assertion `!syscall_bp_vm' failed to hold." is triggered at https://github.com/mozilla/rr/blob/153d46c1717e31bf3357abaaf0d42cbe35f6f672/src/ReplaySession.cc#L594

Appendix 2: (TODO) Syscall whitelist for systemd-nspawn derived from https://man7.org/linux/man-pages/man2. https://github.com/mozilla/rr/wiki/Docker lists some information about what would probably be needed, but it was easier to just script it (`cat syscalls | cut -d " " -f 8 | cut -d "(" -f 1 | xargs -I{} echo --system-call-filter="{}" | xargs echo`, where syscalls is a hand-edited text file) and just do everything (whitelist all of them). Note this only contains IA64 syscalls; ARM and other such syscalls have (perhaps unnecessarily, maybe it doesn't complain about missing ones) been removed. If you look at the `--system-call-filter` entry in the `systemd-nspawn` man page, it says something about wildcards; maybe it's sufficient to pass `--system-call-filter=~*`, but I didn't read carefully enough to tell whether that would blacklist or whitelist everything, and I didn't test it after the man page section was brought to my attention. Also, you can pass a list to the argument, so that would make it a lot nicer as well.

Self TODO: cleaner systemd interactions and figure out what's going on with that stuff, get rr working, clean up the system call filter stuff.