macOS: peers inaccessible after long uptime

mcginty commented 1 year ago

Unfortunately I haven't found the root cause yet, but occasionally innernet on macOS gets into a state where peers are no longer accessible, and `innernet up` doesn't fix the problem unless you run `innernet down` beforehand to reset all the state and tunnels.

The test innernet subnet is fd00:1337::/48.

I just hit this condition, so dumping some debug output here for later investigation.

tldr: at first glance, the routes and interfaces look normal, but I can't even ping my local wireguard IP via the loopback interface (lo0). Something is weird.

`netstat -rn` output:

Internet6:
Destination                             Gateway                         Flags           Netif Expire
default                                 fe80::%utun0                    UGcIg           utun0
default                                 fe80::%utun1                    UGcIg           utun1
default                                 fe80::%utun2                    UGcIg           utun2
default                                 fe80::%utun3                    UGcIg           utun3
default                                 fe80::%utun4                    UGcIg           utun4
::1                                     ::1                             UHL               lo0
fd00:1337::/48                          fe80::fa4d:89ff:fe85:2c87%utun5 Uc              utun5
fd00:1337:0:1:1::1                      link#26                         UHL               lo0

...truncated...

WireGuard information

$ ps aux | grep wireguard-go
root             25514   0.0  0.1 409219344  10144   ??  S    Tue08AM  14:41.65 wireguard-go utun

$ wireguard-go --version
wireguard-go v0.0.20230223

$ sudo wg
interface: utun5
  public key: <redacted>
  private key: (hidden)
  listening port: 60774

peer: <redacted>
  endpoint: <redacted>:51820
  allowed ips: fd00:1337::1/128
  latest handshake: 21 hours, 16 minutes, 31 seconds ago
  transfer: 16.56 GiB received, 838.20 MiB sent
  persistent keepalive: every 25 seconds

$ ifconfig utun5
utun5: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1280
        inet6 fe80::fa4d:89ff:fe85:2c87%utun5 prefixlen 64 scopeid 0x1a
        inet6 fd00:1337:0:1:1::1 prefixlen 48
        nd6 options=201<PERFORMNUD,DAD>

$ route -n get -inet6 fd00:1337:0:1:1::1
   route to: fd00:1337:0:1:1::1
destination: fd00:1337:0:1:1::1
  interface: lo0
      flags: <UP,HOST,DONE,LLINFO,LOCAL>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
       0         0         0         0         0         0     16384         0

$ route -n get -inet6 fd00:1337::1
   route to: fd00:1337::1
destination: fd00:1337::1
  interface: utun5
      flags: <UP,HOST,DONE,WASCLONED,IFSCOPE,IFREF>
 recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
       0         0         0         0         0         0      1280         0
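
One detail in the `wg` output above may be worth checking the next time this happens: the latest handshake is 21 hours old even though a 25-second persistent keepalive is configured. While traffic is flowing, WireGuard normally renews the handshake every couple of minutes, so a very old handshake suggests the tunnel itself is dead, not just the routing. Below is a minimal sketch for flagging such peers. The `stale_peers` helper and its 180-second default are assumptions for illustration, not part of innernet or WireGuard; the input format is the one printed by `wg show <interface> latest-handshakes`.

```shell
#!/bin/sh
# Hypothetical helper: list peers whose latest handshake looks stale.
# stdin: lines of "<public-key><TAB><unix-timestamp>", as printed by
#        `wg show <interface> latest-handshakes`.
# $1:    current time as a Unix timestamp.
# $2:    maximum acceptable age in seconds (default 180; this threshold
#        is an assumption, not a WireGuard constant).
stale_peers() {
    now=$1
    max_age=${2:-180}
    awk -v now="$now" -v max="$max_age" '
        $2 == 0        { print $1 "\tnever"; next }      # no handshake recorded
        now - $2 > max { print $1 "\t" (now - $2) "s" }  # age exceeds threshold
    '
}
```

Possible usage on the affected machine: `sudo wg show utun5 latest-handshakes | stale_peers "$(date +%s)"`. Empty output would mean every peer has a recent handshake.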

strohel commented 1 year ago

I think @goodhoko might have had a similar issue. Though for him "long uptime" was just minutes, so I'm not sure it's the same cause.

goodhoko commented 1 year ago

My problem may be just me not using innernet correctly. IDK why (I think someone told me it's fine to) but I used to ctrl+c the innernet up command before it established connection with all peers and ran to completion. This leaves the network stopped (even if it was previously running).

Adding to the confusion, the network works fine while `innernet up` is trying to establish connections. I tend to jump between terminal tabs a lot, and I often started using innernet before `innernet up` had finished in another tab. Thinking it was all set up, I jumped back and killed `innernet up`, stopping the network again.

Unrelated to this issue, but maybe we could either make `innernet up` handle SIGINT more gracefully in the phase of establishing connections with peers, or just print something like `Received ctrl+c. Stopping XXX network.` so that it's clear what's happening. Shall I create an issue for that?
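
To make the second option concrete, here is a rough shell sketch of the behaviour being suggested. It is not innernet's actual code (innernet is not a shell script); `interrupt_notice` and `run_with_notice` are hypothetical names, and the wrapped command stands in for the connection phase of `innernet up`.

```shell
#!/bin/sh
# Hypothetical sketch of "tell the user what ctrl+c just did".

# Print the notice proposed above to stderr. $1 is the network name.
interrupt_notice() {
    printf 'Received ctrl+c. Stopping %s network.\n' "$1" >&2
}

# Run a long-lived command (everything after the network name) and
# announce SIGINT before exiting with the conventional 128+2 status.
# Note: the trap stays installed in the calling shell after returning.
run_with_notice() {
    network=$1
    shift
    trap 'interrupt_notice "$network"; exit 130' INT
    "$@"
}
```

For example, `run_with_notice tonari sleep 60` followed by ctrl+c would print the notice for the `tonari` network before exiting.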

I'm on mac with wireguard-go.

strohel commented 1 year ago

> My problem may be just me not using innernet correctly. IDK why (I think someone told me it's fine to) but I used to ctrl+c the innernet up command before it established connection with all peers and ran to completion. This leaves the network stopped (even if it was previously running).

It definitely behaves differently for me: I can Ctrl+C it right after

strohel@thicky ~/work/portal $ innernet up
[*] fetching state for tonari from server...
[*]   peer dev-pablo-portal (9Wj1oUXCWW...) was modified.
[*]     Endpoint: 81.34.16.254:35626 => 2.138.197.161:49285
[*]   peer jen (gUwOAMVBQW...) was modified.
[*]     Endpoint: 192.168.1.1:52495 => 37.143.115.174:53063
[*]   peer taj (5iuERx/Z7v...) was modified.
[*]     Endpoint: 111.216.113.76:58107 => 133.201.82.64:32789

[*] updated interface tonari

[*] reporting 2 interface addresses as NAT traversal candidates

and the network stays up, with most/all peers connected (it probably just misses some NAT traversal opportunities, but that's a corner case).

Linux, kernel-space wireguard here.

mcginty commented 1 year ago

@goodhoko that happens to me too, I think like you said we just need to handle SIGINT more intelligently. I'll open a separate issue for that, because this is a whole different beast...
