netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Erratic Signal Disconnects and Errors Prevent Communication Between Peers #2625

Open trbutler opened 1 week ago

trbutler commented 1 week ago

Describe the problem

I frequently receive "Signal: Disconnected, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded." on all of my NetBird clients. The issue has worsened from intermittent communication problems to the point where NetBird is almost completely non-functional on most of my clients. Inexplicably, a few continue to work.

I've tried adapting my NetBird "quick start" self-hosted configuration to alleviate the issue. I moved from Caddy to NGINX as the reverse proxy. This sped things up a fair amount and reduced resource usage, but didn't fix the issue. I also tried exposing Signal directly (I had Docker map the container's port 443 to host port 30006) while giving it access to NGINX's SSL certificate, so that no reverse proxy was involved at all. None of these three arrangements resolved the issue.
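When comparing arrangements like the ones above, a plain TCP connect probe can at least confirm that the exposed Signal port is reachable from a given client and how long the connect takes. This is a minimal sketch, not part of NetBird: `probe_tcp` is a hypothetical helper, and the host and port in the usage note are placeholders. A successful TCP connect says nothing about TLS or gRPC health, so it only rules out basic reachability problems.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Attempt a plain TCP connection and return (reachable, seconds_elapsed).

    This only checks that something accepts connections on host:port;
    it does not perform a TLS handshake or speak gRPC.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False, time.monotonic() - start
```

For example, `probe_tcp("signal.example.com", 30006)` (a placeholder hostname) could be run from a client that connects and from one that does not, to see whether the two differ at the TCP level.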

When proxied through NGINX, the NGINX error log is filled with entries like this:

2024/09/20 09:02:12 [error] 616106#616106: *120768 upstream rejected request with error 0 while reading response header from upstream, client: [client IP address], server: anon1.anon-r6ORu.domain, request: "POST /signalexchange.SignalExchange/Send HTTP/2.0", upstream: "grpcs://127.0.0.1:30006", host: "cyprus.serverforest.com:443"
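To see which gRPC methods and upstreams these entries involve, and how often, the error log could be tallied with a short script. This is a rough sketch that assumes every relevant entry has the same shape as the single sample line above; the regular expression and the function name `tally_upstream_rejects` are illustrative and not part of NGINX or NetBird.

```python
import re
from collections import Counter

# Pattern inferred from the one sample entry above (an assumption, not an
# official NGINX log grammar): capture the request path and the upstream.
ENTRY_RE = re.compile(
    r'upstream rejected request with error \d+.*?'
    r'request: "\S+ (?P<path>\S+) [^"]*".*?'
    r'upstream: "(?P<upstream>[^"]+)"'
)


def tally_upstream_rejects(lines):
    """Count 'upstream rejected request' entries per (path, upstream) pair."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.search(line)
        if match:
            counts[(match.group("path"), match.group("upstream"))] += 1
    return counts
```

Passing an open log file, for instance `tally_upstream_rejects(open("error.log"))` with whatever path the NGINX error log actually uses, would show whether the rejections are confined to `/signalexchange.SignalExchange/Send` or spread across other methods.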

The Signal Docker container's logs don't show anything unusual, even with the log level set to debug; they simply show many messages being conveyed between peers.

To Reproduce

Steps to reproduce the behavior:

  1. Run netbird up
  2. Wait a moment and netbird status will report the issue.

Expected behavior

I'd expect Netbird to be able to connect to the Signal server without issue.

Are you using NetBird Cloud?

I'm using self-hosted netbird.

NetBird version

0.29.2 (daemon and CLI, as reported in the status output below)

NetBird status -dA output:

Peers detail:
 washington.anon-DK9Lf.domain:
  NetBird IP: 100.91.0.186
  Public key: KQIjQLtUaZM9J30rBp2AxHC4nrvn8neHA7Vg1DURkFg=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 jacques.anon-DK9Lf.domain:
  NetBird IP: 100.91.2.189
  Public key: zAEXmNz59Fr5IF+bLp+z5rc0THc/bhJJ+o4U8jHl+3U=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 juniper.anon-DK9Lf.domain:
  NetBird IP: 100.91.12.9
  Public key: 0BwIRdWYsyZxQJHdy/GxODrwRzfesIOI0t5JtvDoWRg=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.0:51820
  Relay server address: 
  Last connection update: 4 minutes, 26 seconds ago
  Last WireGuard handshake: 13 seconds ago
  Transfer status (received/sent) 95.6 KiB/17.4 KiB
  Quantum resistance: true
  Routes: -
  Latency: 226.080667ms

 iphone-admin.anon-DK9Lf.domain:
  NetBird IP: 100.91.41.130
  Public key: 8y2qoR39K7K5Vv6hABNKqRVEdZ/FkHBZKRhWqssNGS0=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 little-hills-live-stream.anon-DK9Lf.domain:
  NetBird IP: 100.91.41.180
  Public key: yBhdVf0uxhuvaAr4tVbFDLUWnTWg/JCOviH5T3KmphM=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.1:51820
  Relay server address: 
  Last connection update: 4 minutes, 44 seconds ago
  Last WireGuard handshake: 30 seconds ago
  Transfer status (received/sent) 72.1 KiB/11.0 KiB
  Quantum resistance: true
  Routes: -
  Latency: 206.283666ms

 cyprus.anon-DK9Lf.domain:
  NetBird IP: 100.91.63.165
  Public key: gqSAS+yo0Qp3RhqWaWlY0qyhLYugQ0+6HFIJAJNQZ24=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.2:51820
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 4 minutes, 45 seconds ago
  Last WireGuard handshake: 31 seconds ago
  Transfer status (received/sent) 456 B/1.1 KiB
  Quantum resistance: true
  Routes: -
  Latency: 202.555375ms

 mastodon1.anon-DK9Lf.domain:
  NetBird IP: 100.91.83.133
  Public key: gc7H34F3uuqW1oodfgHy5VyOU80AWPyiMZKPTWAoeV0=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 rosalind.anon-DK9Lf.domain:
  NetBird IP: 100.91.87.250
  Public key: 8ulzaG4yTm9RqIYMwRQXkw4LB7LDdhXy1ocdNCuEqBA=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 198.51.100.3:51820/198.51.100.4:48521
  Relay server address: 
  Last connection update: 4 minutes, 17 seconds ago
  Last WireGuard handshake: Now
  Transfer status (received/sent) 1008 B/804 B
  Quantum resistance: true
  Routes: -
  Latency: 202.577958ms

 little-hills-slides.anon-DK9Lf.domain:
  NetBird IP: 100.91.99.242
  Public key: PZEt9DoVoL3qataY9Oc0uyBtbbmk0Z7KgjslUGDoslk=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 touchstone.anon-DK9Lf.domain:
  NetBird IP: 100.91.112.131
  Public key: PAzQjGnO5xftL4rgeX9SdkajCjEJA3A+iViMbXoPgXE=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/srflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.5:1053
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 5 minutes, 23 seconds ago
  Last WireGuard handshake: 1 minute, 14 seconds ago
  Transfer status (received/sent) 276 B/924 B
  Quantum resistance: true
  Routes: -
  Latency: 26.766667ms

 independence.anon-DK9Lf.domain:
  NetBird IP: 100.91.122.117
  Public key: wwROJuAi9t5d7W8DnF78sdMTm13iDZ9YcrtjjHtIYDM=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 198.51.100.3:51820/198.51.100.5:51820
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 4 minutes, 45 seconds ago
  Last WireGuard handshake: 1 minute, 51 seconds ago
  Transfer status (received/sent) 360 B/716 B
  Quantum resistance: true
  Routes: -
  Latency: 26.469916ms

 spruce.anon-DK9Lf.domain:
  NetBird IP: 100.91.147.59
  Public key: 0sA1GjrlFs+yPKlh7CARYIoFA/Ydsa4Tq/jnpLw1axk=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.6:51820
  Relay server address: 
  Last connection update: 4 minutes, 26 seconds ago
  Last WireGuard handshake: 20 seconds ago
  Transfer status (received/sent) 97.9 KiB/12.7 KiB
  Quantum resistance: true
  Routes: -
  Latency: 812.671042ms

 franklin.anon-DK9Lf.domain:
  NetBird IP: 100.91.150.140
  Public key: v9F8qsB+L4fpvuTv9B8NiD27cx6h6dzVMC0XBwtw4WA=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 mesquite.anon-DK9Lf.domain:
  NetBird IP: 100.91.155.86
  Public key: vp6GLJc22GQXj2Ht5deowZp0OA8kG7XJS1kYl3zc6lI=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.7:51820
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 4 minutes, 45 seconds ago
  Last WireGuard handshake: 1 minute, 50 seconds ago
  Transfer status (received/sent) 392 B/716 B
  Quantum resistance: true
  Routes: -
  Latency: 347.723708ms

 miranda.anon-DK9Lf.domain:
  NetBird IP: 100.91.170.233
  Public key: D2k3MtkmfFLj9ZuJC/3KWEW1XhMesLNpHHz8P/86q2Q=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 ipad-admin.anon-DK9Lf.domain:
  NetBird IP: 100.91.178.26
  Public key: a+sg6th5zv4wl9zCN5/q5C3O8sZQh2SwgC/8gJZuyjQ=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 boaz.anon-DK9Lf.domain:
  NetBird IP: 100.91.182.98
  Public key: ydFumIBVUwCGBjx5Xh0pZPW1G6kFq2v+8DPNz1XYkRE=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 rahab.anon-DK9Lf.domain:
  NetBird IP: 100.91.203.23
  Public key: hxczQ9TIXjpDAFHDVzwjH6aDPlC5l5GcTj0LEmhgfRQ=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 windowspc.anon-DK9Lf.domain:
  NetBird IP: 100.91.212.35
  Public key: eDP33MB5NvltMsSq9XEoxYQXoBfJjLgX9BkA3/FjKnY=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 phebe.anon-DK9Lf.domain:
  NetBird IP: 100.91.224.117
  Public key: 0Bi8tUwaKffJVD69HXxQ6RbG+wdI1npXViS4Crw+yls=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 ipad-admin-1.anon-DK9Lf.domain:
  NetBird IP: 100.91.251.74
  Public key: nheiiB0C3H5uYy+cDvWj34o9nKotHZwNTZ1lHjCB4UQ=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

OS: darwin/arm64
Daemon version: 0.29.2
CLI version: 0.29.2
Management: Connected to https://anon1.anon-r6ORu.domain:443
Signal: Connected to https://anon1.anon-r6ORu.domain:30006
Relays: 
  [stun:anon1.anon-r6ORu.domain:3478] is Available
  [turn:anon1.anon-r6ORu.domain:3478?transport=udp] is Available
  [rels://anon1.anon-r6ORu.domain:443] is Available
Nameservers: 
FQDN: falstaff.anon-DK9Lf.domain
NetBird IP: 100.91.122.186/16
Interface type: Userspace
Quantum resistance: true (permissive)
Routes: -
Peers count: 8/21 Connected
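With this many peers, it can help to reduce the detailed output to a per-peer status map so that runs taken at different times can be compared. The following is a minimal sketch that assumes the text layout shown above (a peer name on its own line ending in a colon, followed by a "Status:" line); `summarize_peers` is a hypothetical helper, not a NetBird command or API.

```python
import re

# A peer header is a single token followed by a colon and nothing else,
# e.g. " juniper.anon-DK9Lf.domain:". Field lines such as
# "Relay server address:" contain spaces and therefore do not match.
HEADER_RE = re.compile(r"^\s*(\S+):\s*$")
STATUS_RE = re.compile(r"^\s*Status:\s*(\S+)\s*$")


def summarize_peers(status_text):
    """Return {peer_name: status} parsed from detailed status text."""
    peers = {}
    current = None
    for line in status_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.match(line)
        if status and current is not None:
            peers[current] = status.group(1)
            current = None  # only the first Status line after a header counts
    return peers
```

Saving the output of two runs and comparing the resulting dictionaries would show whether the same peers stay "stuck" as Disconnected or whether membership of the two groups changes over time.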

Do you face any (non-mobile) client issues?

Yes, the issue prevents clients from functioning. Presently most clients cannot connect, although a few consistently do. There is no rhyme or reason I've been able to discern: with two clients in the same location, one consistently connects and one does not, and the variation does not appear to relate to platform (some of the working clients run macOS, some run Debian Linux). Reauthorizing the clients with a new setup key doesn't seem to change things for better or worse; it is as if they are "stuck" either working or not.

(Although all clients show the Signal error given above at least part of the time.)

Additional context

I'm using a modified version of the docker-compose.yml that was available back in December 2023. It's been upgraded to add the new relay container, remove Caddy (as noted above as part of troubleshooting), expose the NGINX SSL cert to Signal, etc. Because it is from last year, it uses CockroachDB instead of PostgreSQL. I've wondered about finding a way to migrate cleanly to PostgreSQL, though I don't know if that'd materially affect this problem or not.

My docker-compose.yml:

version: "3.4"
services:
  # Caddy reverse proxy
#  caddy:
#    image: caddy
#    restart: unless-stopped
#    networks: [ netbird ]
#    #ports:
#    #  - '443:443'
#    #  - '80:80'
#    #  - '8080:8080'
#    volumes:
#      - netbird_caddy_data:/data
#      - ./Caddyfile:/etc/caddy/Caddyfile
  relay:
    image: netbirdio/relay:latest
    restart: unless-stopped
    networks: [netbird]
    ports:
      - '30005:80'
    env_file:
      - ./relay.env
    logging:
      driver: "json-file"
      options:
        max-size: "500m"
        max-file: "2"

  #UI dashboard
  dashboard:
    image: netbirdio/dashboard:latest
    restart: unless-stopped
    networks: [netbird]
    ports: 
      - '30001:80'
    env_file:
      - ./dashboard.env
  # Signal
  signal:
    image: netbirdio/signal:latest
    restart: unless-stopped
    networks: [netbird]
    ports:
      - '30002:80'
      - '30006:443'
    command: [ "--log-file", "console","--log-level","debug","--cert-file","/ssl/fullchain.pem","--cert-key","/ssl/privkey.pem" ]
    volumes:
      - /etc/letsencrypt/live/anon1.anon-r6ORu.domain/fullchain.pem:/ssl/fullchain.pem:ro
      - /etc/letsencrypt/live/anon1.anon-r6ORu.domain/privkey.pem:/ssl/privkey.pem:ro
  # Management
  management:
    image: netbirdio/management:latest
    restart: unless-stopped
    networks: [netbird]
    ports:
      - '30003:80'
    volumes:
      - netbird_management:/var/lib/netbird
      - ./management.json:/etc/netbird/management.json
    command: [
      "--port", "80",
      "--log-file", "console",
      "--log-level", "info",
      "--disable-anonymous-metrics=false",
      "--single-account-mode-domain=anon2.domain",
      "--dns-domain=anon2.domain",
      "--idp-sign-key-refresh-enabled",
    ]
  # Coturn, AKA relay server
  coturn:
    image: coturn/coturn
    restart: unless-stopped
    domainname: netbird.relay.selfhosted
    volumes:
      - ./turnserver.conf:/etc/turnserver.conf:ro
    network_mode: host
    command:
      - -c /etc/turnserver.conf
  # Zitadel - identity provider
  zitadel:
    restart: 'always'
    networks: [netbird]
    image: 'ghcr.io/zitadel/zitadel:v2.31.3'
    command: 'start-from-init --masterkeyFromEnv --tlsMode external'
    ports:
      - '30004:8080'
    env_file:
      - ./zitadel.env
    depends_on:
      crdb:
        condition: 'service_healthy'
    volumes:
      - ./machinekey:/machinekey
      - netbird_zitadel_certs:/crdb-certs:ro
  # CockroachDB for zitadel
  crdb:
    restart: 'always'
    networks: [netbird]
    image: 'cockroachdb/cockroach:v22.2.2'
    command: 'start-single-node --advertise-addr crdb'
    volumes:
      - netbird_crdb_data:/cockroach/cockroach-data
      - netbird_crdb_certs:/cockroach/certs
      - netbird_zitadel_certs:/zitadel-certs
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080/health?ready=1" ]
      interval: '10s'
      timeout: '30s'
      retries: 5
      start_period: '20s'

volumes:
  netbird_management:
  netbird_caddy_data:
  netbird_crdb_data:
  netbird_crdb_certs:
  netbird_zitadel_certs:

networks:
  netbird:

trbutler commented 1 week ago

I still haven't been able to solve this, but I did set up a second NetBird server and moved the peers over to it. So far, I haven't seen the same issue there, which makes me think it may have something to do with the upgrade path to the latest containers. It still worries me, though, since it took a clean slate, with all the peers manually reconnected to a new installation of the server, to get things up and running again. I'm going to wipe the old server, but I have left it up for the moment in case you'd like any debug data from it before I do.