Erratic Signal Disconnects and Errors Prevent Communication Between Peers

Describe the problem

I frequently receive "Signal: Disconnected, reason: rpc error: code = DeadlineExceeded desc = context deadline exceeded" on all of my NetBird clients. The problem has worsened over time: what began as intermittent communication failures has become a situation in which NetBird is almost completely non-functional for most of my clients. Inexplicably, a few continue to work.

I've tried adapting my NetBird "quick start" self-hosted configuration to alleviate the issue. First, I moved from Caddy to NGINX as the reverse proxy. This sped things up a fair amount and reduced resource usage, but didn't fix the issue. I also tried exposing Signal directly (Docker maps the container's port 443 to host port 30006) while giving it access to NGINX's SSL certificate, so that no reverse proxy was involved at all. None of these three arrangements resolved the issue.

When proxied through NGINX, the NGINX error log is filled with entries like this:

2024/09/20 09:02:12 [error] 616106#616106: *120768 upstream rejected request with error 0 while reading response header from upstream, client: [client IP address], server: anon1.anon-r6ORu.domain, request: "POST /signalexchange.SignalExchange/Send HTTP/2.0", upstream: "grpcs://127.0.0.1:30006", host: "cyprus.serverforest.com:443"
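
To see whether these rejections are confined to one gRPC method or upstream, the error log could be tallied with something like the sketch below. It is only an illustration: the function name tally_upstream_rejections is hypothetical, and the regular expressions assume every entry follows the layout of the line quoted above.

```python
import re
from collections import Counter

# Field patterns based on the single log entry quoted in this report.
REQUEST_RE = re.compile(r'request: "(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)"')
UPSTREAM_RE = re.compile(r'upstream: "(?P<upstream>[^"]+)"')


def tally_upstream_rejections(lines):
    """Count 'upstream rejected request' entries per (request path, upstream)."""
    counts = Counter()
    for line in lines:
        if "upstream rejected request" not in line:
            continue
        req = REQUEST_RE.search(line)
        up = UPSTREAM_RE.search(line)
        if req and up:
            counts[(req.group("path"), up.group("upstream"))] += 1
    return counts


# The entry from this report, used as sample input.
SAMPLE = (
    '2024/09/20 09:02:12 [error] 616106#616106: *120768 upstream rejected '
    'request with error 0 while reading response header from upstream, '
    'client: [client IP address], server: anon1.anon-r6ORu.domain, '
    'request: "POST /signalexchange.SignalExchange/Send HTTP/2.0", '
    'upstream: "grpcs://127.0.0.1:30006", host: "cyprus.serverforest.com:443"'
)

if __name__ == "__main__":
    for (path, upstream), n in tally_upstream_rejections([SAMPLE]).most_common():
        print(f"{n:6d}  {path}  ->  {upstream}")
```

In practice the lines would come from the NGINX error log file rather than from SAMPLE, e.g. by passing an open file object to the function.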

The Signal Docker container's logs don't show anything unusual, even with the log level set to debug; they simply show many messages being relayed between peers.

To Reproduce

Steps to reproduce the behavior:

1. Run netbird up
2. Wait a moment, then run netbird status; it will report the issue.
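
To record how often the Signal connection drops, the status check could be repeated on a timer, roughly as sketched below. The helper names signal_state and poll_signal are hypothetical, and the parser assumes the "Signal:" line keeps the two forms seen in this report ("Connected to ..." and "Disconnected, reason: ...").

```python
import re
import subprocess
import time

# Matches the "Signal:" line of `netbird status` output, as quoted in this report.
SIGNAL_RE = re.compile(
    r"^Signal:\s*(?P<state>Connected|Disconnected)\b(?P<rest>.*)$", re.MULTILINE
)


def signal_state(status_text):
    """Return (state, detail) from the 'Signal:' line, or None if it is absent."""
    m = SIGNAL_RE.search(status_text)
    if m is None:
        return None
    detail = m.group("rest").lstrip(" ,").strip()
    return m.group("state"), detail


def poll_signal(samples=10, interval_s=5.0):
    """Run `netbird status` repeatedly and yield the parsed Signal state."""
    for _ in range(samples):
        result = subprocess.run(
            ["netbird", "status"], capture_output=True, text=True
        )
        yield signal_state(result.stdout)
        time.sleep(interval_s)


if __name__ == "__main__":
    for state in poll_signal(samples=3, interval_s=2.0):
        print(time.strftime("%H:%M:%S"), state)
```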

Expected behavior

I'd expect Netbird to be able to connect to the Signal server without issue.

Are you using NetBird Cloud?

I'm using self-hosted netbird.

NetBird version

0.29.2 (daemon and CLI, per the status output below)

NetBird status -dA output:

Peers detail:
 washington.anon-DK9Lf.domain:
  NetBird IP: 100.91.0.186
  Public key: KQIjQLtUaZM9J30rBp2AxHC4nrvn8neHA7Vg1DURkFg=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 jacques.anon-DK9Lf.domain:
  NetBird IP: 100.91.2.189
  Public key: zAEXmNz59Fr5IF+bLp+z5rc0THc/bhJJ+o4U8jHl+3U=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 juniper.anon-DK9Lf.domain:
  NetBird IP: 100.91.12.9
  Public key: 0BwIRdWYsyZxQJHdy/GxODrwRzfesIOI0t5JtvDoWRg=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.0:51820
  Relay server address: 
  Last connection update: 4 minutes, 26 seconds ago
  Last WireGuard handshake: 13 seconds ago
  Transfer status (received/sent) 95.6 KiB/17.4 KiB
  Quantum resistance: true
  Routes: -
  Latency: 226.080667ms

 iphone-admin.anon-DK9Lf.domain:
  NetBird IP: 100.91.41.130
  Public key: 8y2qoR39K7K5Vv6hABNKqRVEdZ/FkHBZKRhWqssNGS0=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 little-hills-live-stream.anon-DK9Lf.domain:
  NetBird IP: 100.91.41.180
  Public key: yBhdVf0uxhuvaAr4tVbFDLUWnTWg/JCOviH5T3KmphM=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.1:51820
  Relay server address: 
  Last connection update: 4 minutes, 44 seconds ago
  Last WireGuard handshake: 30 seconds ago
  Transfer status (received/sent) 72.1 KiB/11.0 KiB
  Quantum resistance: true
  Routes: -
  Latency: 206.283666ms

 cyprus.anon-DK9Lf.domain:
  NetBird IP: 100.91.63.165
  Public key: gqSAS+yo0Qp3RhqWaWlY0qyhLYugQ0+6HFIJAJNQZ24=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/host
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.2:51820
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 4 minutes, 45 seconds ago
  Last WireGuard handshake: 31 seconds ago
  Transfer status (received/sent) 456 B/1.1 KiB
  Quantum resistance: true
  Routes: -
  Latency: 202.555375ms

 mastodon1.anon-DK9Lf.domain:
  NetBird IP: 100.91.83.133
  Public key: gc7H34F3uuqW1oodfgHy5VyOU80AWPyiMZKPTWAoeV0=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 rosalind.anon-DK9Lf.domain:
  NetBird IP: 100.91.87.250
  Public key: 8ulzaG4yTm9RqIYMwRQXkw4LB7LDdhXy1ocdNCuEqBA=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 198.51.100.3:51820/198.51.100.4:48521
  Relay server address: 
  Last connection update: 4 minutes, 17 seconds ago
  Last WireGuard handshake: Now
  Transfer status (received/sent) 1008 B/804 B
  Quantum resistance: true
  Routes: -
  Latency: 202.577958ms

 little-hills-slides.anon-DK9Lf.domain:
  NetBird IP: 100.91.99.242
  Public key: PZEt9DoVoL3qataY9Oc0uyBtbbmk0Z7KgjslUGDoslk=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 touchstone.anon-DK9Lf.domain:
  NetBird IP: 100.91.112.131
  Public key: PAzQjGnO5xftL4rgeX9SdkajCjEJA3A+iViMbXoPgXE=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/srflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.5:1053
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 5 minutes, 23 seconds ago
  Last WireGuard handshake: 1 minute, 14 seconds ago
  Transfer status (received/sent) 276 B/924 B
  Quantum resistance: true
  Routes: -
  Latency: 26.766667ms

 independence.anon-DK9Lf.domain:
  NetBird IP: 100.91.122.117
  Public key: wwROJuAi9t5d7W8DnF78sdMTm13iDZ9YcrtjjHtIYDM=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): srflx/prflx
  ICE candidate endpoints (Local/Remote): 198.51.100.3:51820/198.51.100.5:51820
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 4 minutes, 45 seconds ago
  Last WireGuard handshake: 1 minute, 51 seconds ago
  Transfer status (received/sent) 360 B/716 B
  Quantum resistance: true
  Routes: -
  Latency: 26.469916ms

 spruce.anon-DK9Lf.domain:
  NetBird IP: 100.91.147.59
  Public key: 0sA1GjrlFs+yPKlh7CARYIoFA/Ydsa4Tq/jnpLw1axk=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.6:51820
  Relay server address: 
  Last connection update: 4 minutes, 26 seconds ago
  Last WireGuard handshake: 20 seconds ago
  Transfer status (received/sent) 97.9 KiB/12.7 KiB
  Quantum resistance: true
  Routes: -
  Latency: 812.671042ms

 franklin.anon-DK9Lf.domain:
  NetBird IP: 100.91.150.140
  Public key: v9F8qsB+L4fpvuTv9B8NiD27cx6h6dzVMC0XBwtw4WA=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 mesquite.anon-DK9Lf.domain:
  NetBird IP: 100.91.155.86
  Public key: vp6GLJc22GQXj2Ht5deowZp0OA8kG7XJS1kYl3zc6lI=
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx
  ICE candidate endpoints (Local/Remote): 192.168.0.48:51820/198.51.100.7:51820
  Relay server address: rels://cyprus.serverforest.com:443
  Last connection update: 4 minutes, 45 seconds ago
  Last WireGuard handshake: 1 minute, 50 seconds ago
  Transfer status (received/sent) 392 B/716 B
  Quantum resistance: true
  Routes: -
  Latency: 347.723708ms

 miranda.anon-DK9Lf.domain:
  NetBird IP: 100.91.170.233
  Public key: D2k3MtkmfFLj9ZuJC/3KWEW1XhMesLNpHHz8P/86q2Q=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 ipad-admin.anon-DK9Lf.domain:
  NetBird IP: 100.91.178.26
  Public key: a+sg6th5zv4wl9zCN5/q5C3O8sZQh2SwgC/8gJZuyjQ=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 boaz.anon-DK9Lf.domain:
  NetBird IP: 100.91.182.98
  Public key: ydFumIBVUwCGBjx5Xh0pZPW1G6kFq2v+8DPNz1XYkRE=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 rahab.anon-DK9Lf.domain:
  NetBird IP: 100.91.203.23
  Public key: hxczQ9TIXjpDAFHDVzwjH6aDPlC5l5GcTj0LEmhgfRQ=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 windowspc.anon-DK9Lf.domain:
  NetBird IP: 100.91.212.35
  Public key: eDP33MB5NvltMsSq9XEoxYQXoBfJjLgX9BkA3/FjKnY=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 phebe.anon-DK9Lf.domain:
  NetBird IP: 100.91.224.117
  Public key: 0Bi8tUwaKffJVD69HXxQ6RbG+wdI1npXViS4Crw+yls=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

 ipad-admin-1.anon-DK9Lf.domain:
  NetBird IP: 100.91.251.74
  Public key: nheiiB0C3H5uYy+cDvWj34o9nKotHZwNTZ1lHjCB4UQ=
  Status: Disconnected
  -- detail --
  Connection type: 
  ICE candidate (Local/Remote): -/-
  ICE candidate endpoints (Local/Remote): -/-
  Relay server address: 
  Last connection update: -
  Last WireGuard handshake: -
  Transfer status (received/sent) 0 B/0 B
  Quantum resistance: false (remote didn't enable quantum resistance)
  Routes: -
  Latency: 0s

OS: darwin/arm64
Daemon version: 0.29.2
CLI version: 0.29.2
Management: Connected to https://anon1.anon-r6ORu.domain:443
Signal: Connected to https://anon1.anon-r6ORu.domain:30006
Relays: 
  [stun:anon1.anon-r6ORu.domain:3478] is Available
  [turn:anon1.anon-r6ORu.domain:3478?transport=udp] is Available
  [rels://anon1.anon-r6ORu.domain:443] is Available
Nameservers: 
FQDN: falstaff.anon-DK9Lf.domain
NetBird IP: 100.91.122.186/16
Interface type: Userspace
Quantum resistance: true (permissive)
Routes: -
Peers count: 8/21 Connected

Do you face any (non-mobile) client issues?

Yes, the issue prevents clients from functioning. At present most clients cannot connect, although a few connect consistently. I have not been able to discern a pattern: of two clients in the same location, one consistently connects and the other does not, and the difference does not appear to relate to platform (the working clients include both macOS and Debian Linux machines). Reauthorizing a client with a new setup key doesn't change its behavior for better or worse; each client seems "stuck" either working or not.

(All clients, however, show the Signal error given above at least part of the time.)
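
In case it helps anyone look for a pattern across peers, the netbird status -dA output can be grouped by status and ICE candidate type with a short script. This is a rough sketch, not a supported tool: parse_peers and summarize are hypothetical names, and the parser assumes the "Peers detail:" layout shown above, where a peer header is a single token ending in a colon.

```python
from collections import Counter


def parse_peers(status_text):
    """Parse the 'Peers detail:' section into a list of dicts, one per peer."""
    peers, current, in_peers = [], None, False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line == "Peers detail:":
            in_peers = True
            continue
        if not in_peers or not line:
            continue
        if line.startswith("OS:"):
            break  # start of the daemon summary that follows the peer list
        if line.endswith(":") and " " not in line:
            # Peer header such as "juniper.anon-DK9Lf.domain:"
            current = {"name": line[:-1]}
            peers.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return peers


def summarize(peers):
    """Return counts by status and, for connected peers, by ICE candidate pair."""
    by_status = Counter(p.get("Status", "?") for p in peers)
    by_ice = Counter(
        p.get("ICE candidate (Local/Remote)", "-/-")
        for p in peers
        if p.get("Status") == "Connected"
    )
    return by_status, by_ice


# Two peers abridged from the output above, used as sample input.
SAMPLE = """\
Peers detail:
 washington.anon-DK9Lf.domain:
  NetBird IP: 100.91.0.186
  Status: Disconnected
  -- detail --
  Connection type:
  ICE candidate (Local/Remote): -/-

 juniper.anon-DK9Lf.domain:
  NetBird IP: 100.91.12.9
  Status: Connected
  -- detail --
  Connection type: P2P
  ICE candidate (Local/Remote): host/prflx

OS: darwin/arm64
"""

if __name__ == "__main__":
    by_status, by_ice = summarize(parse_peers(SAMPLE))
    print(dict(by_status))
    print(dict(by_ice))
```

Running it against the full output from several clients, connected and disconnected alike, would show whether the stuck peers share a candidate type, relay, or network.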

Additional context

I'm using a modified version of the docker-compose.yml that was available back in December 2023. It's been upgraded to add the new relay container, remove Caddy (as noted above as part of troubleshooting), expose the NGINX SSL cert to Signal, etc. Because it is from last year, it uses CockroachDB instead of PostgreSQL. I've wondered about finding a way to migrate cleanly to PostgreSQL, though I don't know if that'd materially affect this problem or not.

My docker-compose.yml:

version: "3.4"
services:
  # Caddy reverse proxy
#  caddy:
#    image: caddy
#    restart: unless-stopped
#    networks: [ netbird ]
#    #ports:
#    #  - '443:443'
#    #  - '80:80'
#    #  - '8080:8080'
#    volumes:
#      - netbird_caddy_data:/data
#      - ./Caddyfile:/etc/caddy/Caddyfile
  relay:
    image: netbirdio/relay:latest
    restart: unless-stopped
    networks: [netbird]
    ports:
      - '30005:80'
    env_file:
      - ./relay.env
    logging:
      driver: "json-file"
      options:
        max-size: "500m"
        max-file: "2"

  #UI dashboard
  dashboard:
    image: netbirdio/dashboard:latest
    restart: unless-stopped
    networks: [netbird]
    ports: 
      - '30001:80'
    env_file:
      - ./dashboard.env
  # Signal
  signal:
    image: netbirdio/signal:latest
    restart: unless-stopped
    networks: [netbird]
    ports:
      - '30002:80'
      - '30006:443'
    command: [ "--log-file", "console","--log-level","debug","--cert-file","/ssl/fullchain.pem","--cert-key","/ssl/privkey.pem" ]
    volumes:
      - /etc/letsencrypt/live/anon1.anon-r6ORu.domain/fullchain.pem:/ssl/fullchain.pem:ro
      - /etc/letsencrypt/live/anon1.anon-r6ORu.domain/privkey.pem:/ssl/privkey.pem:ro
  # Management
  management:
    image: netbirdio/management:latest
    restart: unless-stopped
    networks: [netbird]
    ports:
      - '30003:80'
    volumes:
      - netbird_management:/var/lib/netbird
      - ./management.json:/etc/netbird/management.json
    command: [
      "--port", "80",
      "--log-file", "console",
      "--log-level", "info",
      "--disable-anonymous-metrics=false",
      "--single-account-mode-domain=anon2.domain",
      "--dns-domain=anon2.domain",
      "--idp-sign-key-refresh-enabled",
    ]
  # Coturn, AKA relay server
  coturn:
    image: coturn/coturn
    restart: unless-stopped
    domainname: netbird.relay.selfhosted
    volumes:
      - ./turnserver.conf:/etc/turnserver.conf:ro
    network_mode: host
    command:
      - -c /etc/turnserver.conf
  # Zitadel - identity provider
  zitadel:
    restart: 'always'
    networks: [netbird]
    image: 'ghcr.io/zitadel/zitadel:v2.31.3'
    command: 'start-from-init --masterkeyFromEnv --tlsMode external'
    ports:
      - '30004:8080'
    env_file:
      - ./zitadel.env
    depends_on:
      crdb:
        condition: 'service_healthy'
    volumes:
      - ./machinekey:/machinekey
      - netbird_zitadel_certs:/crdb-certs:ro
  # CockroachDB for zitadel
  crdb:
    restart: 'always'
    networks: [netbird]
    image: 'cockroachdb/cockroach:v22.2.2'
    command: 'start-single-node --advertise-addr crdb'
    volumes:
      - netbird_crdb_data:/cockroach/cockroach-data
      - netbird_crdb_certs:/cockroach/certs
      - netbird_zitadel_certs:/zitadel-certs
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8080/health?ready=1" ]
      interval: '10s'
      timeout: '30s'
      retries: 5
      start_period: '20s'

volumes:
  netbird_management:
  netbird_caddy_data:
  netbird_crdb_data:
  netbird_crdb_certs:
  netbird_zitadel_certs:

networks:
  netbird:

netbirdio/netbird issue #2625