Signal Service - Peers are reconnecting every minute

da-wilky commented 3 months ago

Hello,

Describe the problem

Im using docker to host my own netbird management service behind traefik reverse-proxy. Everything is working, but my signal service is permently logging reconnects of the peers. Im curious if this is the expected behavior.

Example logs

signal-1  | 2024-07-31T08:44:33Z INFO signal/server/signal.go:110: peer connected [<client1>] [streamID 1722415473781694381] 
signal-1  | 2024-07-31T08:44:38Z INFO signal/server/signal.go:110: peer connected [<client2>] [streamID 1722415478424239799] 
signal-1  | 2024-07-31T08:44:56Z INFO signal/server/signal.go:110: peer connected [<client3>] [streamID 1722415496841138613] 
signal-1  | 2024-07-31T08:44:59Z INFO signal/server/signal.go:110: peer connected [<client4>] [streamID 1722415499259544723] 
signal-1  | 2024-07-31T08:44:59Z INFO signal/server/signal.go:110: peer connected [<client5>] [streamID 1722415499922935186] 
signal-1  | 2024-07-31T08:45:01Z INFO signal/server/signal.go:110: peer connected [<client6>] [streamID 1722415501822241363] 
signal-1  | 2024-07-31T08:45:33Z INFO signal/server/signal.go:95: peer disconnected [<client1>] [streamID 1722415473781694381] 
signal-1  | 2024-07-31T08:45:34Z INFO signal/server/signal.go:110: peer connected [<client1>] [streamID 1722415534361482785] 
signal-1  | 2024-07-31T08:45:38Z INFO signal/server/signal.go:95: peer disconnected [<client2>] [streamID 1722415478424239799] 
signal-1  | 2024-07-31T08:45:39Z INFO signal/server/signal.go:110: peer connected [<client2>] [streamID 1722415539494496594] 
signal-1  | 2024-07-31T08:45:56Z INFO signal/server/signal.go:95: peer disconnected [<client3>] [streamID 1722415496841138613] 
signal-1  | 2024-07-31T08:45:57Z INFO signal/server/signal.go:110: peer connected [<client3>] [streamID 1722415557714556130] 
signal-1  | 2024-07-31T08:45:59Z INFO signal/server/signal.go:95: peer disconnected [<client4>] [streamID 1722415499259544723] 
signal-1  | 2024-07-31T08:45:59Z INFO signal/server/signal.go:95: peer disconnected [<client5>] [streamID 1722415499922935186] 
signal-1  | 2024-07-31T08:46:00Z INFO signal/server/signal.go:110: peer connected [<client5>] [streamID 1722415560435622333] 
signal-1  | 2024-07-31T08:46:00Z INFO signal/server/signal.go:110: peer connected [<client4>] [streamID 1722415560465990002] 
signal-1  | 2024-07-31T08:46:01Z INFO signal/server/signal.go:95: peer disconnected [<client6>] [streamID 1722415501822241363] 
signal-1  | 2024-07-31T08:46:03Z INFO signal/server/signal.go:110: peer connected [<client6>] [streamID 1722415563203704936] 
signal-1  | 2024-07-31T08:46:34Z INFO signal/server/signal.go:95: peer disconnected [<client1>] [streamID 1722415534361482785] 
signal-1  | 2024-07-31T08:46:35Z INFO signal/server/signal.go:110: peer connected [<client1>] [streamID 1722415595862371424]

On the clients the following logs are produced:

2024-07-31T10:45:33+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:45:34+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:46:34+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:46:35+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:47:35+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:47:37+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:48:37+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:48:38+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:49:38+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:49:40+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:50:40+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:50:40+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:51:40+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:51:40+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream

So it reconnects every minute because of an rpc error. But just by googling I cant figure out why. Again - everything is still working fine - at least I cant recognize any errors apart those logs.

To Reproduce

Steps to reproduce the behavior:

Host your own netbird with traefik proxy.


version: "3"
services:
#UI dashboard
dashboard:
image: netbirdio/dashboard:latest
restart: unless-stopped
#ports:
#  - 80:80
#  - 443:443
environment:
  # Endpoints
  - NETBIRD_MGMT_API_ENDPOINT=https://mydomain.com:443
  - NETBIRD_MGMT_GRPC_API_ENDPOINT=https://mydomain.com:443
  # OIDC
  - AUTH_AUDIENCE=netbird-client
  - AUTH_CLIENT_ID=netbird-client
  - AUTH_CLIENT_SECRET=
  - AUTH_AUTHORITY=https://mykeycloak.com/realms/netbird
  - USE_AUTH0=false
  - AUTH_SUPPORTED_SCOPES=openid profile email offline_access api
  - AUTH_REDIRECT_URI=
  - AUTH_SILENT_REDIRECT_URI=
  - NETBIRD_TOKEN_SOURCE=accessToken
  # SSL
  - NGINX_SSL_PORT=443
networks:
  default:
  traefik_net:
labels:
  - traefik.enable=true
  - traefik.http.routers.netbird-dashboard.entrypoints=http
  - traefik.http.routers.netbird-dashboard.rule=Host(`mydomain.com`)
  - traefik.http.routers.netbird-dashboard.middlewares=to-https@file
  - traefik.http.routers.netbird-dashboard-secure.entrypoints=https
  - traefik.http.routers.netbird-dashboard-secure.rule=Host(`mydomain.com`)
  - traefik.http.routers.netbird-dashboard-secure.service=netbird-dashboard
  - traefik.http.services.netbird-dashboard.loadbalancer.server.port=80

# Signal
signal:
image: netbirdio/signal:latest
restart: unless-stopped
volumes:
  - netbird-signal:/var/lib/netbird
#ports:
#  - 10000:80
## port and command for Let's Encrypt validation
#  - 443:443
#  command: ["--letsencrypt-domain", "", "--log-file", "console"]
networks:
  default:
  traefik_net:
labels:
- traefik.enable=true
- traefik.http.routers.netbird-signal.entrypoints=http
- traefik.http.routers.netbird-signal.rule=Host(`mydomain.com`) && PathPrefix(`/signalexchange.SignalExchange/`)
- traefik.http.routers.netbird-signal.middlewares=to-https@file
- traefik.http.routers.netbird-signal-secure.entrypoints=https
- traefik.http.routers.netbird-signal-secure.rule=Host(`mydomain.com`) && PathPrefix(`/signalexchange.SignalExchange`)
- traefik.http.routers.netbird-signal-secure.service=netbird-signal
- traefik.http.services.netbird-signal.loadbalancer.server.port=80
- traefik.http.services.netbird-signal.loadbalancer.server.scheme=h2c

# Management
management:
image: netbirdio/management:latest
restart: unless-stopped
depends_on:
  - dashboard
volumes:
  - netbird-mgmt:/var/lib/netbird
  - ./management.json:/etc/netbird/management.json
#ports:
#  - 443:443 #API port
#    # command for Let's Encrypt validation without dashboard container
#    command: ["--letsencrypt-domain", "", "--log-file", "console"]
command: [
  "--port", "443",
  "--log-file", "console",
  "--disable-anonymous-metrics=true",
  "--single-account-mode-domain=mydomain.com",
  "--dns-domain=netbird.selfhosted",
  "--user-delete-from-idp" # Added myself so netbird can handle users on its own
  ]
networks:
  default:
  traefik_net:
labels:
- traefik.enable=true
- traefik.http.routers.netbird-api.entrypoints=http
- traefik.http.routers.netbird-api.rule=Host(`mydomain.com`) && PathPrefix(`/api`)
- traefik.http.routers.netbird-api.middlewares=to-https@file
- traefik.http.routers.netbird-api.service=netbird-api
- traefik.http.routers.netbird-api-secure.entrypoints=https
- traefik.http.routers.netbird-api-secure.rule=Host(`mydomain.com`) && PathPrefix(`/api`)
- traefik.http.routers.netbird-api-secure.service=netbird-api
- traefik.http.services.netbird-api.loadbalancer.server.port=443

- traefik.http.routers.netbird-management.entrypoints=http
- traefik.http.routers.netbird-management.rule=Host(`mydomain.com`) && PathPrefix(`/management.ManagementService/`)
- traefik.http.routers.netbird-management.middlewares=to-https@file
- traefik.http.routers.netbird-management.service=netbird-management
- traefik.http.routers.netbird-management-secure.entrypoints=https
- traefik.http.routers.netbird-management-secure.rule=Host(`mydomain.com`) && PathPrefix(`/management.ManagementService/`)
- traefik.http.routers.netbird-management-secure.service=netbird-management
- traefik.http.services.netbird-management.loadbalancer.server.port=443
- traefik.http.services.netbird-management.loadbalancer.server.scheme=h2c

# Coturn
coturn:
image: coturn/coturn:latest
restart: unless-stopped
domainname: mydomain.com
volumes:
  - ./turnserver.conf:/etc/turnserver.conf:ro
#      - ./privkey.pem:/etc/coturn/private/privkey.pem:ro
#      - ./cert.pem:/etc/coturn/certs/cert.pem:ro
network_mode: host
command:
  - -c /etc/turnserver.conf

volumes: netbird-mgmt: netbird-signal:

networks: traefik_net: external: true

2. Run the services
3. Connect clients (I have clients using versions: `0.28.4`, `0.28.5` and `0.28.6` with all the same errors.)

**Expected behavior**

I expected it to be working without logs. It's indeed working, but with those logs that I dont know if they are problem. But guess it is one if they are produced - at least some misconfiguration.

**NetBird version**

Tested on `0.28.4`, `0.28.5` and `0.28.6`

**NetBird status -d output:**

Peers detail: peer1.netbird.selfhosted: NetBird IP: /32 Public key: BJjacf7HSPwCzkxFkUlFlWMBbXiFQyLzhc2z6YrBd3E= Status: Disconnected -- detail -- Connection type: P2P Direct: false ICE candidate (Local/Remote): host/srflx ICE candidate endpoints (Local/Remote): :51820/:62082 Last connection update: 20 minutes, 7 seconds ago Last WireGuard handshake: 2 minutes, 26 seconds ago Transfer status (received/sent) 2.1 KiB/3.8 KiB Quantum resistance: false Routes: - Latency: 0s

peer2.netbird.selfhosted: NetBird IP: /32 Public key: FGy/FK+e8p3qTFulkif4V+hI81U9I+QFKqttzBx7qE4= Status: Disconnected -- detail -- Connection type: P2P Direct: false ICE candidate (Local/Remote): host/srflx ICE candidate endpoints (Local/Remote): :51820/:62082 Last connection update: 20 minutes, 7 seconds ago Last WireGuard handshake: 2 minutes, 26 seconds ago Transfer status (received/sent) 2.1 KiB/3.8 KiB Quantum resistance: false Routes: - Latency: 0s

peer3.netbird.selfhosted: NetBird IP: Public key: qXy69ZUHM8IQJIapY/5z54AGTa2UyXPSefDISbzyByA= Status: Connected -- detail -- Connection type: P2P Direct: true ICE candidate (Local/Remote): srflx/srflx ICE candidate endpoints (Local/Remote): :51820/:51820 Last connection update: 34 minutes, 36 seconds ago Last WireGuard handshake: 1 minute, 57 seconds ago Transfer status (received/sent) 3.0 MiB/25.1 MiB Quantum resistance: false Routes: - Latency: 12.297417ms

peer4.netbird.selfhosted: NetBird IP: Public key: MvDBVV63GFtK+MdEy3lV9/73Fuw5eUaRmNfQTMEi3DQ= Status: Connected -- detail -- Connection type: P2P Direct: true ICE candidate (Local/Remote): host/srflx ICE candidate endpoints (Local/Remote): :51820/:62082 Last connection update: 34 minutes, 38 seconds ago Last WireGuard handshake: 2 minutes, 26 seconds ago Transfer status (received/sent) 2.1 KiB/3.8 KiB Quantum resistance: false Routes: - Latency: 12.580678ms

OS: linux/amd64 Daemon version: 0.28.4 CLI version: 0.28.4 Management: Connected to https://mydomain.com:443 Signal: Connected to https://mydomain.com:443 Relays: [stun:mydomain.com:3478] is Available [turn:mydomain.com:3478?transport=udp] is Available Nameservers: [1.1.1.1:53, 1.0.0.1:53] for [.] is Available FQDN: peer5.netbird.selfhosted NetBird IP: /16 Interface type: Kernel Quantum resistance: false Routes: - Peers count: 2/4 Connected



Glad if anyone has an idea. If any information is missing please request it.

Have a good one!

Sleepingme00 commented 3 months ago

I also have the same issue, appreciate anyone give a suggestion.

lixmal commented 3 months ago

Do you have any proxy in front that cuts the grpc connection after some duration? Could be traefik itself

da-wilky commented 3 months ago

You are completly right, thank you for the hint! I had some research before in the direction of traefik, but didnt find anything, so I thought its about the signaling service, even tho it's indeed a traefik misconfiguration and not related to netbird.

Traefik has a default readTimeout of 60s. (https://doc.traefik.io/traefik/routing/entrypoints/#respondingtimeouts) This is to protect traefik from DoS Attacks, see the following resources:

Currently its not possible to set this timeout on an per router/service level with traefik (resource 1, resource 2). Hoepfully they plan to implement this feature.

So the current possibilities are:

Live with the logs - it's still working fine and will be reconnecting immediatly after traefik cuts the connection. You might also increase the default duration by a few seconds to minimize the logs. Your decision if this is worth it - as said, its working fine and will be automatically reconnecting. You would just get a few less logs.
Disable readTimeout by setting it to 0 with the huge drawback of exposing your traefik to the vulnerability.
Switch from traefik to e.g. NGINX, that supports setting a custom read timeout per server/location.

Hope that helps somebody. Thanks to @lixmal for pointing the direction!

Cheers

Sleepingme00 commented 3 months ago

You are completly right, thank you for the hint! I had some research before in the direction of traefik, but didnt find anything, so I thought its about the signaling service, even tho it's indeed a traefik misconfiguration and not related to netbird.

Traefik has a default readTimeout of 60s. (https://doc.traefik.io/traefik/routing/entrypoints/#respondingtimeouts) This is to protect traefik from DoS Attacks, see the following resources:

Default Read Timeout got introduced

Disable Read Timeout makes traefik vulnerable to CVE-2024-28869

CVE-2024-28869

Currently its not possible to set this timeout on an per router/service level with traefik (resource 1, resource 2). Hoepfully they plan to implement this feature.

So the current possibilities are:

Live with the logs - it's still working fine and will be reconnecting immediatly after traefik cuts the connection. You might also increase the default duration by a few seconds to minimize the logs. Your decision if this is worth it - as said, its working fine and will be automatically reconnecting. You would just get a few less logs.

Disable readTimeout by setting it to 0 with the huge drawback of exposing your traefik to the vulnerability.

Switch from traefik to e.g. NGINX, that supports setting a custom read timeout per server/location.

Hope that helps somebody. Thanks to @lixmal for pointing the direction!

Cheers

Refer to @da-wilky 's suggestion, i also has updated the Traefik config as following.

entryPoints:
  http:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: https
          scheme: https
  https:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: 0

after retarting the sercies, peers are reconnecting every minute was gone! The corresponding risk also follows, of course, this balance can only be measured by myself. Such a great reminder, appreciate @da-wilky @lixmal 's guidance!

signal-1-log

netbirdio / netbird

Signal Service - Peers are reconnecting every minute #2358