netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License
11.01k stars 504 forks source link

Signal Service - Peers are reconnecting every minute #2358

Closed da-wilky closed 3 months ago

da-wilky commented 3 months ago

Hello,

Describe the problem

Im using docker to host my own netbird management service behind traefik reverse-proxy. Everything is working, but my signal service is permently logging reconnects of the peers. Im curious if this is the expected behavior.

Example logs

signal-1  | 2024-07-31T08:44:33Z INFO signal/server/signal.go:110: peer connected [<client1>] [streamID 1722415473781694381] 
signal-1  | 2024-07-31T08:44:38Z INFO signal/server/signal.go:110: peer connected [<client2>] [streamID 1722415478424239799] 
signal-1  | 2024-07-31T08:44:56Z INFO signal/server/signal.go:110: peer connected [<client3>] [streamID 1722415496841138613] 
signal-1  | 2024-07-31T08:44:59Z INFO signal/server/signal.go:110: peer connected [<client4>] [streamID 1722415499259544723] 
signal-1  | 2024-07-31T08:44:59Z INFO signal/server/signal.go:110: peer connected [<client5>] [streamID 1722415499922935186] 
signal-1  | 2024-07-31T08:45:01Z INFO signal/server/signal.go:110: peer connected [<client6>] [streamID 1722415501822241363] 
signal-1  | 2024-07-31T08:45:33Z INFO signal/server/signal.go:95: peer disconnected [<client1>] [streamID 1722415473781694381] 
signal-1  | 2024-07-31T08:45:34Z INFO signal/server/signal.go:110: peer connected [<client1>] [streamID 1722415534361482785] 
signal-1  | 2024-07-31T08:45:38Z INFO signal/server/signal.go:95: peer disconnected [<client2>] [streamID 1722415478424239799] 
signal-1  | 2024-07-31T08:45:39Z INFO signal/server/signal.go:110: peer connected [<client2>] [streamID 1722415539494496594] 
signal-1  | 2024-07-31T08:45:56Z INFO signal/server/signal.go:95: peer disconnected [<client3>] [streamID 1722415496841138613] 
signal-1  | 2024-07-31T08:45:57Z INFO signal/server/signal.go:110: peer connected [<client3>] [streamID 1722415557714556130] 
signal-1  | 2024-07-31T08:45:59Z INFO signal/server/signal.go:95: peer disconnected [<client4>] [streamID 1722415499259544723] 
signal-1  | 2024-07-31T08:45:59Z INFO signal/server/signal.go:95: peer disconnected [<client5>] [streamID 1722415499922935186] 
signal-1  | 2024-07-31T08:46:00Z INFO signal/server/signal.go:110: peer connected [<client5>] [streamID 1722415560435622333] 
signal-1  | 2024-07-31T08:46:00Z INFO signal/server/signal.go:110: peer connected [<client4>] [streamID 1722415560465990002] 
signal-1  | 2024-07-31T08:46:01Z INFO signal/server/signal.go:95: peer disconnected [<client6>] [streamID 1722415501822241363] 
signal-1  | 2024-07-31T08:46:03Z INFO signal/server/signal.go:110: peer connected [<client6>] [streamID 1722415563203704936] 
signal-1  | 2024-07-31T08:46:34Z INFO signal/server/signal.go:95: peer disconnected [<client1>] [streamID 1722415534361482785] 
signal-1  | 2024-07-31T08:46:35Z INFO signal/server/signal.go:110: peer connected [<client1>] [streamID 1722415595862371424]

On the clients the following logs are produced:

2024-07-31T10:45:33+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:45:34+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:46:34+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:46:35+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:47:35+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:47:37+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:48:37+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:48:38+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:49:38+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:49:40+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:50:40+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:50:40+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream
2024-07-31T10:51:40+02:00 WARN signal/client/grpc.go:160: disconnected from the Signal service but will retry silently. Reason: rpc error: code = Internal desc = server closed the stream without sending trailers
2024-07-31T10:51:40+02:00 INFO signal/client/grpc.go:147: connected to the Signal Service stream

So it reconnects every minute because of an rpc error. But just by googling I cant figure out why. Again - everything is still working fine - at least I cant recognize any errors apart those logs.

To Reproduce

Steps to reproduce the behavior:

  1. Host your own netbird with traefik proxy.

    
    version: "3"
    services:
    #UI dashboard
    dashboard:
    image: netbirdio/dashboard:latest
    restart: unless-stopped
    #ports:
    #  - 80:80
    #  - 443:443
    environment:
      # Endpoints
      - NETBIRD_MGMT_API_ENDPOINT=https://mydomain.com:443
      - NETBIRD_MGMT_GRPC_API_ENDPOINT=https://mydomain.com:443
      # OIDC
      - AUTH_AUDIENCE=netbird-client
      - AUTH_CLIENT_ID=netbird-client
      - AUTH_CLIENT_SECRET=
      - AUTH_AUTHORITY=https://mykeycloak.com/realms/netbird
      - USE_AUTH0=false
      - AUTH_SUPPORTED_SCOPES=openid profile email offline_access api
      - AUTH_REDIRECT_URI=
      - AUTH_SILENT_REDIRECT_URI=
      - NETBIRD_TOKEN_SOURCE=accessToken
      # SSL
      - NGINX_SSL_PORT=443
    networks:
      default:
      traefik_net:
    labels:
      - traefik.enable=true
      - traefik.http.routers.netbird-dashboard.entrypoints=http
      - traefik.http.routers.netbird-dashboard.rule=Host(`mydomain.com`)
      - traefik.http.routers.netbird-dashboard.middlewares=to-https@file
      - traefik.http.routers.netbird-dashboard-secure.entrypoints=https
      - traefik.http.routers.netbird-dashboard-secure.rule=Host(`mydomain.com`)
      - traefik.http.routers.netbird-dashboard-secure.service=netbird-dashboard
      - traefik.http.services.netbird-dashboard.loadbalancer.server.port=80
    
    # Signal
    signal:
    image: netbirdio/signal:latest
    restart: unless-stopped
    volumes:
      - netbird-signal:/var/lib/netbird
    #ports:
    #  - 10000:80
    ## port and command for Let's Encrypt validation
    #  - 443:443
    #  command: ["--letsencrypt-domain", "", "--log-file", "console"]
    networks:
      default:
      traefik_net:
    labels:
    - traefik.enable=true
    - traefik.http.routers.netbird-signal.entrypoints=http
    - traefik.http.routers.netbird-signal.rule=Host(`mydomain.com`) && PathPrefix(`/signalexchange.SignalExchange/`)
    - traefik.http.routers.netbird-signal.middlewares=to-https@file
    - traefik.http.routers.netbird-signal-secure.entrypoints=https
    - traefik.http.routers.netbird-signal-secure.rule=Host(`mydomain.com`) && PathPrefix(`/signalexchange.SignalExchange`)
    - traefik.http.routers.netbird-signal-secure.service=netbird-signal
    - traefik.http.services.netbird-signal.loadbalancer.server.port=80
    - traefik.http.services.netbird-signal.loadbalancer.server.scheme=h2c
    
    # Management
    management:
    image: netbirdio/management:latest
    restart: unless-stopped
    depends_on:
      - dashboard
    volumes:
      - netbird-mgmt:/var/lib/netbird
      - ./management.json:/etc/netbird/management.json
    #ports:
    #  - 443:443 #API port
    #    # command for Let's Encrypt validation without dashboard container
    #    command: ["--letsencrypt-domain", "", "--log-file", "console"]
    command: [
      "--port", "443",
      "--log-file", "console",
      "--disable-anonymous-metrics=true",
      "--single-account-mode-domain=mydomain.com",
      "--dns-domain=netbird.selfhosted",
      "--user-delete-from-idp" # Added myself so netbird can handle users on its own
      ]
    networks:
      default:
      traefik_net:
    labels:
    - traefik.enable=true
    - traefik.http.routers.netbird-api.entrypoints=http
    - traefik.http.routers.netbird-api.rule=Host(`mydomain.com`) && PathPrefix(`/api`)
    - traefik.http.routers.netbird-api.middlewares=to-https@file
    - traefik.http.routers.netbird-api.service=netbird-api
    - traefik.http.routers.netbird-api-secure.entrypoints=https
    - traefik.http.routers.netbird-api-secure.rule=Host(`mydomain.com`) && PathPrefix(`/api`)
    - traefik.http.routers.netbird-api-secure.service=netbird-api
    - traefik.http.services.netbird-api.loadbalancer.server.port=443
    
    - traefik.http.routers.netbird-management.entrypoints=http
    - traefik.http.routers.netbird-management.rule=Host(`mydomain.com`) && PathPrefix(`/management.ManagementService/`)
    - traefik.http.routers.netbird-management.middlewares=to-https@file
    - traefik.http.routers.netbird-management.service=netbird-management
    - traefik.http.routers.netbird-management-secure.entrypoints=https
    - traefik.http.routers.netbird-management-secure.rule=Host(`mydomain.com`) && PathPrefix(`/management.ManagementService/`)
    - traefik.http.routers.netbird-management-secure.service=netbird-management
    - traefik.http.services.netbird-management.loadbalancer.server.port=443
    - traefik.http.services.netbird-management.loadbalancer.server.scheme=h2c
    
    # Coturn
    coturn:
    image: coturn/coturn:latest
    restart: unless-stopped
    domainname: mydomain.com
    volumes:
      - ./turnserver.conf:/etc/turnserver.conf:ro
    #      - ./privkey.pem:/etc/coturn/private/privkey.pem:ro
    #      - ./cert.pem:/etc/coturn/certs/cert.pem:ro
    network_mode: host
    command:
      - -c /etc/turnserver.conf

volumes: netbird-mgmt: netbird-signal:

networks: traefik_net: external: true

2. Run the services
3. Connect clients (I have clients using versions: `0.28.4`, `0.28.5` and `0.28.6` with all the same errors.)

**Expected behavior**

I expected it to be working without logs. It's indeed working, but with those logs that I dont know if they are problem. But guess it is one if they are produced - at least some misconfiguration.

**NetBird version**

Tested on `0.28.4`, `0.28.5` and `0.28.6`

**NetBird status -d output:**

Peers detail: peer1.netbird.selfhosted: NetBird IP: /32 Public key: BJjacf7HSPwCzkxFkUlFlWMBbXiFQyLzhc2z6YrBd3E= Status: Disconnected -- detail -- Connection type: P2P Direct: false ICE candidate (Local/Remote): host/srflx ICE candidate endpoints (Local/Remote): :51820/:62082 Last connection update: 20 minutes, 7 seconds ago Last WireGuard handshake: 2 minutes, 26 seconds ago Transfer status (received/sent) 2.1 KiB/3.8 KiB Quantum resistance: false Routes: - Latency: 0s

peer2.netbird.selfhosted: NetBird IP: /32 Public key: FGy/FK+e8p3qTFulkif4V+hI81U9I+QFKqttzBx7qE4= Status: Disconnected -- detail -- Connection type: P2P Direct: false ICE candidate (Local/Remote): host/srflx ICE candidate endpoints (Local/Remote): :51820/:62082 Last connection update: 20 minutes, 7 seconds ago Last WireGuard handshake: 2 minutes, 26 seconds ago Transfer status (received/sent) 2.1 KiB/3.8 KiB Quantum resistance: false Routes: - Latency: 0s

peer3.netbird.selfhosted: NetBird IP: Public key: qXy69ZUHM8IQJIapY/5z54AGTa2UyXPSefDISbzyByA= Status: Connected -- detail -- Connection type: P2P Direct: true ICE candidate (Local/Remote): srflx/srflx ICE candidate endpoints (Local/Remote): :51820/:51820 Last connection update: 34 minutes, 36 seconds ago Last WireGuard handshake: 1 minute, 57 seconds ago Transfer status (received/sent) 3.0 MiB/25.1 MiB Quantum resistance: false Routes: - Latency: 12.297417ms

peer4.netbird.selfhosted: NetBird IP: Public key: MvDBVV63GFtK+MdEy3lV9/73Fuw5eUaRmNfQTMEi3DQ= Status: Connected -- detail -- Connection type: P2P Direct: true ICE candidate (Local/Remote): host/srflx ICE candidate endpoints (Local/Remote): :51820/:62082 Last connection update: 34 minutes, 38 seconds ago Last WireGuard handshake: 2 minutes, 26 seconds ago Transfer status (received/sent) 2.1 KiB/3.8 KiB Quantum resistance: false Routes: - Latency: 12.580678ms

OS: linux/amd64 Daemon version: 0.28.4 CLI version: 0.28.4 Management: Connected to https://mydomain.com:443 Signal: Connected to https://mydomain.com:443 Relays: [stun:mydomain.com:3478] is Available [turn:mydomain.com:3478?transport=udp] is Available Nameservers: [1.1.1.1:53, 1.0.0.1:53] for [.] is Available FQDN: peer5.netbird.selfhosted NetBird IP: /16 Interface type: Kernel Quantum resistance: false Routes: - Peers count: 2/4 Connected



Glad if anyone has an idea. If any information is missing please request it.

Have a good one!
Sleepingme00 commented 3 months ago

I also have the same issue, appreciate anyone give a suggestion.

lixmal commented 3 months ago

Do you have any proxy in front that cuts the grpc connection after some duration? Could be traefik itself

da-wilky commented 3 months ago

You are completly right, thank you for the hint! I had some research before in the direction of traefik, but didnt find anything, so I thought its about the signaling service, even tho it's indeed a traefik misconfiguration and not related to netbird.

Traefik has a default readTimeout of 60s. (https://doc.traefik.io/traefik/routing/entrypoints/#respondingtimeouts) This is to protect traefik from DoS Attacks, see the following resources:

Currently its not possible to set this timeout on an per router/service level with traefik (resource 1, resource 2). Hoepfully they plan to implement this feature.

So the current possibilities are:

  1. Live with the logs - it's still working fine and will be reconnecting immediatly after traefik cuts the connection. You might also increase the default duration by a few seconds to minimize the logs. Your decision if this is worth it - as said, its working fine and will be automatically reconnecting. You would just get a few less logs.
  2. Disable readTimeout by setting it to 0 with the huge drawback of exposing your traefik to the vulnerability.
  3. Switch from traefik to e.g. NGINX, that supports setting a custom read timeout per server/location.

Hope that helps somebody. Thanks to @lixmal for pointing the direction!

Cheers

Sleepingme00 commented 3 months ago

You are completly right, thank you for the hint! I had some research before in the direction of traefik, but didnt find anything, so I thought its about the signaling service, even tho it's indeed a traefik misconfiguration and not related to netbird.

Traefik has a default readTimeout of 60s. (https://doc.traefik.io/traefik/routing/entrypoints/#respondingtimeouts) This is to protect traefik from DoS Attacks, see the following resources:

Currently its not possible to set this timeout on an per router/service level with traefik (resource 1, resource 2). Hoepfully they plan to implement this feature.

So the current possibilities are:

  1. Live with the logs - it's still working fine and will be reconnecting immediatly after traefik cuts the connection. You might also increase the default duration by a few seconds to minimize the logs. Your decision if this is worth it - as said, its working fine and will be automatically reconnecting. You would just get a few less logs.
  2. Disable readTimeout by setting it to 0 with the huge drawback of exposing your traefik to the vulnerability.
  3. Switch from traefik to e.g. NGINX, that supports setting a custom read timeout per server/location.

Hope that helps somebody. Thanks to @lixmal for pointing the direction!

Cheers

Refer to @da-wilky 's suggestion, i also has updated the Traefik config as following.

entryPoints:
  http:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: https
          scheme: https
  https:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: 0

after retarting the sercies, peers are reconnecting every minute was gone! The corresponding risk also follows, of course, this balance can only be measured by myself. Such a great reminder, appreciate @da-wilky @lixmal 's guidance!

signal-1-log