`netbird up` fails after `netbird down`

synfinatic commented 9 months ago

Describe the problem

Ran netbird down followed by netbird up on a RasPi running Linux/Debian 12. The up command failed with these errors:

$ netbird up
2024-02-23T16:01:09Z WARN client/cmd/root.go:195: retrying Login to the Management service in 1.127617486s due to error rpc error: code = Unknown desc = getting device authorization flow info failed with error: context deadline exceeded
2024-02-23T16:01:20Z WARN client/cmd/root.go:195: retrying Login to the Management service in 2.043948863s due to error rpc error: code = Unknown desc = getting device authorization flow info failed with error: context deadline exceeded
Error: login backoff cycle failed: rpc error: code = Unknown desc = getting device authorization flow info failed with error: context deadline exceeded

I diagnosed the root cause: netbird up modified the /etc/resolv.conf file, but netbird down did not restore the original list of nameserver entries. The NetBird DNS server is not available while NetBird is down, so DNS resolution fails. Manually editing the file and commenting out the line reading nameserver 100.93.254.165 fixed the issue.
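
One way to spot this state is to list the nameserver entries in /etc/resolv.conf that point into the overlay range. The sketch below is illustrative and not part of NetBird: the function name overlay_nameservers is hypothetical, and it assumes the NetBird resolver address falls inside 100.64.0.0/10 (the address 100.93.254.165 from this report does).

```shell
#!/bin/sh
# List nameserver entries in a resolv.conf-style file whose address lies in
# 100.64.0.0/10 (assumption: the overlay DNS address is in this range).
# Commented-out lines are ignored because their first field is not "nameserver".
overlay_nameservers() {
    file="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" {
        n = split($2, o, ".")
        # 100.64.0.0/10 covers 100.64.x.x through 100.127.x.x
        if (n == 4 && o[1] + 0 == 100 && o[2] + 0 >= 64 && o[2] + 0 <= 127) print $2
    }' "$file"
}
```

Called with no argument it reads /etc/resolv.conf. Any address it prints while NetBird is down is a resolver that is probably unreachable.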

To Reproduce

See above.

Expected behavior

netbird up succeeds

Are you using NetBird Cloud?

Yes.

NetBird version

0.25.7

pascal-fischer commented 9 months ago

Hi, could you run some tests, please? While you are connected to management can you run the following commands and send me the output:

sudo ls -l /var/lib/netbird
sudo cat /var/lib/netbird/manager
sudo ls -l /var/lib/netbird/resolv.conf
sudo ls -l /etc/resolv*

Thanks!

synfinatic commented 8 months ago

root@raspi-blue:~# ls -l /var/lib/netbird
total 8
-rw-r--r-- 1 root root 19 Feb 28 01:33 manager
-rw-r--r-- 1 root root 79 Feb 28 01:33 resolv.conf
root@raspi-blue:~#
root@raspi-blue:~#
root@raspi-blue:~# cat /var/lib/netbird/manager
file,100.93.254.165root@raspi-blue:~#
root@raspi-blue:~#
root@raspi-blue:~# ls -l /var/lib/netbird/resolv.conf
-rw-r--r-- 1 root root 79 Feb 28 01:33 /var/lib/netbird/resolv.conf
root@raspi-blue:~#
root@raspi-blue:~#
root@raspi-blue:~# ls -l /etc/resolv*
-rw-r--r-- 1 root root 217 Feb 28 01:33 /etc/resolv.conf
-rw-r--r-- 1 root root  79 Feb 28 01:25 /etc/resolv.conf.original.netbird
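
The listing shows what appears to be a backup at /etc/resolv.conf.original.netbird (79 bytes) next to a live /etc/resolv.conf of 217 bytes. A rough sketch for seeing which nameserver entries exist only in the live file follows. The helper names are hypothetical, and the comparison looks at nameserver lines only, not search or options lines.

```shell
#!/bin/sh
# Print the sorted, de-duplicated nameserver addresses from a resolv.conf-style file.
nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1" | sort -u
}

# Print nameservers present in the current file but absent from the backup.
# Defaults are the two paths seen in the listing above.
added_nameservers() {
    current="${1:-/etc/resolv.conf}"
    backup="${2:-/etc/resolv.conf.original.netbird}"
    cur_list=$(mktemp) || return 1
    bak_list=$(mktemp) || { rm -f "$cur_list"; return 1; }
    nameservers "$current" > "$cur_list"
    nameservers "$backup" > "$bak_list"
    comm -23 "$cur_list" "$bak_list"   # keep lines unique to the first list
    rm -f "$cur_list" "$bak_list"
}
```

If added_nameservers still prints the overlay address after netbird down, the original entries were not restored.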

teoder commented 8 months ago

I'm currently having the exact same error on Linux machines when I try to use a Setup Key. Windows PCs can log in with SSO, but it takes a while and a couple of reconnects.

I'm running a self-hosted version where everything but the reverse proxy runs in Docker containers. We use the same URL and port for both the Management and Admin URLs. Is that a problem?

I use nginx as the reverse proxy with the following configuration:

upstream dashboard {
  server 127.0.0.1:8180;
  keepalive 10;
}

upstream signal {
  server 127.0.0.1:8100;
}

upstream api {
  server 127.0.0.1:8443;
}

upstream management {
  server 127.0.0.1:8443;
}

server {
    listen 80;
    server_name _;

  # 301 redirect to HTTPS
    location / {
      return 301 https://$host$request_uri;
    }

}

server {
    # HTTPS server config
    listen 443 ssl http2;
    server_name _;
    access_log /var/log/nginx/access.log;
    client_header_timeout 1d;
    client_body_timeout 1d;

    proxy_set_header        X-Real-IP $remote_addr;
    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header        X-Scheme $scheme;
    proxy_set_header        X-Forwarded-Proto https;
    proxy_set_header        X-Forwarded-Host $host;

    # Proxy dashboard
    location / {
        proxy_pass http://dashboard;
    }

    # Proxy Signal
    location /signalexchange.SignalExchange/ {
        grpc_pass grpc://signal;
        #grpc_ssl_verify off;
        grpc_read_timeout 1d;
        grpc_send_timeout 1d;
        grpc_socket_keepalive on;
    }
    # Proxy Management http endpoint
    location /api {
        proxy_pass http://api;
        proxy_set_header        X-Real-IP $remote_addr;
        proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header        X-Scheme $scheme;
        proxy_set_header        X-Forwarded-Proto https;
        proxy_set_header        X-Forwarded-Host $host;
        proxy_http_version 1.1;
    }
    # Proxy Management grpc endpoint
    location /management.ManagementService/ {
        grpc_pass grpc://management;
        #grpc_ssl_verify off;
        grpc_read_timeout 1d;
        grpc_send_timeout 1d;
        grpc_socket_keepalive on;
    }

    ssl_certificate /etc/nginx/certs/cert.crt;
    ssl_certificate_key /etc/nginx/certs/cert.key;
}

synfinatic commented 8 months ago

I doubt that it matters, but my device is also using a setup key and not SSO.

teoder commented 8 months ago

Mine does not come up at all (new client), so it's not the exact same scenario, but I'm thinking it has something to do with the Linux client and Setup Keys.

Windows clients seem to be working as normal, although they take a long time to connect, which I'm also investigating.

mlsmaycon commented 8 months ago

Hello @synfinatic @teoder, the context deadline error indicates a timeout when communicating with the management service. The client has a timeout of 5 seconds, which will be increased to 10 seconds in the next release (0.26.3).

Can you send the output from:

curl -o /dev/null -s -w "Time Connect: %{time_connect}s\nTime Start Transfer: %{time_starttransfer}s\nTotal Time: %{time_total}s\n" https://api.netbird.io/api/users

@teoder, replace https://api.netbird.io with your self-hosted management URL.

Also, can you please send us the logs from the daemon process? See https://docs.netbird.io/how-to/troubleshooting-client#getting-client-logs for reference.

synfinatic commented 8 months ago

$ curl -o /dev/null -s -w "Time Connect: %{time_connect}s\nTime Start Transfer: %{time_starttransfer}s\nTotal Time: %{time_total}s\n" https://api.netbird.io/api/users
Time Connect: 0.084451s
Time Start Transfer: 0.426194s
Total Time: 0.426519s

teoder commented 8 months ago

Hi @mlsmaycon, I have output from both Windows and Linux. The Windows client can connect with SSO but takes a while. The Linux client does not connect, and I don't get much output from the logs.

Windows_client.log

I haven't tried using SSO for Linux, only the Setup Key since it is supposed to be used as a routing-peer only.

From Windows Client (WSL):

# curl -o /dev/null -s -w "Time Connect: %{time_connect}s\nTime Start Transfer: %{time_starttransfer}s\nTotal Time: %{time_total}s\n" https:/secret.url.nu/api/users
Time Connect: 0.041514s
Time Start Transfer: 0.000000s
Total Time: 0.064417s

From non-working Linux box:

# curl -o /dev/null -s -w "Time Connect: %{time_connect}s\nTime Start Transfer: %{time_starttransfer}s\nTotal Time: %{time_total}s\n" https:/secret.url.nu/api/users
Time Connect: 0.015209s
Time Start Transfer: 0.000000s
Total Time: 0.067473s

The Linux client logs show the following. I tried netbird up/login and also just starting the service:

2024-03-05T11:58:48+01:00 INFO client/cmd/service_controller.go:24: starting Netbird service
2024-03-05T11:58:48+01:00 INFO client/cmd/service_controller.go:64: started daemon server: /var/run/netbird.sock
2024-03-05T11:58:48+01:00 INFO client/internal/connect.go:96: starting NetBird client version 0.26.2
2024-03-05T11:58:48+01:00 DEBG client/internal/connect.go:157: connecting to the Management service secret.url.nu:443
2024-03-05T11:58:53+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:58:55+01:00 DEBG client/internal/connect.go:157: connecting to the Management service secret.url.nu:443
2024-03-05T11:59:00+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:59:03+01:00 DEBG client/internal/connect.go:157: connecting to the Management service secret.url.nu:443
2024-03-05T11:59:08+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:59:09+01:00 DEBG client/internal/connect.go:157: connecting to the Management service secret.url.nu:443
2024-03-05T11:59:14+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:59:19+01:00 DEBG client/internal/connect.go:157: connecting to the Management service secret.url.nu:443
2024-03-05T11:59:24+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:59:30+01:00 DEBG client/internal/connect.go:157: connecting to the Management service secret.url.nu:443
2024-03-05T11:59:35+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:59:36+01:00 DEBG client/internal/login.go:93: connecting to the Management service https://secret.url.nu:443
2024-03-05T11:59:41+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:59:41+01:00 ERRO client/internal/login.go:96: failed connecting to the Management service https://secret.url.nu:443 context deadline exceeded
2024-03-05T11:59:41+01:00 ERRO client/server/server.go:139: failed login: context deadline exceeded
2024-03-05T11:59:41+01:00 DEBG client/internal/login.go:93: connecting to the Management service https://secret.url.nu:443
2024-03-05T11:59:46+01:00 ERRO management/client/grpc.go:64: failed creating connection to Management Service context deadline exceeded
2024-03-05T11:59:46+01:00 ERRO client/internal/login.go:96: failed connecting to the Management service https://secret.url.nu:443 context deadline exceeded
2024-03-05T11:59:46+01:00 ERRO client/server/server.go:139: failed login: context deadline exceeded
2024-03-05T11:59:48+01:00 DEBG client/internal/login.go:93: connecting to the Management service https://secret.url.nu:443
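
In both curl runs above, Time Start Transfer is 0.000000s, which usually means curl never received a response byte, and the -s flag hides the reason. A small sketch for surfacing it: rerun with -sS and look at the exit code. The helper names curl_exit_hint and probe are hypothetical, the hints cover only a few well-known curl exit codes, and the 5 second limit is chosen only to mirror the client timeout mentioned earlier.

```shell
#!/bin/sh
# Map a few well-known curl exit codes to a short hint.
curl_exit_hint() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host (check DNS / resolv.conf)" ;;
        7)  echo "failed to connect" ;;
        28) echo "operation timed out" ;;
        35) echo "TLS handshake failed" ;;
        60) echo "certificate not trusted by the local CA store" ;;
        *)  echo "curl exit code $1" ;;
    esac
}

# probe URL: like the timing command, but shows errors (-S) and the HTTP status.
probe() {
    curl -sS -o /dev/null --max-time 5 -w 'HTTP status: %{http_code}\n' "$1"
    rc=$?
    echo "curl: $(curl_exit_hint "$rc")"
    return "$rc"
}
```

For example, probe followed by the management URL prints the HTTP status (000 when no response arrived) and a one-line hint about why the request failed.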

teoder commented 8 months ago

@mlsmaycon So I think I've narrowed my problem down to the /api part of the nginx reverse proxy: that section is not getting any requests in the access log. Do you think this could be related to me using the same FQDN for the API and Management sections? I could always move this to another FQDN, such as api.domain.nu.

mlsmaycon commented 8 months ago

No, the management service is the API. Often we see some configs missing grpc_pass parameters for the management protocol.

Can you share your nginx configuration?
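
A quick way to check a config for the missing grpc_pass case described above is to list the location blocks that contain that directive. This is a naive sketch with hypothetical helper names: it assumes each location header sits on one line and that location blocks contain no nested braces, so it is not a real nginx parser.

```shell
#!/bin/sh
# Print the path of every "location" block that contains a grpc_pass directive.
grpc_locations() {
    awk '
        # Remember the path of the block we are entering ("location <path> {").
        $1 == "location"               { loc = ($NF == "{") ? $(NF - 1) : $NF; next }
        # Commented-out directives start with "#" and so never match here.
        $1 == "grpc_pass" && loc != "" { print loc }
        # A closing brace ends the current block.
        $1 == "}"                      { loc = "" }
    ' "$1"
}

# Succeed if the given location path proxies gRPC.
has_grpc_location() {
    grpc_locations "$1" | grep -Fxq -- "$2"
}
```

For a NetBird setup like the ones posted in this thread, the output should include /signalexchange.SignalExchange/ and /management.ManagementService/. If the management path is missing, the gRPC endpoint is not being proxied.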

teoder commented 8 months ago

@mlsmaycon Here it is:

upstream dashboard {
  server 127.0.0.1:8180;
  keepalive 10;
}

upstream signal {
  server 127.0.0.1:8100;
}

upstream management {
  server 127.0.0.1:8380;
}

server {
    listen 80;
    server_name test.url.com;

  # 301 redirect to HTTPS
    location / {
      return 301 https://$host$request_uri;
    }
}

server {
    # HTTPS server config
    listen 443 ssl http2;
    server_name test.url.com;

    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    # This is necessary so that grpc connections do not get closed early
    # see https://stackoverflow.com/a/67805465
    client_header_timeout 1d;
    client_body_timeout 1d;

    proxy_set_header        X-Real-IP $remote_addr;
    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header        X-Scheme $scheme;
    proxy_set_header        X-Forwarded-Proto https;
    proxy_set_header        X-Forwarded-Host $host;
    grpc_set_header         X-Forwarded-For $proxy_add_x_forwarded_for;

    # Proxy dashboard
    location / {
        proxy_pass http://dashboard;
    }
    # Proxy Signal
    location /signalexchange.SignalExchange/ {
        access_log /var/log/nginx/signal_logging.log upstream_logging;
        error_log /var/log/nginx/signal_error_logging.log;
        grpc_pass grpc://signal;
        grpc_read_timeout 1d;
        grpc_send_timeout 1d;
        grpc_socket_keepalive on;
    }
    # Proxy Management http endpoint
    location /api {
        proxy_pass http://management;
    }
    # Proxy Management grpc endpoint
    location /management.ManagementService/ {
        grpc_pass grpc://management;
        grpc_read_timeout 1d;
        grpc_send_timeout 1d;
        grpc_socket_keepalive on;
    }
    ssl_certificate /etc/nginx/certs/cert.crt;
    ssl_certificate_key /etc/nginx/certs/cert.key;
}

teoder commented 8 months ago

So, in my case the problem was caused by an untrusted Let's Encrypt root certificate on the Linux host. I added the certificates to /etc/ssl/certs/ca-certificates.crt and could then connect without issues.

I ran the client in the foreground with sudo bash -c 'GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info netbird up -F -l debug' and could see that there was a certificate issue.
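
For anyone hitting the same symptom, a certificate trust problem can also be checked without NetBird by asking openssl for the chain verification result. This is a minimal sketch with hypothetical helper names. It assumes openssl reads the same system CA store as the NetBird client, which may not hold on every system.

```shell
#!/bin/sh
# Extract the first "Verify return code" number from `openssl s_client`
# output read on stdin (the line can appear more than once).
verify_code() {
    sed -n 's/^ *Verify return code: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# check_tls HOST [PORT]: connect, send nothing, and print the verify code.
# 0 means the chain verified; any other value means it did not.
check_tls() {
    host="$1"
    port="${2:-443}"
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null | verify_code
}
```

Running check_tls against the management host should print 0 once the missing root certificate is trusted. A non-zero code points to a trust problem rather than a slow server.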

netbirdio / netbird

`netbird up` fails after `netbird down` #1618