gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: REST API doesn't restart properly #1082

Closed Doooooo0o closed 2 years ago

Doooooo0o commented 2 years ago

Useful note: quick way to reproduce this bug

What happened?

Using docker-compose:

version: "3.4"

services:
  netmaker:
    container_name: netmaker
    image: gravitl/netmaker:v0.13.1
    volumes:
      - /etc/netmaker/dnsconfig:/root/config/dnsconfig
      - /usr/bin/wg:/usr/bin/wg
      - /etc/netmaker/data:/root/data
      - /etc/netmaker/certs:/etc/netmaker/
    cap_add: 
      - NET_ADMIN
      - NET_RAW
      - SYS_MODULE
    sysctls:
      - net.ipv4.ip_forward=1
      - net.ipv4.conf.all.src_valid_mark=1
      - net.ipv6.conf.all.disable_ipv6=1
      - net.ipv6.conf.all.forwarding=0
    restart: always
    environment:
      SERVER_NAME: "api.mydomain"
      SERVER_HOST: "my_public_ip"
      SERVER_API_HOST: "my_public_ip"
      SERVER_API_CONN_STRING: "api.mydomain:443"
      COREDNS_ADDR: "my_public_ip"
      DNS_MODE: "on"
      SERVER_HTTP_HOST: "api.mydomain"
      API_PORT: "8081"
      HTTP_PORT: "8081"
      CLIENT_MODE: "on"
      MASTER_KEY: "my_token"
      CORS_ALLOWED_ORIGIN: "*"
      DISPLAY_KEYS: "on"
      DATABASE: "postgres"
      REST_BACKEND: "on"
      SQL_PORT: 5432
      SQL_DB: netmaker
      SQL_HOST: 172.17.0.1
      SQL_USER: netmaker
      SQL_PASS: my_sql_pass
      TELEMETRY: "off"
      NODE_ID: "my_node"
      MQ_HOST: "my_public_ip:8883"
      HOST_NETWORK: "off"
      VERBOSITY: "3"
      MANAGE_IPTABLES: "on"
    ports:
      - "51821-51830:51821-51830/udp"
      - "127.0.0.1:8081:8081"
      - "172.17.0.1:8081:8081"
      - "my_public_ip:8081:8081"
  netmaker-ui:
    container_name: netmaker-ui
    depends_on:
      - netmaker
    image: gravitl/netmaker-ui:v0.13.1
    links:
      - "netmaker:api"
    ports:
      - "127.0.0.1:8082:80"
    environment:
      BACKEND_URL: "https://api.mydomain"
      VERBOSITY: "3"
    restart: always
  coredns:
    depends_on:
      - netmaker 
    image: coredns/coredns
    command: -conf /root/dnsconfig/Corefile
    container_name: coredns
    restart: always
    volumes:
      - /etc/netmaker/dnsconfig:/root/dnsconfig:rw
    ports:
      - "172.17.0.1:53:53/udp"
      - "172.17.0.1:53:53"
      - "my_public_ip:53:53"
      - "my_public_ip:53:53/udp"
  mq:
    image: eclipse-mosquitto:2.0.11-openssl
    depends_on:
      - netmaker
    container_name: mq
    restart: unless-stopped
    ports:
      - "127.0.0.1:1883:1883"
      - "172.17.0.1:1883:1883"
      - "172.17.0.1:8883:8883"
    volumes:
      - /etc/netmaker/mosquitto/:/mosquitto/config/:rw
      - /etc/netmaker/certs/:/mosquitto/certs/:rw
      - /etc/netmaker/mosquitto_data/:/mosquitto/data:rw
      - /etc/netmaker/mosquitto_logs/:/mosquitto/log:rw

I run:

$ docker-compose -f docker-compose-postgres.yml up -d 
Creating netmaker ... done
Creating netmaker-ui ... done
Creating coredns     ... done
Creating mq          ... done

Netmaker's daemons sit behind an HAProxy instance running these configs:

frontend fe_mqtt_local
  mode tcp  
  bind my_public_ip:1883

  # Reject connections that have an invalid MQTT packet
  tcp-request content reject unless { req.payload(0,0),mqtt_is_valid }
  default_backend be_mqtt_local

backend be_mqtt_local
  mode tcp

  # Create a stick table for session persistence
  stick-table type string len 32 size 100k expire 30m

  # Use ClientID / client_identifier as persistence key
  stick on req.payload(0,0),mqtt_field_value(connect,client_identifier)
  server mosquitto1 172.17.0.1:1883 check fastinter 1s
frontend fe_mqtt
  mode tcp  
  bind my_public_ip:8883

  # Reject connections that have an invalid MQTT packet
  tcp-request content reject unless { req.payload(0,0),mqtt_is_valid }
  default_backend be_mqtt

backend be_mqtt
  mode tcp

  # Create a stick table for session persistence
  stick-table type string len 32 size 100k expire 30m

  # Use ClientID / client_identifier as persistence key
  stick on req.payload(0,0),mqtt_field_value(connect,client_identifier)

  server mosquitto1 172.17.0.1:8883 

On the https frontend:

        acl netmaker_dasboard                           hdr(host) -i d.my_domain
        acl netmaker_api                                hdr(host) -i api.my_domain
        acl netmaker_api_port                           hdr(host) -i api.my_domain:443
        use_backend netmaker_backend                    if netmaker_dasboard
        use_backend netmaker_api_backend                if netmaker_api
        use_backend netmaker_api_backend                if netmaker_api_port
backend netmaker_backend
        mode http
        server netmaker_dashboard 127.0.0.1:8082 check

backend netmaker_api_backend
        mode http
        server netmaker_api 127.0.0.1:8081 check

Everything goes OK. I register my clients, and they are able to see and ping each other:

 $ netclient join -t token=
[netclient] 2022-05-12 14:00:22 joining netmaker at api.domain:443 
[netclient] 2022-05-12 14:00:23 starting wireguard 
[netclient] 2022-05-12 14:00:25 certificates/key saved  
[netclient] 2022-05-12 14:00:55 unable to connect to broker, retrying ... 
[netclient] 2022-05-12 14:00:55 could not connect to broker api.domain connect timeout 
[netclient] 2022-05-12 14:00:55 connection issue detected.. attempt connection with new certs 
[netclient] 2022-05-12 14:00:55 certificates/key saved  
[netclient] 2022-05-12 14:01:27 could not connect to broker at api.domain:8883 
[netclient] 2022-05-12 14:01:27 sent a node update to server for node servername ,  5ea9e7e0-4bdb-4957-849f-f877635e81a3 

 $ systemctl restart netclient ; netclient pull ; ping 192.168.15.2
[netclient] 2022-05-12 14:01:54 No network selected. Running Pull for all networks. 
[netclient] 2022-05-12 14:01:56 certificates/key saved  
PING 192.168.15.2 (192.168.15.2) 56(84) bytes of data.
64 bytes from 192.168.15.2: icmp_seq=1 ttl=64 time=42.1 ms
64 bytes from 192.168.15.2: icmp_seq=2 ttl=64 time=42.8 ms
^C
--- 192.168.15.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 42.102/42.457/42.813/0.410 ms

Then, I want to run a recovery test. So:

$ docker rm -f netmaker
netmaker

followed by:

$ docker-compose -f docker-compose-postgres.yml up -d netmaker
Creating netmaker ... done

This is where my issue starts:

$ docker exec -ti netmaker ss -lntpuae | rg -i listen
tcp   LISTEN 0      4096      127.0.0.11:33343        0.0.0.0:*    ino:30075388 sk:13e cgroup:unreachable:1 <->

Netmaker properly starts its WireGuard client, BUT it doesn't start the REST server needed to access Netmaker's UI. I have no relevant log line about this, even though VERBOSITY: "3" is set in my docker-compose file.
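
A quick way to confirm the symptom from the Docker host (a sketch, assuming curl is installed there; 127.0.0.1:8081 is one of the published binds in the compose file above):

$ curl -v http://127.0.0.1:8081/ -o /dev/null
# Any HTTP status, even 401/404, would prove the REST server is listening;
# "connection refused" matches the ss output above.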

Version

v0.13.1

What OS are you using?

Linux

Relevant log output

$ docker-compose logs -f netmaker
Attaching to netmaker
netmaker       | 
netmaker       |     ______     ______     ______     __   __   __     ______   __                        
netmaker       |    /\  ___\   /\  == \   /\  __ \   /\ \ / /  /\ \   /\__  _\ /\ \                       
netmaker       |    \ \ \__ \  \ \  __<   \ \  __ \  \ \ \'/   \ \ \  \/_/\ \/ \ \ \____                  
netmaker       |     \ \_____\  \ \_\ \_\  \ \_\ \_\  \ \__|    \ \_\    \ \_\  \ \_____\                 
netmaker       |      \/_____/   \/_/ /_/   \/_/\/_/   \/_/      \/_/     \/_/   \/_____/                 
netmaker       |                                                                                          
netmaker       |  __   __     ______     ______   __    __     ______     __  __     ______     ______    
netmaker       | /\ "-.\ \   /\  ___\   /\__  _\ /\ "-./  \   /\  __ \   /\ \/ /    /\  ___\   /\  == \   
netmaker       | \ \ \-.  \  \ \  __\   \/_/\ \/ \ \ \-./\ \  \ \  __ \  \ \  _"-.  \ \  __\   \ \  __<   
netmaker       |  \ \_\\"\_\  \ \_____\    \ \_\  \ \_\ \ \_\  \ \_\ \_\  \ \_\ \_\  \ \_____\  \ \_\ \_\ 
netmaker       |   \/_/ \/_/   \/_____/     \/_/   \/_/  \/_/   \/_/\/_/   \/_/\/_/   \/_____/   \/_/ /_/ 
netmaker       |                                                                                                             
netmaker       | 
netmaker       | [netmaker] 2022-05-12 12:03:28 connecting to postgres 
netmaker       | [netmaker] 2022-05-12 12:03:28 database successfully connected 
netmaker       | [netmaker] 2022-05-12 12:03:28 no OAuth provider found or not configured, continuing without OAuth 
netmaker       | [netmaker] 2022-05-12 12:03:28 could not set peers on network netmaker : file does not exist 
netmaker       | [netmaker] 2022-05-12 12:03:28 setting kernel device nm-netmaker 
netmaker       | [netmaker] 2022-05-12 12:03:28 adding address: 192.168.15.254 
netmaker       | [netmaker] 2022-05-12 12:03:28 finished setting wg config on server netmaker-1 
netmaker       | [netmaker] 2022-05-12 12:03:28 setting iptables forward policy 
netmaker       | [netmaker] 2022-05-12 12:03:28 checking keys and certificates 
netmaker       | [netmaker] 2022-05-12 12:03:28 publishing node update to servername
Doooooo0o commented 2 years ago

I tried to use sqlite, to see if it was an issue w/ postgres: the issue also exists with sqlite.

mattkasun commented 2 years ago

I ran your steps with a standard caddy install and the rest api is available.

root@server:~()# docker exec -it netmaker netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.11:38067        0.0.0.0:*               LISTEN      -
tcp        0      0 :::8081                 :::*                    LISTEN      1/netmaker

It would be interesting to see what happens in your setup if you kill the netmaker-ui container: is the dashboard available after the restart?

Doooooo0o commented 2 years ago

@mattkasun the dashboard (container netmaker-ui) is up and running, BUT since netmaker-ui hits Netmaker's API on the netmaker container, and there is no socket bound on 8081 inside that container, the dashboard is not usable. What test do you think I should run?
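
For what it's worth, a check that could be run from the UI container toward the API (a sketch; it assumes BusyBox wget is available in the netmaker-ui image and uses the api alias created by the links: entry in the compose file):

$ docker exec -it netmaker-ui wget -O /dev/null http://api:8081/
# "Connection refused" here would confirm nothing is bound on 8081 in the
# netmaker container; any HTTP response would mean the REST API is back.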

mattkasun commented 2 years ago

docker exec -it netmaker-ui netstat -ntlp

Doooooo0o commented 2 years ago
$ docker exec -it netmaker-ui netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      10/nginx: master pr
tcp        0      0 127.0.0.11:46111        0.0.0.0:*               LISTEN      -
Doooooo0o commented 2 years ago

> I ran your steps with standard caddy install and rest api is available.
>
> root@server:~()# docker exec -it netmaker netstat -ntlp
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
> tcp        0      0 127.0.0.11:38067        0.0.0.0:*               LISTEN      -
> tcp        0      0 :::8081                 :::*                    LISTEN      1/netmaker
>
> Have you tried to add a network and peers before deleting netmaker container?

I re-tried from scratch:

$ netclient list|jq '.networks[].peers'|rg -v public
[
  {
    "addresses": [
      {
        "cidr": "192.168.15.1/32",
        "ip": "192.168.15.1"
      }
    ]
  },
  {
    "addresses": [
      {
        "cidr": "192.168.15.254/32",
        "ip": "192.168.15.254"
      }
    ]
  }
]

My peers are properly registered. The issue occurs after deleting the netmaker container once some peers have been created. NB: the netclient here is 192.168.15.2 and is reachable from both .1 and .254 before the netmaker API container is deleted.

mattkasun commented 2 years ago

I don't understand the need for multiple binds of port 8081 for the netmaker container:

Doooooo0o commented 2 years ago

The first two were for production purposes: I don't like having ports open for no reason when they can sit properly behind HAProxy.
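
As an illustration, the ports: section of the netmaker service could keep only the HAProxy-facing binds (a sketch based on the compose file above; the public-IP bind is dropped because HAProxy already fronts the API on 443):

  netmaker:
    ports:
      - "51821-51830:51821-51830/udp"
      - "127.0.0.1:8081:8081"   # HAProxy backend netmaker_api points here
      - "172.17.0.1:8081:8081"  # docker bridge gateway bind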

I think I found something about this issue: it appears that the Netmaker API server needs to reach mq at bootstrap time, and there is no log about this during the startup process. I replaced my public IP address in MQ_HOST: with "mq" and it fixed the issue.
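
For reference, a minimal sketch of that fix against the compose file above (assuming the whole MQ_HOST value becomes the service name; on the default compose network, Docker's embedded DNS resolves mq to the broker container directly, without going out through the public IP and HAProxy):

  netmaker:
    environment:
      # was: MQ_HOST: "my_public_ip:8883"
      MQ_HOST: "mq"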

So, 2 things here:

mattkasun commented 2 years ago

Regarding your questions:

  1. yes, if you are using caddy. There are some users using other reverse proxies (haproxy, traefik) that support proxying mqtt traffic.
  2. additional logging would not have helped ... the main goroutine was hung waiting for a connection to the misconfigured mq broker. v0.14.0 adds a timeout so the connection won't hang anymore (see the sketch below).
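
Not Netmaker's actual code, but a minimal sketch of that timeout pattern with the Eclipse Paho Go MQTT client, using a placeholder broker address, so a connect attempt to an unreachable broker fails fast instead of blocking its goroutine forever:

package main

import (
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Placeholder broker address standing in for MQ_HOST in this issue.
	opts := mqtt.NewClientOptions().AddBroker("ssl://broker.example.com:8883")
	opts.SetConnectTimeout(10 * time.Second) // bounds the TCP/TLS dial itself

	client := mqtt.NewClient(opts)

	// Connect() returns a Token. token.Wait() would block indefinitely;
	// WaitTimeout() returns after the given duration even if the broker
	// never answers, so the calling goroutine cannot hang forever.
	token := client.Connect()
	if !token.WaitTimeout(30 * time.Second) {
		log.Fatal("timed out waiting for broker connection")
	}
	if err := token.Error(); err != nil {
		log.Fatalf("broker connection failed: %v", err)
	}
	log.Println("connected to broker")
}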