vitobotta / hetzner-k3s

The easiest and fastest way to create and manage Kubernetes clusters in Hetzner Cloud using the lightweight distribution k3s by Rancher.

k3s cluster and nat gateway #454

Open · mertcangokgoz opened 2 months ago

mertcangokgoz commented 2 months ago

I am currently using a NAT gateway in my project. I need k3s, and I want my cluster to communicate only over private IPs, without any public IP addresses. I am using the debian-12 image for the cluster.

With this configuration I expect the machines to reach the internet and the pods to come up. However, during the installation I get output like the following, so I don't think the installation is completing properly.

[screenshot: installation output]
vitobotta commented 2 months ago

Hi, do you see the server(s) attached to the main-vpc-network network in the Hetzner Console? If yes, do they get an IP in that network?

vitobotta commented 2 months ago

Please SSH into one of the servers attached to the network and run

SUBNET="10.13.0.0/16"
SUBNET_PREFIX=$(echo $SUBNET | cut -d'/' -f1 | sed 's/\./\\./g' | sed 's/0$//')

echo $SUBNET_PREFIX 

Does it return the correct prefix?

Then run

ip -4 addr show | grep -q "inet $SUBNET_PREFIX" 

What does it return?
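Since grep -q suppresses output and only sets the exit status (0 on a match), the command above prints nothing either way. A quick sketch to make the result visible, mirroring the check the install script runs in a loop (see the cloud-init file pasted later in this thread):

if ip -4 addr show | grep -q "inet $SUBNET_PREFIX"; then
  echo "private IP in subnet $SUBNET is up"
else
  echo "no private IP in subnet $SUBNET yet"
fi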

vitobotta commented 2 months ago

My gut feeling is that there is something wrong with your post_create_commands.

Attach a temporary server to the same network, then SSH into it and, using /bin/sh rather than bash (since the cloud-init script must work in a plain sh shell), try running your post_create_commands to see if all of them work fine.

vitobotta commented 2 months ago

What do you get with ip -4 addr show?

vitobotta commented 2 months ago

Can you try ip -4 addr show | grep "inet $SUBNET_PREFIX" without the -q? I'm trying to replicate what happens during the installation.

mertcangokgoz commented 2 months ago

@vitobotta

I changed the subnet and the problem disappeared (I don't know if it has something to do with how I split the subnets). Of course, I haven't included the post_create_commands yet, but I get the situation below. Is this coming from SSH?

[Instance blackhole-k3s-cluster-pool-small-static-pool-worker3] Waiting for successful ssh connectivity with instance blackhole-k3s-cluster-pool-small-static-pool-worker3...
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker2] Waiting for successful ssh connectivity with instance blackhole-k3s-cluster-pool-small-static-pool-worker2...
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker1] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker1 is now up.
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker1] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker1 created
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker3] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker3 is now up.
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker3] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker3 created
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker2] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker2 is now up.
[Instance blackhole-k3s-cluster-pool-small-static-pool-worker2] ...instance blackhole-k3s-cluster-pool-small-static-pool-worker2 created
Unhandled exception in spawn: timeout after 00:00:30 (Tasker::Timeout)
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in 'raise<Tasker::Timeout>:NoReturn'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in 'Tasker@Tasker::Methods::timeout<Time::Span, &Proc(Nil)>:Nil'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in '~procProc(Nil)@src/cluster/create.cr:75'
  from /opt/homebrew/Cellar/hetzner_k3s/2.0.8/bin/hetzner-k3s in 'Fiber#run:(IO::FileDescriptor | Nil)'
vitobotta commented 2 months ago

Yeah that may be a problem with SSH, perhaps with the key. Can you try enabling the agent?

mertcangokgoz commented 2 months ago

> Yeah that may be a problem with SSH, perhaps with the key. Can you try enabling the agent?

Are you talking about the use_agent setting? But the key I created has no passphrase.
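For reference, use_agent is set in the cluster config file; a minimal sketch assuming the v2 config layout (key paths are placeholders):

networking:
  ssh:
    port: 22
    use_agent: true
    public_key_path: "~/.ssh/id_ed25519.pub"
    private_key_path: "~/.ssh/id_ed25519"

With use_agent: true, hetzner-k3s authenticates through the local ssh-agent instead of reading the key file directly, which matters mostly when the key has a passphrase.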

vitobotta commented 2 months ago

Another possibility is some issue with Debian, due to recent changes made to handle the new way newer versions of Ubuntu deal with custom SSH ports. Can you try Ubuntu with the same configuration, to see if that's the problem?

mertcangokgoz commented 2 months ago
[screenshot]

Thank you very much for your help. I have one last question: all of the machines have internet access and my configuration is correct, but the following warning appears. Is this normal for autoscaling?

[screenshot: warning message]

Apart from this, I get the following warning. Yes, the cluster is configured so that pods cannot be scheduled on the master nodes, but Hetzner did not create the pods for csi-controller etc. on the other nodes; it spun up 3 new machines instead.

Is this a normal process?

vitobotta commented 2 months ago

It's not a warning :) It's just telling you that some pods were probably pending due to lack of resources, so the cluster had to scale up. Did it add a new node?
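A quick way to check whether pending pods triggered a scale-up (generic kubectl commands, not specific to this setup):

# Pods stuck in Pending are what the autoscaler reacts to
kubectl get pods -A --field-selector=status.phase=Pending

# The cluster-autoscaler records its scale-up decisions as events on those pods
kubectl get events -A --field-selector reason=TriggeredScaleUp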

mertcangokgoz commented 2 months ago
[screenshot]

Yes, it added 3 nodes, but I can't see the added ones with the kubectl get nodes command; it still shows the 3 masters and 3 workers I already had.

vitobotta commented 2 months ago

What do you see in the autoscaler's logs?
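For anyone following along: assuming the autoscaler runs as the usual cluster-autoscaler deployment in kube-system (which is how hetzner-k3s deploys it by default), the logs can be pulled with:

kubectl -n kube-system logs deployment/cluster-autoscaler --tail=100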

mertcangokgoz commented 1 month ago

I managed to get it running properly. I think I will write a short article about it on my blog.

Thank you very much for your help.

I just want to ask one very small question:

  private_network:
    enabled: true
    subnet: 10.14.3.0/24
    existing_network_name: 'main-vpc-network'

Even though I configure it like this, why would it be receiving an IP from 10.14.1.0/24?

vitobotta commented 1 month ago

Can you share the solution for posterity?

Can you also clarify the question? :p

mertcangokgoz commented 1 month ago

The autoscaler stopped working even though I made no changes.

1. I see the machine turn on in the Hetzner Cloud panel.
2. I see it get a private IP address via DHCP.
3. It seems to start running the installation.

There is nothing after that. My hands are tied because I don't have SSH access, so I can't see the logs. It's as if the installation never completes properly; the node isn't even joined to the cluster.

The machine has only a private IP address behind the NAT gateway. Routing is fully set up, so there is no problem there either; I configured it according to the documentation.

How can I debug this situation?
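For anyone hitting the same wall: once a shell on the node is available (for example through the Hetzner Cloud web console, since there is no SSH), the cloud-init logs are the first place to look. A generic checklist, not specific to this thread:

cloud-init status --long                     # did cloud-init finish, and with what result?
tail -n 100 /var/log/cloud-init-output.log   # output of the runcmd steps, including the k3s install
tail -n 100 /var/log/hetzner-k3s.log         # progress messages the install script writes (see the file below)
journalctl -u k3s-agent --no-pager           # k3s agent logs, if the install got that far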

mertcangokgoz commented 1 month ago

I finally managed to solve the problem. Because the machines have no public IP, the installation was left incomplete due to both routing and DNS problems.

I don't know how this happened, but I solved it by manually editing the cloud-init config.

On machines behind a NAT gateway, the route and DNS configuration needs to run before everything else. But even if we add the post_create_commands at the top, they run at the bottom. https://github.com/vitobotta/hetzner-k3s/blob/60b862b3105d6a7362f5754ee83b5f91a2014984/templates/cloud_init.yaml#L35 I noticed that the configuration we add here is not placed at the top.

vitobotta commented 1 month ago

I am sorry, but I am not following. Can you clarify what exactly fixed your problem and what changes you needed to make to hetzner-k3s to solve it? I could make a new release with your fixes, or you could make a PR if you are up to it. :)

mertcangokgoz commented 1 month ago

In a k8s setup where there is no public network, the following has to be done.

1. The network settings have to be applied and the NAT gateway route configured (a verification sketch follows after the full cloud-init file below):

  # Add a network interface config that routes traffic via the NAT gateway
  - |
    cat <<'EOF' >> /etc/systemd/network/10-enp7s0.network
    [Match]
    Name=enp7s0

    [Network]
    DHCP=yes
    Gateway=10.144.0.1
    EOF
  # reload networkd
  - systemctl restart systemd-networkd
  # Configure systemd-resolved
  - systemctl enable systemd-resolved
  - systemctl start systemd-resolved
  # Set DNS
  - |
    cat <<'EOF' >> /etc/systemd/resolved.conf
    [Resolve]
    Cache=yes
    DNS=185.12.64.1 185.12.64.2
    FallbackDNS=1.1.1.1
    EOF
  - systemctl daemon-reload
  - systemctl restart systemd-resolved

2. Packages must not be installed with the packages: directive; they should be installed from runcmd immediately after the network settings, once connectivity through the gateway is up.

So, with public IPv4 and IPv6 completely disabled, the whole cloud-init file has to look like this:

#cloud-config
preserve_hostname: true

write_files:

- path: /etc/systemd/system/ssh.socket.d/listen.conf
  content: |
    [Socket]
    ListenStream=
    ListenStream=22

- path: /etc/configure-ssh.sh
  permissions: '0755'
  content: |
    if systemctl is-active ssh.socket > /dev/null 2>&1
    then
      # OpenSSH is using socket activation
      systemctl disable ssh
      systemctl daemon-reload
      systemctl restart ssh.socket
      systemctl stop ssh
    else
      # OpenSSH is not using socket activation
      sed -i 's/^#*Port .*/Port 22/' /etc/ssh/sshd_config
    fi
    systemctl restart ssh

runcmd:
- hostnamectl set-hostname $(curl http://169.254.169.254/hetzner/v1/metadata/hostname)
- update-crypto-policies --set DEFAULT:SHA1 || true
- /etc/configure-ssh.sh
- |
  cat <<'EOF' >> /etc/systemd/network/10-enp7s0.network
  [Match]
  Name=enp7s0

  [Network]
  DHCP=yes
  Gateway=10.144.0.1
  EOF
# reload networkd
- systemctl restart systemd-networkd
# Configure systemd-resolved
- systemctl enable systemd-resolved
- systemctl start systemd-resolved
# Set DNS
- |
  cat <<'EOF' >> /etc/systemd/resolved.conf
  [Resolve]
  Cache=yes
  DNS=185.12.64.1 185.12.64.2
  FallbackDNS=1.1.1.1
  EOF
- systemctl daemon-reload
- systemctl restart systemd-resolved
- apt-get update && apt-get install -y ifupdown net-tools
- echo "nameserver 8.8.8.8" > /etc/k8s-resolv.conf
- |
    touch /etc/initialized

    HOSTNAME=$(hostname -f)
    PUBLIC_IP=$(hostname -I | awk '{print $1}')

    if [ "true" = "true" ]; then
      echo "Using private network " > /var/log/hetzner-k3s.log
      SUBNET="10.144.1.0/24"
      SUBNET_PREFIX=$(echo $SUBNET | cut -d'/' -f1 | sed 's/\./\\./g' | sed 's/0$//')
      MAX_ATTEMPTS=30
      DELAY=10
      UP="false"

      for i in $(seq 1 $MAX_ATTEMPTS); do
        if ip -4 addr show | grep -q "inet $SUBNET_PREFIX"; then
          echo "Private network IP in subnet $SUBNET is up" 2>&1 | tee -a /var/log/hetzner-k3s.log
          UP="true"
          break
        fi
        echo "Waiting for private network IP in subnet $SUBNET to be available... (Attempt $i/$MAX_ATTEMPTS)" 2>&1 | tee -a /var/log/hetzner-k3s.log
        sleep $DELAY
      done

      if [ "$UP" = "false" ]; then
        echo "Timeout waiting for private network IP in subnet $SUBNET" 2>&1 | tee -a /var/log/hetzner-k3s.log
      fi

      PRIVATE_IP=$(ip route get 10.144.1.0 | awk -F"src " 'NR==1{split($2,a," ");print a[1]}')
      NETWORK_INTERFACE=" --flannel-iface=$(ip route get 10.144.1.0 | awk -F"dev " 'NR==1{split($2,a," ");print a[1]}') "
    else
      echo "Using public network " > /var/log/hetzner-k3s.log
      PRIVATE_IP="${PUBLIC_IP}"
      NETWORK_INTERFACE=" "
    fi

    mkdir -p /etc/rancher/k3s

    cat > /etc/rancher/k3s/registries.yaml <<EOF
    mirrors:
      "*":
    EOF

    curl -sfL https://get.k3s.io | K3S_TOKEN="REDACTED" INSTALL_K3S_VERSION="v1.31.1+k3s1" K3S_URL=https://10.144.1.16:6443 INSTALL_K3S_EXEC="agent \
    --node-name=$HOSTNAME  --kubelet-arg "cloud-provider=external"  --kubelet-arg "resolv-conf=/etc/k8s-resolv.conf"  \
    --node-ip=$PRIVATE_IP \
    --node-external-ip=$PUBLIC_IP \
    $NETWORK_INTERFACE " sh -

    echo true > /etc/initialized
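
Once the machine boots, it's worth verifying that the route and DNS from step 1 actually took effect before suspecting k3s. A hedged check, using the gateway and DNS servers from the snippets above:

ip route show default                      # should show: default via 10.144.0.1 dev enp7s0
resolvectl status | head                   # should list 185.12.64.1 / 185.12.64.2
curl -sI https://get.k3s.io | head -n 1    # confirms outbound HTTPS through the NAT gateway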

Unfortunately, I cannot contribute to the project myself because I don't know Crystal, the language it is written in :)

vitobotta commented 1 month ago

Thanks for clarifying! I see what you mean now. I will do some testing and see if I can release some changes that might help with this kind of setup in the next release.

dyipon commented 1 month ago

I can confirm that the solution functions correctly when public IP addresses are completely disabled. While the process is slow, taking around 6-7 minutes to create a small cluster, it works as expected. I tested this without modifying the cloud-init configuration, using only the post_create_commands. Thank you @mertcangokgoz