techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Unstable VIP PING & K3S too new for Rancher #47

Closed. teknowill closed this issue 2 years ago.

teknowill commented 2 years ago

This link gets a 404, but I did work through the k3s troubleshooting checklist: https://github.com/techno-tim/k3s-ansible/discussions/20

I was able to get this working with an older release; with the most current release I get:

unstable ping to the VIP IP, and K3s needs to be < v1.24 for Rancher

Expected Behavior

Pinging the VIP endpoint should be stable, and Helm should be able to deploy Rancher with the documented commands.

Current Behavior

You can only ping the VIP/API IP intermittently, so kubectl get nodes and Helm deployments are hit or miss. The Rancher deployment fails because K3s is on 1.24 or newer.

Steps to Reproduce

  1. Deploy with the code base as of the May 26, 2022 commits to all.yml: 3 etcd control-plane nodes and 5 workers as Proxmox VMs (across three physical nodes).
  2. Deploy Longhorn in the default namespace, shared with the workers (just learning the process).
  3. Deploy Minecraft; everything was stable for weeks, though it kept putting too many pods on one node.
  4. Back up, then take down Minecraft and Longhorn.
  5. Reset.
  6. git pull on 8/1/22 and use the latest code, which has changes to all.yml, main.yml, the MetalLB ConfigMap/IPAddressPool, the MetalLB YAMLs, the kube-vip RBAC, and the kube-vip YAML.
  7. Add 3 VMs dedicated to Longhorn; verify Ansible can apt update, install the Proxmox guest agent, and SSH without a password.
  8. Add the 3 new IPs under [node] (3 control, 5 workers, 3 workers for Longhorn only).
  9. Deploy with the 8/1/22 version (the playbook commands I used are sketched just below this list).
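
The deploy/reset steps above are just the stock playbook runs from the repo; the inventory path here is my local layout, so adjust it to wherever your host.ini actually lives:

ansible-playbook site.yml -i inventory/my-cluster/hosts.ini     # deploy (steps 1 and 9)
ansible-playbook reset.yml -i inventory/my-cluster/hosts.ini    # reset (step 5)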

Context (variables)

Operating system: Ubuntu 22.04

Hardware: 2x dual-Xeon nodes (48 threads, 256 GB RAM), 1x 1-liter node with a 10th-gen i5 and 64 GB RAM

Variables Used:

I didn't alter these, other than adding my own token and IPs; they are as listed in the repo.

all.yml

k3s_version: "1.24"
ansible_user: NA
systemd_dir: ""

flannel_iface: ""

apiserver_endpoint: ""

k3s_token: "NA"

extra_server_args: ""
extra_agent_args: ""

kube_vip_tag_version: ""

metal_lb_speaker_tag_version: ""
metal_lb_controller_tag_version: ""

metal_lb_ip_range: ""

Hosts

host.ini

[master]
IP.ADDRESS.ONE
IP.ADDRESS.TWO
IP.ADDRESS.THREE

[node]
IP.ADDRESS.FOUR
IP.ADDRESS.FIVE

[k3s_cluster:children]
master
node

Possible Solution

It seems to deploy OK; something just isn't quite right with access to the VIP/API IP. From what I can tell, the version changes are very particular.

I tried taking various nodes offline one at a time; this didn't really help in any repeatable way, so I don't think it's any one node/VM or its physical networking.

I tried just going to an older k3s, which might have helped Rancher, but it didn't help kube-vip/MetalLB. I'll try to see if I can find a kube-vip or MetalLB log (but I don't really know kube-vip or MetalLB); the commands I plan to try are below.
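
These are plain kubectl log checks; the daemonset/deployment names are what kube-vip and MetalLB normally deploy as, so they may differ slightly from what the playbook creates:

kubectl logs -n kube-system daemonset/kube-vip-ds --all-containers      # kube-vip leader election / ARP announcements
kubectl logs -n metallb-system deployment/controller                    # MetalLB address allocation decisions
kubectl logs -n metallb-system daemonset/speaker --all-containers       # MetalLB L2 announcements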

For now, I'm going to try reverting the pull and deploying with the older stack.

teknowill commented 2 years ago

Kept the newest code, but went back to an all.yml with:

k3s_version: v1.23.4+k3s1

# this is the user that has ssh access to these machines
ansible_user: ----
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "America/New_York"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.1.220"

# k3s_token is required so masters can talk together securely
# this token should be alpha numeric only
k3s_token: "some-SUPER-DEDEUPER-secret-password"

# change these to your liking, the only required one is --no-deploy servicelb
extra_server_args: "--no-deploy servicelb --no-deploy traefik"
extra_agent_args: ""

# image tag for kube-vip
kube_vip_tag_version: "v0.4.4"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.12.1"
metal_lb_controller_tag_version: "v0.12.1"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.1.221-192.168.1.239"


The API IP and kubectl get nodes are now stable, and Rancher will deploy.

However, after

kubectl expose deployment rancher -n cattle-system --type=LoadBalancer --name=rancher-lb --port=443

the external IP is pending forever.
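
Things I'm checking while it stays pending (generic kubectl, nothing specific to this repo):

kubectl describe svc rancher-lb -n cattle-system        # the Events section shows whether MetalLB tried to assign an address
kubectl logs -n metallb-system deployment/controller    # the controller logs why it can't allocate one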

teknowill commented 2 years ago

kubectl get all -A

NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE
cattle-fleet-local-system   pod/fleet-agent-699b5fb945-nsxnx   1/1   Running   0   13m
cattle-fleet-system   pod/fleet-controller-784d6fbcd8-hngpn   1/1   Running   0   14m
cattle-fleet-system   pod/gitjob-6b977748fc-7rsh8   1/1   Running   0   14m
cattle-system   pod/helm-operation-7vpxj   0/2   Completed   0   14m
cattle-system   pod/helm-operation-8m924   0/2   Completed   0   13m
cattle-system   pod/helm-operation-fcx2g   0/2   Completed   0   15m
cattle-system   pod/helm-operation-v89kg   0/2   Completed   0   14m
cattle-system   pod/rancher-7fd65d9cd6-2f5vv   1/1   Running   0   17m
cattle-system   pod/rancher-7fd65d9cd6-qlqp7   1/1   Running   0   17m
cattle-system   pod/rancher-7fd65d9cd6-slnqr   1/1   Running   0   17m
cattle-system   pod/rancher-webhook-5b65595df9-l5b7x   1/1   Running   0   13m
cert-manager   pod/cert-manager-76d44b459c-kdzqv   1/1   Running   0   18m
cert-manager   pod/cert-manager-cainjector-9b679cc6-wp959   1/1   Running   0   18m
cert-manager   pod/cert-manager-webhook-57c994b6b9-tqgnv   1/1   Running   0   18m
kube-system   pod/coredns-5789895cd-cbzzm   1/1   Running   0   56m
kube-system   pod/kube-vip-ds-4lntv   1/1   Running   0   56m
kube-system   pod/kube-vip-ds-j65z7   1/1   Running   2 (16m ago)   56m
kube-system   pod/kube-vip-ds-m8vcq   1/1   Running   0   56m
kube-system   pod/local-path-provisioner-6c79684f77-zk8jj   1/1   Running   0   56m
kube-system   pod/metrics-server-7cd5fcb6b7-czzjp   1/1   Running   0   56m
metallb-system   pod/controller-74df79bb55-qvldk   1/1   Running   0   56m
metallb-system   pod/speaker-28fk6   1/1   Running   0   53m
metallb-system   pod/speaker-2mhzf   1/1   Running   0   56m
metallb-system   pod/speaker-8zwrg   1/1   Running   0   53m
metallb-system   pod/speaker-96mb5   1/1   Running   0   53m
metallb-system   pod/speaker-bmhpn   1/1   Running   0   56m
metallb-system   pod/speaker-jggcr   1/1   Running   0   53m
metallb-system   pod/speaker-mr7mc   1/1   Running   0   53m
metallb-system   pod/speaker-rb8dp   1/1   Running   0   53m
metallb-system   pod/speaker-rkktx   1/1   Running   0   53m
metallb-system   pod/speaker-t89s6   1/1   Running   0   53m
metallb-system   pod/speaker-v7vss   1/1   Running   0   56m

NAMESPACE   NAME   TYPE   CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
cattle-fleet-system   service/gitjob   ClusterIP   10.43.59.181   <none>   80/TCP   14m
cattle-system   service/rancher   ClusterIP   10.43.230.173   <none>   80/TCP,443/TCP   17m
cattle-system   service/rancher-lb   LoadBalancer   10.43.192.203   <pending>   443:31202/TCP   11m
cattle-system   service/rancher-webhook   ClusterIP   10.43.112.129   <none>   443/TCP   13m
cattle-system   service/webhook-service   ClusterIP   10.43.111.116   <none>   443/TCP   13m
cert-manager   service/cert-manager   ClusterIP   10.43.247.67   <none>   9402/TCP   18m
cert-manager   service/cert-manager-webhook   ClusterIP   10.43.170.208   <none>   443/TCP   18m
default   service/kubernetes   ClusterIP   10.43.0.1   <none>   443/TCP   57m
kube-system   service/kube-dns   ClusterIP   10.43.0.10   <none>   53/UDP,53/TCP,9153/TCP   56m
kube-system   service/metrics-server   ClusterIP   10.43.76.55   <none>   443/TCP   56m
metallb-system   service/webhook-service   ClusterIP   10.43.125.245   <none>   443/TCP   56m

NAMESPACE   NAME   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/kube-vip-ds   3   3   3   3   3   <none>   56m
metallb-system   daemonset.apps/speaker   11   11   11   11   11   kubernetes.io/os=linux   56m

NAMESPACE   NAME   READY   UP-TO-DATE   AVAILABLE   AGE
cattle-fleet-local-system   deployment.apps/fleet-agent   1/1   1   1   13m
cattle-fleet-system   deployment.apps/fleet-controller   1/1   1   1   14m
cattle-fleet-system   deployment.apps/gitjob   1/1   1   1   14m
cattle-system   deployment.apps/rancher   3/3   3   3   17m
cattle-system   deployment.apps/rancher-webhook   1/1   1   1   13m
cert-manager   deployment.apps/cert-manager   1/1   1   1   18m
cert-manager   deployment.apps/cert-manager-cainjector   1/1   1   1   18m
cert-manager   deployment.apps/cert-manager-webhook   1/1   1   1   18m
kube-system   deployment.apps/coredns   1/1   1   1   56m
kube-system   deployment.apps/local-path-provisioner   1/1   1   1   56m
kube-system   deployment.apps/metrics-server   1/1   1   1   56m
metallb-system   deployment.apps/controller   1/1   1   1   56m

NAMESPACE   NAME   DESIRED   CURRENT   READY   AGE
cattle-fleet-local-system   replicaset.apps/fleet-agent-699b5fb945   1   1   1   13m
cattle-fleet-local-system   replicaset.apps/fleet-agent-86b78d86bf   0   0   0   13m
cattle-fleet-system   replicaset.apps/fleet-controller-784d6fbcd8   1   1   1   14m
cattle-fleet-system   replicaset.apps/gitjob-6b977748fc   1   1   1   14m
cattle-system   replicaset.apps/rancher-7fd65d9cd6   3   3   3   17m
cattle-system   replicaset.apps/rancher-webhook-5b65595df9   1   1   1   13m
cert-manager   replicaset.apps/cert-manager-76d44b459c   1   1   1   18m
cert-manager   replicaset.apps/cert-manager-cainjector-9b679cc6   1   1   1   18m
cert-manager   replicaset.apps/cert-manager-webhook-57c994b6b9   1   1   1   18m
kube-system   replicaset.apps/coredns-5789895cd   1   1   1   56m
kube-system   replicaset.apps/local-path-provisioner-6c79684f77   1   1   1   56m
kube-system   replicaset.apps/metrics-server-7cd5fcb6b7   1   1   1   56m
metallb-system   replicaset.apps/controller-74df79bb55   1   1   1   56m

teknowill commented 2 years ago

The API endpoint IP pings stably with everything below.

I will try again with a newer k3s later, but with 1.23:

helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.7.1

gets stuck, even though:

kubectl get pods --namespace cert-manager
NAME   READY   STATUS   RESTARTS   AGE
cert-manager-76d44b459c-wr4bp   1/1   Running   0   3m3s
cert-manager-cainjector-9b679cc6-nnx9m   1/1   Running   0   3m3s
cert-manager-startupapicheck-nrrkb   1/1   Running   2 (58s ago)   3m2s
cert-manager-webhook-57c994b6b9-7w959   1/1   Running   0   3m3s


k3s_version: v1.23.4+k3s1

# this is the user that has ssh access to these machines
ansible_user: ----
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "America/New_York"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.1.220"

# k3s_token is required so masters can talk together securely
# this token should be alpha numeric only
k3s_token: "some-SUPER-DEDEUPER-secret-password"

# change these to your liking, the only required one is --no-deploy servicelb
extra_server_args: "--no-deploy servicelb --no-deploy traefik"
extra_agent_args: ""

# image tag for kube-vip
# kube_vip_tag_version: "v0.4.4"
kube_vip_tag_version: "v0.5.0"

# image tag for metal lb
# metal_lb_speaker_tag_version: "v0.12.1"
# metal_lb_controller_tag_version: "v0.12.1"
metal_lb_speaker_tag_version: "v0.13.4"
metal_lb_controller_tag_version: "v0.13.4"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.1.221-192.168.1.239"
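
Side note on the MetalLB bump above: v0.13.x drops the ConfigMap and configures the pool through CRDs, so my understanding is that metal_lb_ip_range ends up rendered into resources roughly like these (the resource names here are placeholders, not necessarily what the playbook generates):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool          # placeholder name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.221-192.168.1.239
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-advert           # placeholder name
  namespace: metallb-system
spec:
  ipAddressPools:
    - first-pool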

teknowill commented 2 years ago

kubectl get pods --namespace cert-manager
NAME   READY   STATUS   RESTARTS   AGE
cert-manager-76d44b459c-wr4bp   1/1   Running   0   5m6s
cert-manager-cainjector-9b679cc6-nnx9m   1/1   Running   0   5m6s
cert-manager-startupapicheck-nrrkb   0/1   CrashLoopBackOff   3 (18s ago)   5m5s
cert-manager-webhook-57c994b6b9-7w959   1/1   Running   0   5m6s

teknowill commented 2 years ago

https://github.com/cert-manager/cert-manager/issues/2773. It helps if I don't skip a line in the docs...
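
For anyone else who hits this: my understanding is that the line I skipped was the CRD install step, and either of these covers it (the CRD manifest URL is the standard cert-manager release artifact; double-check it for your version):

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.7.1/cert-manager.crds.yaml
# or let Helm install the CRDs:
helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.7.1 --set installCRDs=true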

teknowill commented 2 years ago

OK, get nodes and ping to the API are stable using the newest all.yml, and I was able to install cert-manager.


k3s_version: v1.24.3+k3s1
# k3s_version: v1.23.4+k3s1

# this is the user that has ssh access to these machines
ansible_user: ---
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "America/New_York"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.1.220"

# k3s_token is required so masters can talk together securely
# this token should be alpha numeric only
k3s_token: "some-SUPER-DEDEUPER-secret-password"

# change these to your liking, the only required one is --no-deploy servicelb
extra_server_args: "--no-deploy servicelb --no-deploy traefik"
extra_agent_args: ""

# image tag for kube-vip
# kube_vip_tag_version: "v0.4.4"
kube_vip_tag_version: "v0.5.0"

# image tag for metal lb
# metal_lb_speaker_tag_version: "v0.12.1"
# metal_lb_controller_tag_version: "v0.12.1"
metal_lb_speaker_tag_version: "v0.13.4"
metal_lb_controller_tag_version: "v0.13.4"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.1.221-192.168.1.239"

teknowill commented 2 years ago

helm install rancher rancher-stable/rancher \
  --namespace cattle-system ....

Error: INSTALLATION FAILED: chart requires kubeVersion: < 1.24.0-0 which is incompatible with Kubernetes v1.24.3+k3s1

Looks like there is a 1.24 Rancher out there: https://github.com/rancher/client-go/releases/tag/v1.24.0-rancher1

but it's not fully ready yet: https://github.com/rancher/rancher/issues/37711

I'm trying to figure out how to point to it, but I guess I just need to stick with k3s_version: v1.23.4+k3s1 for now. At least I can use the newer kube-vip and MetalLB; not sure what is "different".
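
A quick way to see the chart's constraint before deploying (plain Helm, nothing specific to this repo):

helm show chart rancher-stable/rancher | grep kubeVersion
# prints the same constraint as the error above, i.e. kubeVersion: < 1.24.0-0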

teknowill commented 2 years ago

Yeah, that did it:

NAME: rancher
LAST DEPLOYED: Mon Aug 1 19:10:43 2022
NAMESPACE: cattle-system
STATUS: deployed

I suggest setting all.yml back to v1.23.4+k3s1 until a 1.24-compatible Rancher is ready, at least in your main branch.

timothystewart6 commented 2 years ago

Rancher is not yet compatible with k3s 1.24. It may be soon but that is really going to be up to Rancher to make it compatible.

timothystewart6 commented 2 years ago

I think support is coming in Rancher 2.6.7; I would check their release notes.

timothystewart6 commented 2 years ago

Also, Rancher isn't compatible with the latest cert-manager.

timothystewart6 commented 2 years ago

Also, I've been pinging my VIP for over an hour now and it's stable.

teknowill commented 2 years ago

After I did the pull I had an unstable ping for 2 re-deployment rounds, with no errors in the output, before I put up this post. I think something minor was just off in my copy of all.yml, as later deployments were very stable. I mentioned in a later comment that I ended up getting a stable ping; sorry you wasted time pinging. I can no longer reproduce it.

Re: support coming in 2.6.7: thank you, yes, I found that as well. https://github.com/rancher/rancher/issues/37711

Re: 1.24: I'm just trying to point out that the source copy of all.yml in your repo has a line that installs 1.24. I know a k3s deployment and Rancher are separate things, and I don't know the k3s deployment's target, but if people follow your guidance docs and videos, pull/clone the all.yml that points to 1.24, and then try to deploy Rancher, they'll run into this. Not a big deal.

I'm just suggesting a known Rancher-compatible version stack/branch for your k3s deployment, as well as a latest-k3s stack. In this case it's only one line, but I think Rancher will always lag behind k3s, and you also mentioned cert-manager versions; there could be other things down the line. I could try to expand an Ansible script for a Rancher-on-top stack and post it, if there's any interest/value.

timothystewart6 commented 2 years ago

Thank you for bringing this up, and for all the details.