rancher / k3os

Purpose-built OS for Kubernetes, fully managed by Kubernetes.
https://k3os.io
Apache License 2.0

EXTERNAL-IP <pending> forever #208

Open frafra opened 4 years ago

frafra commented 4 years ago

How to reproduce:

ip="your_server_ip"
rm -r ~/.kube
mkdir ~/.kube
scp rancher@$ip:/etc/rancher/k3s/k3s.yaml ~/.kube/config
sed -i "s/localhost/$ip/g" ~/.kube/config

ssh rancher@$ip mkdir /home/rancher/local-path-provisioner
curl -s https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml | sed 's;/opt;/home/rancher;g' | kubectl apply -f -
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

kubectl -n kube-system create sa tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --history-max 200

helm install --name test-wordpress stable/wordpress

Result:

$ kubectl get svc --namespace default test-wordpress
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
test-wordpress   LoadBalancer   10.43.29.245   <pending>     80:30003/TCP,443:31258/TCP   10m

Additional details:

dweomer commented 4 years ago

@frafra do you run into the same issue when downloading and installing helm onto the k3os host and invoking from there?

frafra commented 4 years ago

@dweomer I actually get a different error:

k3os-0 [~]$ helm init --service-account tiller --history-max 200
$HELM_HOME has been configured at /home/rancher/.helm.
Error: error installing: Post http://localhost:8080/apis/extensions/v1beta1/namespaces/kube-system/deployments: dial tcp 127.0.0.1:8080: connect: connection refused

dweomer commented 4 years ago

@frafra did you make sure that k3s was fully up and running (it can take anywhere from 20 seconds to a few minutes) before attempting the helm invocation? Take a look at https://github.com/rancher/k3s/blob/master/scripts/sonobuoy for some inspiration on automating this. Specifically, note that there are effectively three wait phases (kubeconfig, nodes, services) and that you will want to include traefik along with coredns in the services to wait on. If that's all a bit too much and/or you just need to wait manually, I have found that invoking kubectl get -A pods repeatedly until there are four pod lines, with the helm line at "Completed" status and the others at "Running", is enough.
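
A rough manual equivalent of those three wait phases might look like the following; the component names and timeouts here are assumptions based on a default k3s install, not taken from the sonobuoy script:

until kubectl get nodes >/dev/null 2>&1; do sleep 2; done    # kubeconfig/API reachable
kubectl wait --for=condition=Ready node --all --timeout=300s
kubectl -n kube-system wait --for=condition=Complete job/helm-install-traefik --timeout=300s
kubectl -n kube-system wait --for=condition=Available deploy/coredns deploy/traefik --timeout=300s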

frafra commented 4 years ago

@dweomer thanks, but I waited more than 20 minutes and it still fails. Helm v2.14.3.

zimme commented 4 years ago

@dweomer I'm seeing pretty much the same problem when I try to run this on my Hetzner instance.

➜  ~ kubectl get all -A
NAMESPACE            NAME                                           READY   STATUS             RESTARTS   AGE
kube-system          pod/coredns-66f496764-9qhkf                    1/1     Running            0          20h
kube-system          pod/helm-install-traefik-ct278                 0/1     Completed          0          20h
kube-system          pod/svclb-traefik-v86f5                        3/3     Running            0          20h
kube-system          pod/traefik-d869575c8-mkfwx                    1/1     Running            0          20h
kube-system          pod/svclb-traefik-8xhqq                        3/3     Running            0          20h
local-path-storage   pod/local-path-provisioner-ccbdd96dc-zq2f4     1/1     Running            0          18h
kube-system          pod/tiller-deploy-6b9c575bfc-kfmr4             1/1     Running            0          17h
default              pod/svclb-test-wordpress-wordpress-gtjl2       0/2     Pending            0          17h
default              pod/svclb-test-wordpress-wordpress-wqlwk       0/2     Pending            0          17h
default              pod/test-wordpress-mariadb-0                   1/1     Running            0          17h
default              pod/test-wordpress-wordpress-5b7848c78-gbjw5   0/1     CrashLoopBackOff   179        17h

NAMESPACE     NAME                               TYPE           CLUSTER-IP      EXTERNAL-IP                  PORT(S)                                     AGE
kube-system   service/kube-dns                   ClusterIP      10.43.0.10      <none>                       53/UDP,53/TCP,9153/TCP                      20h
default       service/kubernetes                 ClusterIP      10.43.0.1       <none>                       443/TCP                                     20h
kube-system   service/traefik                    LoadBalancer   10.43.116.146   10.64.96.39,95.216.214.152   80:31212/TCP,443:31690/TCP,8080:31910/TCP   20h
kube-system   service/tiller-deploy              ClusterIP      10.43.172.201   <none>                       44134/TCP                                   17h
default       service/test-wordpress-mariadb     ClusterIP      10.43.127.174   <none>                       3306/TCP                                    17h
default       service/test-wordpress-wordpress   LoadBalancer   10.43.182.251   <pending>                    80:32594/TCP,443:32402/TCP                  17h

NAMESPACE     NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/svclb-traefik                    2         2         2       2            2           <none>          20h
default       daemonset.apps/svclb-test-wordpress-wordpress   2         2         0       2            0           <none>          17h

NAMESPACE            NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
kube-system          deployment.apps/coredns                    1/1     1            1           20h
kube-system          deployment.apps/traefik                    1/1     1            1           20h
local-path-storage   deployment.apps/local-path-provisioner     1/1     1            1           18h
kube-system          deployment.apps/tiller-deploy              1/1     1            1           17h
default              deployment.apps/test-wordpress-wordpress   0/1     1            0           17h

NAMESPACE            NAME                                                 DESIRED   CURRENT   READY   AGE
kube-system          replicaset.apps/coredns-66f496764                    1         1         1       20h
kube-system          replicaset.apps/traefik-d869575c8                    1         1         1       20h
local-path-storage   replicaset.apps/local-path-provisioner-ccbdd96dc     1         1         1       18h
kube-system          replicaset.apps/tiller-deploy-6b9c575bfc             1         1         1       17h
default              replicaset.apps/test-wordpress-wordpress-5b7848c78   1         1         0       17h

NAMESPACE   NAME                                      READY   AGE
default     statefulset.apps/test-wordpress-mariadb   1/1     17h

NAMESPACE     NAME                             COMPLETIONS   DURATION   AGE
kube-system   job.batch/helm-install-traefik   1/1           26s        20h

➜  ~ kubectl logs pod/test-wordpress-wordpress-5b7848c78-gbjw5

Welcome to the Bitnami wordpress container
Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-wordpress
Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-wordpress/issues

WARN  ==> You set the environment variable ALLOW_EMPTY_PASSWORD=yes. For safety reasons, do not use this flag in a production environment.
nami    INFO  Initializing apache
nami    INFO  apache successfully initialized
nami    INFO  Initializing php
nami    INFO  php successfully initialized
nami    INFO  Initializing mysql-client
nami    INFO  mysql-client successfully initialized
nami    INFO  Initializing wordpress
mysql-c INFO  Trying to connect to MySQL server
Error executing 'postInstallation': Failed to connect to test-wordpress-mariadb:3306 after 36 tries

It's been pending for about 17h now, and I also see that the wordpress container keeps crashing because it can't connect to the db.

boelenz commented 4 years ago

I have the same event happening. I cannot see what caused it.

zimme commented 4 years ago

~~Pretty sure this has been resolved by #355, as the metadata is properly fetched now when using Hetzner as the datasource.~~

Scratch that; I just tested installing wordpress via a helmchart manifest file and it's still pending.

recklessop commented 4 years ago

I'm seeing the same thing on a LoadBalancer I just deployed...

djpbessems commented 4 years ago

Seeing the same on a freshly provisioned K3s cluster; I installed Longhorn (hoping to evaluate it/use it for NFS-mounts or just regular distributed block storage), but I'm seeing that the longhorn-frontend service stays on <pending>:

NAMESPACE         NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
default           kubernetes                ClusterIP      10.43.0.1       <none>           443/TCP                      19m
kube-system       kube-dns                  ClusterIP      10.43.0.10      <none>           53/UDP,53/TCP,9153/TCP       19m
kube-system       metrics-server            ClusterIP      10.43.198.102   <none>           443/TCP                      19m
default           traefik                   LoadBalancer   10.43.59.5      192.168.11.245   80:30249/TCP,443:32686/TCP   3m1s
longhorn-system   longhorn-backend          ClusterIP      10.43.129.62    <none>           9500/TCP                     93s
longhorn-system   longhorn-frontend         LoadBalancer   10.43.163.220   <pending>        80:30538/TCP

When I check the status of one of the pods, I see the following:

Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports.

Rojikku commented 4 years ago

I also have this issue. I am running a one-node kubernetes cluster, barely understand kubernetes, and am really just learning and playing around.

The solution is as follows: kubectl patch svc SERVICENAME -p '{"spec": {"type": "LoadBalancer", "externalIPs":["192.168.0.X"]}}'

Obviously replace the IP with the desired IP, and the SERVICENAME with the right name. You can get the right name from kubectl get svc. I'm using this for my nginx ingress, because apparently I have to set up an ingress in order to set up Rancher. There are probably issues associated with this solution. For example, if my node at that IP goes down, what would happen?

djpbessems commented 4 years ago

I solved it by changing the service from type LoadBalancer to ClusterIP and then creating an IngressRoute for Traefik to publish the dashboard.
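
For reference, the type change itself can be done with a patch along these lines; the longhorn-frontend name and namespace are just taken from the earlier output, and on some Kubernetes versions you may also need to null out the leftover nodePort fields:

# names taken from the earlier kubectl output; adjust for your cluster
kubectl -n longhorn-system patch svc longhorn-frontend -p '{"spec": {"type": "ClusterIP"}}'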

Rojikku commented 4 years ago

> I solved it by changing the service from type LoadBalancer to ClusterIP and then creating an IngressRoute for Traefik to publish the dashboard.

I've been working on a better solution, and was working on setting up Traefik. Could you give me some more information on how you set that up?

EDIT: Currently got stable/traefik working by doing the patch command on it, and that seems to work fine, as far as I know.

djpbessems commented 4 years ago

In case you're still interested (this is for Traefik 2.x):

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: traefik-dashboard
  namespace: default
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`traefik.example.com`)
      kind: Rule
      services:
        - name: api@internal
          kind: TraefikService
  tls:
    certResolver: default
    options:
      name: default
    domains:
    - main: 'traefik.example.com'

Of course, you will also have to define a TLSOption and a certResolver.

teknofile commented 4 years ago

> I also have this issue. I am running a one-node kubernetes cluster, barely understand kubernetes, and am really just learning and playing around.
>
> The solution is as follows: kubectl patch svc SERVICENAME -p '{"spec": {"type": "LoadBalancer", "externalIPs":["192.168.0.X"]}}'
>
> Obviously replace the IP with the desired IP, and the SERVICENAME with the right name. You can get the right name from kubectl get svc. I'm using this for my nginx ingress, because apparently I have to set up an ingress in order to set up Rancher. There are probably issues associated with this solution. For example, if my node at that IP goes down, what would happen?

How did you determine what IP address to give? I'm fairly certain that this error is because it is binding port 80 on the same IP address that traefik is already listening on, but I can't find out how k3s allocates EXTERNAL-IPs.

Rojikku commented 4 years ago

> I also have this issue. I am running a one-node kubernetes cluster, barely understand kubernetes, and am really just learning and playing around. The solution is as follows: kubectl patch svc SERVICENAME -p '{"spec": {"type": "LoadBalancer", "externalIPs":["192.168.0.X"]}}' Obviously replace the IP with the desired IP, and the SERVICENAME with the right name. You can get the right name from kubectl get svc. I'm using this for my nginx ingress, because apparently I have to set up an ingress in order to set up Rancher. There are probably issues associated with this solution. For example, if my node at that IP goes down, what would happen?
>
> How did you determine what IP address to give? I'm fairly certain that this error is because it is binding port 80 on the same IP address that traefik is already listening on, but I can't find out how k3s allocates EXTERNAL-IPs.

So the machine itself has an IP that you probably already know. That's the one I'm talking about.

Actually, my above solution is horribly wrong, I have since found. This is simply for the reason that K3os comes with Traefik installed under the kube-system namespace. As such, you shouldn't need to install any further services with external IPs. Just set up a service and an ingress route, and it should work out. At least, it has for me.
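
As a sketch of that approach with the bundled Traefik (the names and host are illustrative, and on older clusters the Ingress apiVersion may be networking.k8s.io/v1beta1):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app   # an existing ClusterIP service
                port:
                  number: 80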

zimme commented 4 years ago

I believe this might have had something to do with the fact that Hetzner didn't offer a load balancer as a service at the time, and k8s doesn't come with an implementation of LoadBalancer; it's something you need to provide using something like MetalLB, etc. I do believe k3s comes with klipper-lb, which you can disable with --no-deploy servicelb.
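
On plain k3s that flag is passed to the server process; a sketch assuming the standard install script (the flag spelling changed to --disable in later releases):

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --no-deploy servicelb" sh -
# on newer k3s releases:
# curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --disable servicelb" sh -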

Hetzner Cloud might need something like MetalLB for the external IP to actually get assigned.

Hetzner now seems to offer a load balancer as a service, just like AWS, GCloud, etc., and that might be usable here.

I might be completely off here as I'm pretty new to the k8s world.

mysticaltech commented 3 years ago

@zimme The Hetzner CCM is what offers that functionality.

bitmage commented 2 years ago

I was confused by this as well, coming into k3os with a bare-metal test lab. It turns out this "external IP pending forever" is exactly what one would expect in a cluster with no LoadBalancer implementation.

This article describes the flow of a working setup with HTTP traffic coming in, going through MetalLB to Traefik, to the end service. In particular I found this insightful: "A Kubernetes LoadBalancer service is not a load balancer, it's a target for a load balancer, and typically this load balancer is external [to the cluster]."

This article explores multiple ways of solving load balancing depending on your environment and goals.

If I'm understanding correctly, in my case:

  1. MetalLB would look for any LoadBalancer service definitions (like the Traefik service).
  2. My router doesn't support BGP, so MetalLB would function in its L2 mode.
  3. MetalLB elects a node to act as the load balancer for that service.
  4. That node would send ARP packets to the router claiming an IPv4 address from the pool that MetalLB is configured with (and which I would exempt from DHCP assignments in the router). This IP will remain assigned to the service, and in case of a node failure would be reassigned to an available node; a sketch of this pool configuration follows the list.
  5. This gives me a static IP that I can port forward to. Traefik is listening on this node (and all nodes) on ports 80 and 443.
  6. Traefik looks at each incoming request, looks at its Ingress rules, and finds the appropriate service to forward to based on domain and URL path routing rules.
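
On current MetalLB releases the pool and the layer-2 announcement from step 4 are declared with CRDs; a minimal sketch, assuming the same 192.168.1.50-59 range that appears later in this thread (older releases use a ConfigMap or Helm configInline instead):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.50-192.168.1.59
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default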

Looking for SSL? If you have a single instance of Traefik you can follow their docs. But K3os by default has Traefik configured for high availability (running on each node), and as of 2.0 Traefik only supports HA/LetsEncrypt in its Enterprise version. So this will additionally require setting up a cert-manager service on your cluster and configuring your Ingress rules to take advantage of it. You don't have to use Cloudflare DNS as the article suggests, but you may have to implement a webhook if your DNS provider is not supported.
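
A minimal cert-manager issuer along those lines might look like the following sketch; the email, secret names, and the Cloudflare token are placeholders, and other DNS providers have their own solver blocks or webhooks:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com                 # placeholder
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token     # placeholder Secret holding the API token
              key: api-token

A Certificate resource referencing that issuer then produces a TLS Secret that an IngressRoute can point at via spec.tls.secretName, instead of using a certResolver.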

It's a lot of steps to go through, and considerably more effort than I was expecting to get traffic into the cluster. Seems like there should be a guide for this linked from the main k3os readme. I can't imagine someone wanting to set up a cluster and not caring about ingress. I'll try it out and post back here if I get it working.

bitmage commented 2 years ago

Ok, I got this working and it wasn't so bad.

  1. First, the existing Klipper LB is going to interfere with MetalLB. Where did that come from? I swear I had the pending IP issue before. Well anyway, let's get rid of Klipper by adjusting the k3s configuration. Why prefer MetalLB? I think mainly because the IP will get reassigned if a node goes down, so that seems a little more reliable.
    ssh rancher@labc1
    sudo su
    mount -o remount,rw /k3os/system
    vim /k3os/system/config.yaml
    reboot

Modify config.yaml:

k3s_args:
  - server
  - --disable=servicelb
  2. I installed MetalLB through helm:
    helm repo add metallb https://metallb.github.io/metallb
    helm install metallb metallb/metallb -f artifacts/metallb/values.yaml

values.yaml:

configInline:
  address-pools:
   - name: default
     protocol: layer2
     addresses:
     - 192.168.1.50-192.168.1.59
  3. You may need to delete the traefik service and re-apply:
    kubectl get svc traefik -o yaml > traefik.yaml
    # (modify the file as needed)
    kubectl delete -f traefik.yaml
    kubectl apply -f traefik.yaml
  4. You should see an IPAllocated message pop up in traefik service description:
    Type    Reason        Age   From                Message
    ----    ------        ----  ----                -------
    Normal  IPAllocated   21s   metallb-controller  Assigned IP ["192.168.1.50"]
    Normal  nodeAssigned  21s   metallb-speaker     announcing from node "laba9"
  5. At this point I put 192.168.1.50 mydomain.com into my /etc/hosts file and I was able to test with an nginx base image. Here's the yaml for that test server if you'd like to use this yourself. Just change Host towards the end.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-dep
  namespace: experiment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-dep
  template:
    metadata:
      labels:
        app: hello-dep
      name: hello
      namespace: experiment
    spec:
      containers:
        - name: hello
          image: nginx
          ports:
            - name: http
              containerPort: 80

---
apiVersion: v1
kind: Service
metadata:
  name: hello-service
  namespace: experiment
spec:
  selector:
    app: hello-dep
  ports:
    - protocol: TCP
      port: 80

---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: hello-ingress
  namespace: experiment
spec:
  entryPoints:
    - web
  routes:
  - match: Host(`mydomain.com`)
    kind: Rule
    services:
    - name: hello-service
      port: 80

That's it! Appears to be working from my end now. SSL is next up.

Ashkaan commented 1 year ago

This doesn't work for me. Klipper doesn't go away even though my config is correct and the service has been restarted. Any ideas on how to get rid of klipper completely?

valerauko commented 10 months ago

For me the issue was that when upgrading to a newer version of k3s, at some point apparently the naming scheme changed.

Meaning I had a svclb-traefik (running rancher/klipper-lb:v0.2.0) and a svclb-traefik- (running rancher/klipper-lb:v0.4.4).

kubectl -n kube-system get daemonset

The latter stayed pending because obviously the old one (still functioning without issue) had the respective ports already taken on all nodes.

I resolved this by disabling the old DaemonSet: I first patched it to a non-existent node selector, then confirmed that the new DaemonSet became available on all nodes (and that my public sites were actually available), and finally deleted the old DaemonSet manually.
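
In command form the sequence was roughly this; the "does-not-exist" label is arbitrary, and the DaemonSet name is the old one from kubectl -n kube-system get daemonset:

# park the old DaemonSet on a node selector that matches nothing
kubectl -n kube-system patch daemonset svclb-traefik \
  -p '{"spec": {"template": {"spec": {"nodeSelector": {"does-not-exist": "true"}}}}}'
# once the new svclb DaemonSet is Ready on all nodes, delete the old one
kubectl -n kube-system delete daemonset svclb-traefik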