kubernetes-sigs / cloud-provider-kind

Cloud provider for KIND clusters
Apache License 2.0

cloud-provider-kind freezes my system when using istio #78

Closed: iblancasa closed this issue 3 months ago

iblancasa commented 3 months ago

The Istio documentation for kind points to Cloud Provider KIND to set up MetalLB.

I'm using kind 0.23.0 and cloud-provider-kind 0.1.0.

When I start a kind cluster with this configuration:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.29.4@sha256:3abb816a5b1061fb15c6e9e60856ec40d56b7b52bcea5f5f1350bc6e2320b6f8
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
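
For reference, the cluster is created with something like this (the config file name is just a placeholder for the YAML above):

$ kind create cluster --name workshop --config kind-config.yaml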

Everything works properly when I start cloud-provider-kind. When I start istio version 1.22.0 with istioctl install -y, the system freezes. After that, I have to reboot.

I'm running Fedora release 39 (Thirty Nine).

iblancasa commented 3 months ago

I can provide more information but I'm not sure where to look.

BenTheElder commented 3 months ago

Everything works properly when I start cloud-provider-kind. When I start istio version 1.22.0 with istioctl install -y, the system freezes. After that, I have to reboot.

... does it freeze without cloud-provider-kind? This seems like an istio <> kind issue rather than a cloud-provider-kind one.

The Istio documentation for kind points to Cloud Provider KIND to set up MetalLB.

The pointer here links to a page in the kind docs that used to have a MetalLB install but currently covers cloud-provider-kind. The Istio docs should be corrected: if MetalLB is still desired, some other link will need to be used; if not, we should drop the reference to MetalLB.

cc @howardjohn

howardjohn commented 3 months ago

I think @craigbox or @danehans was looking into ^ already

iblancasa commented 3 months ago

... does it freeze without cloud-provider-kind? This seems like an istio <> kind issue rather than a cloud-provider-kind one.

It doesn't freeze if I'm just using kind + cloud provider kind. As soon as Istio is started... everything freezes. If I use Istio + kind (no cloud provider kind) everything goes well. It only fails if I have kind + cloud provider kind + Istio.

I tried:

  1. Start kind, start cloud provider kind, start Istio
  2. Start kind, start Istio, start cloud provider kind

Same result.
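
In commands, case 1 is roughly this (cluster created as above, cloud-provider-kind left running in its own terminal):

$ cloud-provider-kind
$ istioctl install -y

Case 2 is the same two commands in the opposite order.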

The pointer here links to a page in the kind docs that used to have a MetalLB install but currently covers cloud-provider-kind. The Istio docs should be corrected: if MetalLB is still desired, some other link will need to be used; if not, we should drop the reference to MetalLB.

I reached that page from https://istio.io/latest/docs/tasks/traffic-management/ingress/ingress-control/. I can see that the EXTERNAL-IP is <pending> when running kubectl get svc "$INGRESS_NAME" -n "$INGRESS_NS" (I'm running that command in a different terminal using watch). When I start cloud-provider-kind, the status goes from <pending> to an IP. But later... everything freezes.
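
Concretely, what I have running in the other terminal (variables set as in that Istio doc, i.e. the ingress gateway Service in the istio-system namespace):

$ export INGRESS_NAME=istio-ingressgateway INGRESS_NS=istio-system
$ watch kubectl get svc "$INGRESS_NAME" -n "$INGRESS_NS"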

I was planning to use kind for a workshop but I'll need to switch to minikube or something else.

Anyway, let me know if there is any extra information I can provide to help fix the issue.

Thanks!

craigbox commented 3 months ago

Can you elaborate on "everything freezes"?

If your host system freezes and has to reboot, then it sounds like either a bug in Docker (unlikely) or a bug in cloud-provider-kind. I don't immediately see how it could be a bug in kind or Istio.

Installing Istio (or to be clear, a Gateway) is probably the first time that cloud-provider-kind sees a LoadBalancer that it needs to provision.
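
(To rule Istio out, any Service of type LoadBalancer should exercise the same provisioning path; a minimal sketch with placeholder names:)

$ kubectl create deployment echo --image=nginx
$ kubectl expose deployment echo --port=80 --type=LoadBalancer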

(It just so happens I tried cloud-provider-kind last week and it worked for me; yes, we are still referring to old MetalLB docs which were changed from underneath us, and I'll update the links next week.)

iblancasa commented 3 months ago

Can you elaborate on "everything freezes"?

I can do nothing with my machine. The system becomes unresponsive. The only thing I can do is to push the power button and wait until the laptop is forced to shut down.

If your host system freezes and has to reboot, then it sounds like either a bug in Docker (unlikely) or a bug in cloud-provider-kind. I don't immediately see how it could be a bug in kind or Istio.

I don't think it is a bug in Istio or kind, but that is the combination where I experienced the issue.

This is the information from my docker environment:

$ docker version
Client: Docker Engine - Community
 Version:           26.1.3
 API version:       1.45
 Go version:        go1.21.10
 Git commit:        b72abbb
 Built:             Thu May 16 08:35:25 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.3
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.10
  Git commit:       8e96db1
  Built:            Thu May 16 08:33:42 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.31
  GitCommit:        e377cd56a71523140ca6ae87e30244719194a521
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

craigbox commented 3 months ago

Can you log the cloud-provider-kind stdout to a file so you can get it after the reboot and attach it here?
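
Something along these lines should survive the reboot (the log file name is just an example):

$ cloud-provider-kind 2>&1 | tee cloud-provider-kind.log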

iblancasa commented 3 months ago

I0530 13:24:40.077190   10025 controller.go:167] probe HTTP address https://127.0.0.1:41009
I0530 13:24:40.079794   10025 controller.go:88] Creating new cloud provider for cluster workshop
I0530 13:24:40.084009   10025 controller.go:95] Starting cloud controller for cluster workshop
I0530 13:24:40.084023   10025 node_controller.go:165] Sending events to api server.
I0530 13:24:40.084187   10025 controller.go:231] Starting service controller
I0530 13:24:40.084239   10025 shared_informer.go:311] Waiting for caches to sync for service
I0530 13:24:40.084440   10025 node_controller.go:174] Waiting for informer caches to sync
I0530 13:24:40.085664   10025 reflector.go:351] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:159
I0530 13:24:40.085719   10025 reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go/informers/factory.go:159
I0530 13:24:40.184494   10025 shared_informer.go:318] Caches are synced for service
I0530 13:24:40.184581   10025 controller.go:733] Syncing backends for all LB services.
I0530 13:24:40.184601   10025 controller.go:737] Successfully updated 0 out of 0 load balancers to direct traffic to the updated set of nodes
I0530 13:24:40.184613   10025 instances.go:47] Check instance metadata for workshop-control-plane
I0530 13:24:40.184710   10025 controller.go:398] Ensuring load balancer for service istio-system/istio-ingressgateway
I0530 13:24:40.184794   10025 controller.go:954] Adding finalizer to service istio-system/istio-ingressgateway
I0530 13:24:40.184955   10025 event.go:376] "Event occurred" object="istio-system/istio-ingressgateway" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0530 13:24:40.203753   10025 loadbalancer.go:28] Ensure LoadBalancer cluster: workshop service: istio-ingressgateway
I0530 13:24:40.204765   10025 instances.go:75] instance metadata for workshop-control-plane: &cloudprovider.InstanceMetadata{ProviderID:"kind://workshop/kind/workshop-control-plane", InstanceType:"kind-node", NodeAddresses:[]v1.NodeAddress{v1.NodeAddress{Type:"Hostname", Address:"workshop-control-plane"}, v1.NodeAddress{Type:"InternalIP", Address:"172.18.0.3"}, v1.NodeAddress{Type:"InternalIP", Address:"fc00:f853:ccd:e793::3"}}, Zone:"", Region:""}
I0530 13:24:40.217251   10025 node_controller.go:267] Update 1 nodes status took 32.689405ms.
I0530 13:24:40.222627   10025 server.go:100] updating loadbalancer
I0530 13:24:40.222650   10025 proxy.go:126] address type Hostname, only InternalIP supported
I0530 13:24:40.222660   10025 proxy.go:126] address type Hostname, only InternalIP supported
I0530 13:24:40.222664   10025 proxy.go:126] address type Hostname, only InternalIP supported
I0530 13:24:40.222667   10025 proxy.go:140] haproxy config info: &{HealthCheckPort:10256 ServicePorts:map[IPv4_15021:{BindAddress:*:15021 Backends:map[workshop-control-plane:172.18.0.3:31025]} IPv4_443:{BindAddress:*:443 Backends:map[workshop-control-plane:172.18.0.3:31055]} IPv4_80:{BindAddress:*:80 Backends:map[workshop-control-plane:172.18.0.3:31209]}]}
I0530 13:24:40.222744   10025 proxy.go:155] updating loadbalancer with config 
global
  log /dev/log local0
  log /dev/log local1 notice
  daemon

resolvers docker
  nameserver dns 127.0.0.11:53

defaults
  log global
  mode tcp
  option dontlognull
  # TODO: tune these
  timeout connect 5000
  timeout client 50000
  timeout server 50000
  # allow to boot despite dns don't resolve backends
  default-server init-addr none

frontend IPv4_15021-frontend
  bind *:15021
  default_backend IPv4_15021-backend
  # reject connections if all backends are down
  tcp-request connection reject if { nbsrv(IPv4_15021-backend) lt 1 }

backend IPv4_15021-backend
  option httpchk GET /healthz
  server workshop-control-plane 172.18.0.3:31025 check port 10256 inter 5s fall 3 rise 1

frontend IPv4_443-frontend
  bind *:443
  default_backend IPv4_443-backend
  # reject connections if all backends are down
  tcp-request connection reject if { nbsrv(IPv4_443-backend) lt 1 }

backend IPv4_443-backend
  option httpchk GET /healthz
  server workshop-control-plane 172.18.0.3:31055 check port 10256 inter 5s fall 3 rise 1

frontend IPv4_80-frontend
  bind *:80
  default_backend IPv4_80-backend
  # reject connections if all backends are down
  tcp-request connection reject if { nbsrv(IPv4_80-backend) lt 1 }

backend IPv4_80-backend
  option httpchk GET /healthz
  server workshop-control-plane 172.18.0.3:31209 check port 10256 inter 5s fall 3 rise 1

I0530 13:24:40.261097   10025 proxy.go:163] restarting loadbalancer
I0530 13:24:40.287411   10025 server.go:116] get loadbalancer status
I0530 13:24:40.296570   10025 controller.go:995] Patching status for service istio-system/istio-ingressgateway
I0530 13:24:40.296641   10025 event.go:376] "Event occurred" object="istio-system/istio-ingressgateway" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"

aojea commented 3 months ago

It would be good if you can run top or htop in parallel to see if some process has a problem and consumes all the CPU, causing the freeze. For more advanced diagnostics you can follow these guidelines: https://en.wikibooks.org/wiki/Linux_Guide/Freezes
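
For example, something like this keeps a record you can read back after the reboot (the file name is just an example):

$ top -b -d 5 > top.log 2>&1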

iblancasa commented 3 months ago

@aojea I tried 4 times. The haproxy process appears and takes all the CPU.

BenTheElder commented 3 months ago

yes, we are still referring to old MetalLB docs which were changed from underneath us, and I'll update the links next week

yeah, sorry about that; that's more of an aside we should follow up on, as I just noticed this impact.

The haproxy process appears and takes all the CPU.

... interesting, can you try with the latest code:

~go install sigs.k8s.io/cloud-provider-kind@main (not @latest which is the most recent tagged release)~

EDIT: there's now a release https://github.com/kubernetes-sigs/cloud-provider-kind/issues/78#issuecomment-2140441178 (so go install sigs.k8s.io/cloud-provider-kind@latest or any of the other currently documented install methods)

since ~the last release~ v0.1.0, @aojea switched from haproxy to envoy, amongst other changes ...

aojea commented 3 months ago

We have to cut a new release with envoy and UDP support

aojea commented 3 months ago

@iblancasa can you try with the new release https://github.com/kubernetes-sigs/cloud-provider-kind/releases/tag/v0.2.0?
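
For example, if you install via go (any of the other documented install methods should work too):

$ go install sigs.k8s.io/cloud-provider-kind@v0.2.0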

iblancasa commented 3 months ago

It works fine! Thanks a lot :)

aojea commented 3 months ago

It works fine! Thanks a lot :)

Thanks for the feedback