On the Terraform code, with "usual" PRs:

- `aws` provider update needed by the 19.1.x EKS module
- `eks-public` was recreated manually (terraform apply on my machine instead of through CI)

Note that, with the 19.x changes, the EKS clusters are now private by default.
hotfix for the cik8s cluster: https://github.com/jenkins-infra/aws/commit/27d4f746748edcdb3ba49643cae3d2d329fb3153
Status: 2 new problems to fix:
```
[10.0.0.38] - - [22/Dec/2022:14:47:42 +0000] "GET /.well-known/acme-challenge/<redacted> HTTP/1.1" 401 172 "http://repo.aws.jenkins.io/.well-known/acme-challenge/<redacted>" "cert-manager-challenges/v1.9.1 (linux/amd64) cert-manager/<redacted>" 377 0.000 [artifact-caching-proxy-artifact-caching-proxy-8080] - - - - <redacted>
```
It's weird: the /.well-known location should not require authentication, as per https://github.com/kubernetes/ingress-nginx/blob/f9cce5a4ed7ef372a18bc826e395ff5660b7a444/docs/user-guide/nginx-configuration/configmap.md#no-auth-locations
But since we define a custom configmap, that default might be overwritten: https://github.com/jenkins-infra/kubernetes-management/blob/8c6d91f9a02048f3b9e8fb4a444106f5a08fcfe6/config/ext_public-nginx-ingress__common.yaml#L25-L36 🤔
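A quick way to check whether the `no-auth-locations` setting survived our custom configmap is to look at the rendered controller configmap in the cluster. A minimal sketch, assuming the usual `<release>-ingress-nginx-controller` naming in the `public-nginx-ingress` namespace:

```bash
# List the configmaps of the public ingress controller (the exact name is an
# assumption, it usually matches the Helm release)
kubectl -n public-nginx-ingress get configmap

# Check whether no-auth-locations is present in the rendered configuration:
# if our custom configmap drops it, /.well-known/acme-challenge ends up behind
# authentication, which would explain the 401 above
kubectl -n public-nginx-ingress get configmap \
  public-nginx-ingress-ingress-nginx-controller -o yaml | grep no-auth-locations
```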
Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf that will probably need to be cleaned up if not used by jenkins-infra.
Checked during a team working session: this cluster is cik8s (used by ci.jenkins.io for its builds). We did not find any dangling resource.
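For reference, enumerating the EKS clusters of the account is a one-liner; the region is an assumption taken from the load balancer hostnames appearing later in this issue:

```bash
# List every EKS cluster in the account for the given region
aws eks list-clusters --region us-east-2

# Inspect the tags of the suspicious cluster to identify its owner/purpose
aws eks describe-cluster --region us-east-2 \
  --name jenkins-infra-eks-ENRZrfwf \
  --query 'cluster.tags'
```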
We had an issue with this cluster after the ingress rules were successfully updated with a valid certificate:
the public IPs (the 3 public IPs associated with the 3 availability zones of the public load balancer) weren't reachable at all (even from inside the cluster), while Kubernetes reported everything as fine.
Here are my (raw) notes:
Running `kubectl get svc -A`, we can see that the public-nginx Ingress controller has an AWS LoadBalancer associated with a valid DNS name (which resolves with `dig`):

```
public-nginx-ingress   public-nginx-ingress-ingress-nginx-controller   LoadBalancer   172.20.240.59   k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com   80:31868/TCP,443:32267/TCP   38d
```
From inside the cluster (with `kubectl -n artifact-caching-proxy exec -ti artifact-caching-proxy-0 -- sh` for instance), we try to reach both the private and the public IP of the public Service LB from above:

```
curl -v 172.20.48.207 -o /dev/null
* Trying 172.20.48.207:80...
* Connected to 172.20.48.207 (172.20.48.207) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.20.48.207
> User-Agent: curl/7.83.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 27 Jan 2023 16:45:18 GMT
< Content-Type: text/html
< Content-Length: 1826
< Last-Modified: Mon, 23 Jan 2023 01:36:05 GMT
< Connection: keep-alive
< ETag: "63cde485-722"
< Accept-Ranges: bytes
<
* Connection #0 to host 172.20.48.207 left intact
```
```
curl -v k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com
* Trying 18.116.6.230:80...
# Stuck, need to wait 60s for timeout or issue a Ctrl-C cancellation
```
=> The private IP works as expected, but the public IP(s) of the LB do not answer: the issue is with the LB itself.
Checking the LB in the AWS UI (EC2 -> "Load Balancing" -> "Load Balancers"), selecting the LB and then, in the "Listeners" tab, clicking on the "Default routing rule" of the "TCP:80" line (for example): the list of "target groups" (i.e. the backend IPs of the LB) is empty, which confirms the observed behavior.
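The same check can be done from the CLI; a sketch, assuming the LB name is the first segment of the DNS name shown by `kubectl get svc` above (the ARNs have to be looked up first):

```bash
# Find the load balancer created for the public ingress Service
aws elbv2 describe-load-balancers --names k8s-publicng-publicng-f7332522a1 \
  --query 'LoadBalancers[].LoadBalancerArn'

# List its target groups, then check the registered targets: an empty target
# list matches the empty "target groups" seen in the UI
aws elbv2 describe-target-groups --load-balancer-arn <lb-arn>
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
```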
This list of backend IPs is managed from Kubernetes, in particular by the "AWS Load Balancer Controller" that we installed in this cluster. The role of this component is to watch the Kubernetes API for "Service" resources of type "LoadBalancer" and to create/update/delete the corresponding load balancers through the AWS API.
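The controller tracks each reconciled Service through TargetGroupBinding objects (the resource kind that shows up in the error below), which makes them a useful thing to inspect when targets disappear. A minimal sketch; the binding name is the one reported in the controller log:

```bash
# Services of type LoadBalancer are the ones reconciled by the controller
kubectl get svc -A | grep LoadBalancer

# Each of them gets a TargetGroupBinding (CRD installed by the controller)
# mapping the Service endpoints to the AWS target group
kubectl get targetgroupbindings -A
kubectl -n public-nginx-ingress describe targetgroupbinding k8s-publicng-publicng-7482972d25
```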
Checking the logs of this component (`kubectl -n aws-load-balancer logs -l app.kubernetes.io/instance=aws-load-balancer-controller`) shows the error:

```
{"level":"error","ts":1674838128.6175954,"logger":"controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-publicng-publicng-7482972d25","namespace":"public-nginx-ingress","error":"expect exactly one securityGroup tagged with kubernetes.io/cluster/public-happy-polliwog for eni eni-0970f1ec0888c2d65, got: [sg-0c0d669a830f6e013 sg-0ca36e364f5491978] (clusterName: public-happy-polliwog)"}
```
The fix was to remove the tag kubernetes.io/cluster/public-happy-polliwog=true from the security group of the cluster itself (eks-cluster-sg-public-happy-polliwog-1884802038, usually the first one in the list) while keeping this tag on the SG public-happy-polliwog-node, because this 2nd SG is applied to the Kubernetes node VMs, which host the floating private IPs of the public Service LB.
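For the record, a sketch of the equivalent AWS CLI calls; the SG IDs come from the controller error above, but which of the two is the cluster SG (and the region) are assumptions to double-check:

```bash
# Find every security group carrying the cluster ownership tag:
# exactly one of them (the node SG) should keep it
aws ec2 describe-security-groups --region us-east-2 \
  --filters Name=tag-key,Values=kubernetes.io/cluster/public-happy-polliwog \
  --query 'SecurityGroups[].[GroupId,GroupName]'

# Remove the tag from the cluster security group (sg-... is illustrative:
# use whichever of the two SGs from the error is eks-cluster-sg-public-happy-polliwog-*)
aws ec2 delete-tags --region us-east-2 \
  --resources sg-0c0d669a830f6e013 \
  --tags Key=kubernetes.io/cluster/public-happy-polliwog
```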
=> After 5 min, the whole system is working again.
=> Next step: find how to avoid this duplicate "tagging" in the jenkins-infra/aws Terraform code.
New error: we cannot update the ACP (artifact-caching-proxy) statefulset:
```
Normal   NotTriggerScaleUp  3m58s (x32461 over 3d18h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
Warning  FailedScheduling   2m30s (x5365 over 3d18h)   default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict.
```
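For reference, these events can be pulled from the pending pod (the pod name follows the statefulset ordinal naming seen earlier):

```bash
# Scheduling events of the pending artifact-caching-proxy pod
kubectl -n artifact-caching-proxy describe pod artifact-caching-proxy-0

# Or list only the warnings for the whole namespace
kubectl -n artifact-caching-proxy get events --field-selector type=Warning
```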
=> The autoscaler pod's logs for this cluster show that autoscaling cannot be done:

```
I0207 10:58:37.369492 1 binder.go:791] "Could not get a CSINode object for the node" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" err="csinode.storage.k8s.io \"template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512\" not found"
I0207 10:58:37.369532 1 binder.go:811] "PersistentVolume and node mismatch for pod" PV="pvc-173ee3c5-22ec-4444-bee0-fe7b8ece01fa" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" pod="artifact-caching-proxy/artifact-caching-proxy-0" err="no matching NodeSelectorTerms"
I0207 10:58:37.369561 1 scale_up.go:300] Pod artifact-caching-proxy-0 can't be scheduled on eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0207 10:58:37.369819 1 scale_up.go:449] No pod can fit to eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44
I0207 10:58:37.369836 1 scale_up.go:453] No expansion options
```
It looks like https://github.com/kubernetes/autoscaler/issues/4811: the PVCs are bound to a single AZ, and the autoscaler fails to scale up nodes in the matching AZ, so it's stuck 🤦
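To confirm the mismatch, one can compare the zone pinned in the PersistentVolume's node affinity with the zones actually covered by the nodes; a quick sketch (the PV name is the one from the autoscaler log above):

```bash
# Zone(s) the PersistentVolume is pinned to through its node affinity
kubectl get pv pvc-173ee3c5-22ec-4444-bee0-fe7b8ece01fa \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'

# Zones currently covered by the nodes of the cluster
kubectl get nodes -L topology.kubernetes.io/zone
```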
https://github.com/jenkins-infra/aws/pull/333 was merged: we are watching the effect
Temporarily unblocking the kube management builds: https://github.com/jenkins-infra/kubernetes-management/commit/0288bb0748a85242f4bf1c126d121817d2cd1c1d (this commit will have to be reverted once repo.aws is fixed)
It seems we have found a working setup:
=> We have to update the autoscaler configuration to be highly available AND to take the topology into account (it is not by default: https://github.com/kubernetes/autoscaler/blob/9158196a3c06ed754fc4333ac67417e66a4ec274/charts/cluster-autoscaler/values.yaml#L180), on both cik8s and eks-public (see the sketch below).
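A minimal sketch of what that could look like with the cluster-autoscaler Helm chart; the release name, namespace and flags below are assumptions for illustration, the actual change lives in our jenkins-infra/kubernetes-management repository:

```bash
# Illustrative only: run two leader-elected autoscaler replicas and let the
# autoscaler balance similar per-AZ node groups, so it can scale up the group
# in the AZ required by the volume node affinity.
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace autoscaler \
  --set replicaCount=2 \
  --set "extraArgs.balance-similar-node-groups=true"
```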
Closing as the problem is now fixed \o/
This issue is related to a major bump of the Terraform EKS module that we use in https://github.com/jenkins-infra/aws to manage the two EKS clusters of our infrastructure (cik8s and eks-public). This issue is an audit trail of the whole operation.