On the Terraform code, with "usual" PRs:

- `aws` provider update needed by the 19.1.x EKS module
- `eks-public` was recreated manually (terraform apply on my machine instead of through CI)

Note that, with the 19.x changes, the EKS clusters are now private by default.
hotfix for the cik8s cluster: https://github.com/jenkins-infra/aws/commit/27d4f746748edcdb3ba49643cae3d2d329fb3153
Status: 2 new problems to fix:
```
[10.0.0.38] - - [22/Dec/2022:14:47:42 +0000] "GET /.well-known/acme-challenge/<redacted> HTTP/1.1" 401 172 "http://repo.aws.jenkins.io/.well-known/acme-challenge/<redacted>" "cert-manager-challenges/v1.9.1 (linux/amd64) cert-manager/<redacted>" 377 0.000 [artifact-caching-proxy-artifact-caching-proxy-8080] - - - - <redacted>
```
It's weird: the /.well-known location should not require authentication, as per https://github.com/kubernetes/ingress-nginx/blob/f9cce5a4ed7ef372a18bc826e395ff5660b7a444/docs/user-guide/nginx-configuration/configmap.md#no-auth-locations
But since we define a custom configmap, that default might be overwritten: https://github.com/jenkins-infra/kubernetes-management/blob/8c6d91f9a02048f3b9e8fb4a444106f5a08fcfe6/config/ext_public-nginx-ingress__common.yaml#L25-L36 🤔
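A quick way to check whether the `no-auth-locations` setting survived our custom configmap is to look at the rendered controller configmap in the cluster. A minimal sketch, assuming the usual `<release>-ingress-nginx-controller` naming in the `public-nginx-ingress` namespace:

```bash
# List the configmaps of the public ingress controller (the exact name is an
# assumption, it usually matches the Helm release)
kubectl -n public-nginx-ingress get configmap

# Check whether no-auth-locations is present in the rendered configuration:
# if our custom configmap drops it, /.well-known/acme-challenge ends up behind
# authentication, which would explain the 401 above
kubectl -n public-nginx-ingress get configmap \
  public-nginx-ingress-ingress-nginx-controller -o yaml | grep no-auth-locations
```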
Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf that will probably need to be cleaned up if not used by jenkins-infra.
Checked during a team working session: this cluster is cik8s (used by ci.jenkins.io for its builds). We did not find any dangling resource.
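For reference, enumerating the EKS clusters of the account is a one-liner; the region is an assumption taken from the load balancer hostnames appearing later in this issue:

```bash
# List every EKS cluster in the account for the given region
aws eks list-clusters --region us-east-2

# Inspect the tags of the suspicious cluster to identify its owner/purpose
aws eks describe-cluster --region us-east-2 \
  --name jenkins-infra-eks-ENRZrfwf \
  --query 'cluster.tags'
```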
We had an issue with this cluster after the ingress rules were successfully updated with a valid certificate:
the public IPs (the 3 public IPs associated with the 3 availability zones of the public load balancer) weren't reachable at all (even from inside the cluster), while Kubernetes reported everything as fine.
Here are my (raw) notes:
Running `kubectl get svc -A`, we can see that the public-nginx Ingress controller has an AWS LoadBalancer associated with a valid DNS name (which resolves with `dig`):

```
public-nginx-ingress   public-nginx-ingress-ingress-nginx-controller   LoadBalancer   172.20.240.59   k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com   80:31868/TCP,443:32267/TCP   38d
```
From inside the cluster (with `kubectl -n artifact-caching-proxy exec -ti artifact-caching-proxy-0 -- sh` for instance), we try to reach both the private and the public IP of the public Service LB from above:

```
curl -v 172.20.48.207 -o /dev/null
* Trying 172.20.48.207:80...
* Connected to 172.20.48.207 (172.20.48.207) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.20.48.207
> User-Agent: curl/7.83.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 27 Jan 2023 16:45:18 GMT
< Content-Type: text/html
< Content-Length: 1826
< Last-Modified: Mon, 23 Jan 2023 01:36:05 GMT
< Connection: keep-alive
< ETag: "63cde485-722"
< Accept-Ranges: bytes
<
* Connection #0 to host 172.20.48.207 left intact
```
```
curl -v k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com
* Trying 18.116.6.230:80...
# Stuck, need to wait 60s for timeout or issue a Ctrl-C cancellation
```
=> The private IP works as expected, but the public IP(s) of the LB do not answer: the issue is with the LB itself.
Checking the LB in the AWS UI (EC2 -> "Load Balancing" -> "Load Balancers"), selecting the LB and then, in the "Listeners" tab, clicking on the "Default routing rule" of the "TCP:80" line (for example): the list of "target groups" (i.e. the backend IPs of the LB) is empty, which confirms the observed behavior.
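The same check can be done from the CLI; a sketch, assuming the LB name is the first segment of the DNS name shown by `kubectl get svc` above (the ARNs have to be looked up first):

```bash
# Find the load balancer created for the public ingress Service
aws elbv2 describe-load-balancers --names k8s-publicng-publicng-f7332522a1 \
  --query 'LoadBalancers[].LoadBalancerArn'

# List its target groups, then check the registered targets: an empty target
# list matches the empty "target groups" seen in the UI
aws elbv2 describe-target-groups --load-balancer-arn <lb-arn>
aws elbv2 describe-target-health --target-group-arn <target-group-arn>
```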
This list of backend IPs is managed from Kubernetes, in particular by the "AWS Load Balancer Controller" that we installed in this cluster. The role of this component is to watch the Kubernetes API for "Service" resources of type "LoadBalancer" and to create/update/delete the corresponding load balancers through the AWS API.
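The controller tracks each reconciled Service through TargetGroupBinding objects (the resource kind that shows up in the error below), which makes them a useful thing to inspect when targets disappear. A minimal sketch; the binding name is the one reported in the controller log:

```bash
# Services of type LoadBalancer are the ones reconciled by the controller
kubectl get svc -A | grep LoadBalancer

# Each of them gets a TargetGroupBinding (CRD installed by the controller)
# mapping the Service endpoints to the AWS target group
kubectl get targetgroupbindings -A
kubectl -n public-nginx-ingress describe targetgroupbinding k8s-publicng-publicng-7482972d25
```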
Checking the logs of this component (`kubectl -n aws-load-balancer logs -l app.kubernetes.io/instance=aws-load-balancer-controller`) shows the error:

```
{"level":"error","ts":1674838128.6175954,"logger":"controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-publicng-publicng-7482972d25","namespace":"public-nginx-ingress","error":"expect exactly one securityGroup tagged with kubernetes.io/cluster/public-happy-polliwog for eni eni-0970f1ec0888c2d65, got: [sg-0c0d669a830f6e013 sg-0ca36e364f5491978] (clusterName: public-happy-polliwog)"}
```
The fix was to remove the tag kubernetes.io/cluster/public-happy-polliwog=true from the security group of the cluster itself (eks-cluster-sg-public-happy-polliwog-1884802038, usually the first one in the list) while keeping this tag on the SG public-happy-polliwog-node, because this 2nd SG is applied to the Kubernetes node VMs, which host the floating private IPs of the public Service LB.
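For the record, a sketch of the equivalent AWS CLI calls; the SG IDs come from the controller error above, but which of the two is the cluster SG (and the region) are assumptions to double-check:

```bash
# Find every security group carrying the cluster ownership tag:
# exactly one of them (the node SG) should keep it
aws ec2 describe-security-groups --region us-east-2 \
  --filters Name=tag-key,Values=kubernetes.io/cluster/public-happy-polliwog \
  --query 'SecurityGroups[].[GroupId,GroupName]'

# Remove the tag from the cluster security group (sg-... is illustrative:
# use whichever of the two SGs from the error is eks-cluster-sg-public-happy-polliwog-*)
aws ec2 delete-tags --region us-east-2 \
  --resources sg-0c0d669a830f6e013 \
  --tags Key=kubernetes.io/cluster/public-happy-polliwog
```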
=> After 5 min, the whole system is working again.
=> Next step: find how to avoid this duplicate "tagging" in the jenkins-infra/aws Terraform code.
New error: we cannot update the ACP (artifact-caching-proxy) statefulset:
```
Normal   NotTriggerScaleUp  3m58s (x32461 over 3d18h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
Warning  FailedScheduling   2m30s (x5365 over 3d18h)   default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict.
```
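For reference, these events can be pulled from the pending pod (the pod name follows the statefulset ordinal naming seen earlier):

```bash
# Scheduling events of the pending artifact-caching-proxy pod
kubectl -n artifact-caching-proxy describe pod artifact-caching-proxy-0

# Or list only the warnings for the whole namespace
kubectl -n artifact-caching-proxy get events --field-selector type=Warning
```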
=> The autoscaler pod's logs for this cluster show that autoscaling cannot be done:

```
I0207 10:58:37.369492 1 binder.go:791] "Could not get a CSINode object for the node" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" err="csinode.storage.k8s.io \"template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512\" not found"
I0207 10:58:37.369532 1 binder.go:811] "PersistentVolume and node mismatch for pod" PV="pvc-173ee3c5-22ec-4444-bee0-fe7b8ece01fa" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" pod="artifact-caching-proxy/artifact-caching-proxy-0" err="no matching NodeSelectorTerms"
I0207 10:58:37.369561 1 scale_up.go:300] Pod artifact-caching-proxy-0 can't be scheduled on eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0207 10:58:37.369819 1 scale_up.go:449] No pod can fit to eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44
I0207 10:58:37.369836 1 scale_up.go:453] No expansion options
```
It looks like https://github.com/kubernetes/autoscaler/issues/4811: the PVCs are bound to a single AZ, and the autoscaler fails to scale up nodes in the matching AZ, so it's stuck 🤦
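To confirm the mismatch, one can compare the zone pinned in the PersistentVolume's node affinity with the zones actually covered by the nodes; a quick sketch (the PV name is the one from the autoscaler log above):

```bash
# Zone(s) the PersistentVolume is pinned to through its node affinity
kubectl get pv pvc-173ee3c5-22ec-4444-bee0-fe7b8ece01fa \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'

# Zones currently covered by the nodes of the cluster
kubectl get nodes -L topology.kubernetes.io/zone
```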
https://github.com/jenkins-infra/aws/pull/333 was merged: we are watching the effect
Temporarily unblocking the kube management builds: https://github.com/jenkins-infra/kubernetes-management/commit/0288bb0748a85242f4bf1c126d121817d2cd1c1d (this commit will have to be reverted once repo.aws is fixed)
It seems we have found a working setup:
=> We have to update the autoscaler configuration to be highly available AND to take the topology into account (it is not by default: https://github.com/kubernetes/autoscaler/blob/9158196a3c06ed754fc4333ac67417e66a4ec274/charts/cluster-autoscaler/values.yaml#L180), on both cik8s and eks-public (see the sketch below).
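A minimal sketch of what that could look like with the cluster-autoscaler Helm chart; the release name, namespace and flags below are assumptions for illustration, the actual change lives in our jenkins-infra/kubernetes-management repository:

```bash
# Illustrative only: run two leader-elected autoscaler replicas and let the
# autoscaler balance similar per-AZ node groups, so it can scale up the group
# in the AZ required by the volume node affinity.
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace autoscaler \
  --set replicaCount=2 \
  --set "extraArgs.balance-similar-node-groups=true"
```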
Closing as the problem is now fixed \o/
This issue is related to a major bump of the Terraform EKS module that we use in https://github.com/jenkins-infra/aws to manage the two EKS clusters of our infrastructure (cik8s and eks-public). This issue is an audit trail of the whole operation.