jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

Bump the terraform module for AWS EKS (and consequences) #3305

Closed dduportal closed 1 year ago

dduportal commented 1 year ago

This issue is related to a major bump of the Terraform EKS module that we use in https://github.com/jenkins-infra/aws to manage the two EKS clusters in our infrastructure (cik8s and eks-public).

This issue is an audit trail after the whole operation.

dduportal commented 1 year ago

The Terraform code was updated through the "usual" PRs:

dduportal commented 1 year ago

Note that, with the 19.x changes, EKS clusters are now private by default.

hotfix for the cik8s cluster: https://github.com/jenkins-infra/aws/commit/27d4f746748edcdb3ba49643cae3d2d329fb3153

dduportal commented 1 year ago

Status: 2 new problems to fix:

dduportal commented 1 year ago
[10.0.0.38] - - [22/Dec/2022:14:47:42 +0000] "GET /.well-known/acme-challenge/<redacted> HTTP/1.1" 401 172 "http://repo.aws.jenkins.io/.well-known/acme-challenge/<redacted>" "cert-manager-challenges/v1.9.1 (linux/amd64) cert-manager/<redacted>" 377 0.000 [artifact-caching-proxy-artifact-caching-proxy-8080] - - - - <redacted>

It's weird: the /.well-known location should not require authentication, as per https://github.com/kubernetes/ingress-nginx/blob/f9cce5a4ed7ef372a18bc826e395ff5660b7a444/docs/user-guide/nginx-configuration/configmap.md#no-auth-locations

But since we define a custom configmap, that default might be overridden: https://github.com/jenkins-infra/kubernetes-management/blob/8c6d91f9a02048f3b9e8fb4a444106f5a08fcfe6/config/ext_public-nginx-ingress__common.yaml#L25-L36 🤔
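A plausible fix, sketched below under the assumption that the linked file feeds the chart's `controller.config` values (which generate the ingress-nginx configmap), is to re-state the `no-auth-locations` default explicitly, so supplying a custom configmap does not drop the unauthenticated ACME-challenge path:

```yaml
# Hypothetical excerpt for config/ext_public-nginx-ingress__common.yaml
# (not the actual jenkins-infra values file): keep the ingress-nginx
# default for no-auth-locations when overriding the configmap, so
# cert-manager HTTP-01 challenges are never sent a 401.
controller:
  config:
    no-auth-locations: "/.well-known/acme-challenge"
```

The `no-auth-locations` key and its default value come from the ingress-nginx configmap documentation linked above; the surrounding structure is an assumption about how the Helm values are laid out.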

smerle33 commented 1 year ago

Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf that will probably need to be cleaned up if not used by jenkins-infra

dduportal commented 1 year ago

For info: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2337

dduportal commented 1 year ago

Just discovered a second EKS cluster named jenkins-infra-eks-ENRZrfwf that will probably need to be cleaned up if not used by jenkins-infra

Checked during a team working session: this cluster is cik8s (used by ci.jenkins.io for its builds). We did not find any dangling resources.

dduportal commented 1 year ago

We had an issue with this cluster after the ingress rules were successfully updated with a valid certificate:

the public IPs (the 3 public IPs associated with the 3 network zones of the public load balancer) weren't reachable at all (even from inside the cluster), but Kubernetes reported everything as fine.

Here are my (raw) notes:

public-nginx-ingress     public-nginx-ingress-ingress-nginx-controller             LoadBalancer   172.20.240.59    k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com   80:31868/TCP,443:32267/TCP   38d
curl -v 172.20.48.207 -o /dev/null
*   Trying 172.20.48.207:80...
* Connected to 172.20.48.207 (172.20.48.207) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.20.48.207
> User-Agent: curl/7.83.1
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 27 Jan 2023 16:45:18 GMT
< Content-Type: text/html
< Content-Length: 1826
< Last-Modified: Mon, 23 Jan 2023 01:36:05 GMT
< Connection: keep-alive
< ETag: "63cde485-722"
< Accept-Ranges: bytes
< 
* Connection #0 to host 172.20.48.207 left intact
curl -v k8s-publicng-publicng-f7332522a1-59fde896b2eb752b.elb.us-east-2.amazonaws.com
*   Trying 18.116.6.230:80...

# Stuck, need to wait 60s for timeout or issue a Ctrl-C cancellation

=> The private IP works as expected, but the public IPs of the LB do not answer. This means the issue is with the LB itself.

{"level":"error","ts":1674838128.6175954,"logger":"controller.targetGroupBinding","msg":"Reconciler error","reconciler group":"elbv2.k8s.aws","reconciler kind":"TargetGroupBinding","name":"k8s-publicng-publicng-7482972d25","namespace":"public-nginx-ingress","error":"expect exactly one securityGroup tagged with kubernetes.io/cluster/public-happy-polliwog for eni eni-0970f1ec0888c2d65, got: [sg-0c0d669a830f6e013 sg-0ca36e364f5491978] (clusterName: public-happy-polliwog)"}

=> Next step: find how to avoid this duplicate "tagging" in the jenkins-infra/aws terraform code.

dduportal commented 1 year ago

Ref. https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/faq.md#i-received-an-error-expect-exactly-one-securitygroup-tagged-with-kubernetesioclustername-
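Per that FAQ entry, exactly one security group may carry the `kubernetes.io/cluster/<cluster-name>` tag, otherwise the AWS load balancer controller cannot pick the SG for an ENI. A minimal sketch of the remedy in terraform-aws-eks 19.x terms (hypothetical values, not the actual jenkins-infra/aws code) is to make sure any extra tags propagated to the node security group exclude the cluster-ownership tag:

```hcl
# Hypothetical sketch: the module already tags its primary security group
# with kubernetes.io/cluster/<cluster_name>, so that tag must NOT be
# repeated on the node security group.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name = "public-happy-polliwog"

  # Extra tags for the node security group: anything EXCEPT
  # kubernetes.io/cluster/public-happy-polliwog.
  node_security_group_tags = {
    associated_service = "eks/public"
  }
}
```

The `node_security_group_tags` input exists in the 19.x module; the tag values themselves are illustrative.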

dduportal commented 1 year ago

New error: we cannot update the ACP (artifact-caching-proxy) StatefulSet:

 Normal   NotTriggerScaleUp  3m58s (x32461 over 3d18h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
  Warning  FailedScheduling   2m30s (x5365 over 3d18h)   default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict.

=> the autoscaler pod's logs for this cluster show that autoscaling cannot be done:

I0207 10:58:37.369492       1 binder.go:791] "Could not get a CSINode object for the node" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" err="csinode.storage.k8s.io \"template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512\" not found"
I0207 10:58:37.369532       1 binder.go:811] "PersistentVolume and node mismatch for pod" PV="pvc-173ee3c5-22ec-4444-bee0-fe7b8ece01fa" node="template-node-for-eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44-2880955131433950512" pod="artifact-caching-proxy/artifact-caching-proxy-0" err="no matching NodeSelectorTerms"
I0207 10:58:37.369561       1 scale_up.go:300] Pod artifact-caching-proxy-0 can't be scheduled on eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0207 10:58:37.369819       1 scale_up.go:449] No pod can fit to eks-eks-public-linux-2022121918373236600000000e-1ac295d6-a031-bacd-9366-b618591cac44
I0207 10:58:37.369836       1 scale_up.go:453] No expansion options

It looks like https://github.com/kubernetes/autoscaler/issues/4811: the PVCs are zonal (each PV is bound to a single AZ), and the autoscaler fails to scale up nodes in the correct AZ, so it's stuck 🤦
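One common mitigation (a sketch, not necessarily the fix applied in jenkins-infra/aws; the StorageClass name is hypothetical) is to combine one node group per AZ with a StorageClass that uses `WaitForFirstConsumer`, so a volume is only provisioned once its pod has been scheduled, and the pod's existing PV keeps it in a zone the autoscaler can actually grow:

```yaml
# Hypothetical StorageClass: EBS volumes are zonal, so delaying binding
# until a pod is scheduled keeps pod and volume in the same AZ. Pairing
# this with one autoscaling group per AZ lets cluster-autoscaler add a
# node in the zone where an already-bound PV lives.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-topology-aware
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
```

`WaitForFirstConsumer` and the `ebs.csi.aws.com` provisioner are standard Kubernetes/EBS-CSI features; whether they match the cluster's actual storage setup is an assumption.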

dduportal commented 1 year ago

https://github.com/jenkins-infra/aws/pull/333 was merged: we are watching the effects.

dduportal commented 1 year ago

Temporarily unblocking the kubernetes-management builds: https://github.com/jenkins-infra/kubernetes-management/commit/0288bb0748a85242f4bf1c126d121817d2cd1c1d (this commit will have to be reverted once repo.aws is fixed)

dduportal commented 1 year ago

It seems we have found a working setup:

=> we have to:

dduportal commented 1 year ago

Closing as the problem is now fixed \o/