nginxinc / nginx-service-mesh

A service mesh powered by NGINX Plus to manage container traffic in Kubernetes environments.
https://docs.nginx.com/nginx-service-mesh
Apache License 2.0

NSM 1.7.0/1.6.0 Fails To Deploy in EKS on t3.xlarge Instances #237

Closed malanmurphy closed 1 year ago

malanmurphy commented 1 year ago

**Describe the bug**

NSM fails to deploy in EKS clusters running t3.xlarge instance types. I've also tested t3a.xlarge with the same results; both families meet the minimum system requirements, but the t3a family runs on AMD rather than Intel. I've gone through the docs and there are no prep steps for EKS beyond persistent storage, which I've verified:

```
$ kc get storageclass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  20h
```

I've also changed `volumeBindingMode` to `Immediate` (see log details below) to rule that out, with the same result and the same errors in the logs.
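For reference, `volumeBindingMode` is immutable on an existing StorageClass, so testing `Immediate` meant recreating the gp2 class. Roughly this (a sketch only; the manifest fields are inferred from the `get storageclass` output above, so double-check against your own class before deleting anything):

```
# Back up the existing default class, then recreate it with Immediate binding
kubectl get storageclass gp2 -o yaml > gp2-backup.yaml
kubectl delete storageclass gp2

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF
```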

**To Reproduce**

I tested both Helm and nginx-meshctl with the same failed results. Here's my nginx-meshctl command:

```
$ ./nginx-meshctl deploy --enable-udp=false --deploy-grafana=false --enabled-namespaces=bookinfo --disable-auto-inject --mtls-trust-domain=nginx.mesh --nginx-log-format=json --mtls-mode=strict
```

...the Helm command:

```
$ helm install nsm nginx-stable/nginx-service-mesh --namespace nginx-mesh --create-namespace -f ./values-nsm.yaml --wait
```

...and the Helm values file:

```yaml
registry:
  server: "docker-registry.nginx.com/nsm"
accessControlMode: "allow"
environment: "kubernetes"
enableUDP: false
deployGrafana: false
nginxErrorLogLevel: "warn"
nginxLogFormat: "json"
nginxLBMethod: "least_time"
autoInjection:
  disable: true
  enabledNamespaces: [bookinfo, sock-shop]
mtls:
  mode: "strict"
  caTTL: "720h"
  svidTTL: "1h"
  trustDomain: "nginx.mesh"
  persistentStorage: "on"
  spireServerKeyManager: "disk"
  caKeyType: "ec-p256"
```

**Expected behavior**

NSM v1.7.0 deploys successfully.

**Your environment**

- Node group version: v1.25.6-eks-48e63af

**Additional context**

Here are logs and details from everything I could think to capture:

nginx-meshctl api logs:

```
$ kubectl logs pod/nginx-mesh-api-7b64b4798f-k2tx2 -n nginx-mesh
I0324 17:44:49.737437       1 main.go:36] =========================== Beginning startup process ===========================
W0324 17:44:49.738818       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "/run/spire/sockets/agent.sock",
  "ServerName": "localhost",
  "Attributes": {},
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory"
2023/03/24 17:44:49 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.
W0324 17:44:50.739713       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "/run/spire/sockets/agent.sock",
  "ServerName": "localhost",
  "Attributes": {},
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory"
2023/03/24 17:44:50 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.
```


helm install errors:

```
$ helm install nsm nginx-stable/nginx-service-mesh --namespace nginx-mesh --create-namespace --wait --debug
install.go:194: [debug] Original chart version: ""
install.go:211: [debug] CHART PATH: /Users/alan/Library/Caches/helm/repository/nginx-service-mesh-0.7.0.tgz

client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
install.go:168: [debug] Clearing discovery cache
wait.go:48: [debug] beginning wait for 7 resources with timeout of 1m0s
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 43 resource(s)
wait.go:48: [debug] beginning wait for 43 resources with timeout of 5m0s
ready.go:304: [debug] DaemonSet is not ready: nginx-mesh/spire-agent. 0 out of 2 expected pods are ready
ready.go:304: [debug] DaemonSet is not ready: nginx-mesh/spire-agent. 0 out of 2 expected pods are ready
Error: INSTALLATION FAILED: timed out waiting for the condition
helm.go:84: [debug] timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
        helm.sh/helm/v3/cmd/helm/install.go:141
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.6.1/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.6.1/command.go:1044
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.6.1/command.go:968
main.main
        helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_arm64.s:1172
```

```
$ kc logs pod/nats-server-f46dfc64-9z7cd -n nginx-mesh -c nginx-mesh-cert-reloader-init
2023/03/24 18:04:06 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.
2023/03/24 18:04:07 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.
2023/03/24 18:04:09 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.

$ kc logs pod/nats-server-f46dfc64-9z7cd -n nginx-mesh
Defaulted container "nginx-mesh-cert-reloader" out of: nginx-mesh-cert-reloader, nats-server, nginx-mesh-cert-reloader-init (init)
Error from server (BadRequest): container "nginx-mesh-cert-reloader" in pod "nats-server-f46dfc64-9z7cd" is waiting to start: PodInitializing

$ kc logs -f pod/nginx-mesh-api-7b64b4798f-8sl4c -n nginx-mesh
I0324 18:05:44.082189       1 main.go:36] =========================== Beginning startup process ===========================
W0324 18:05:44.083709       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "/run/spire/sockets/agent.sock",
  "ServerName": "localhost",
  "Attributes": {},
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory"
2023/03/24 18:05:44 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.
```

Random errors seen while working with `kubectl`:

```
$ kc get pods -n nginx-mesh
E0324 11:16:30.169294   77719 memcache.go:287] couldn't get resource list for nsm.nginx.com/v1alpha1: the server is currently unable to handle the request
E0324 11:16:30.213368   77719 memcache.go:287] couldn't get resource list for metrics.smi-spec.io/v1alpha1: the server is currently unable to handle the request
```

Info:

```
$ kc get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          Immediate           false                  10m

$ kc get pvc -n nginx-mesh
NAME                        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
spire-data-spire-server-0   Pending                                      gp2            4m9s

$ kc get pods -n nginx-mesh
NAME                                  READY   STATUS     RESTARTS       AGE
nats-server-f46dfc64-nbvxk            0/2     Init:0/1   2 (76s ago)    7m18s
nginx-mesh-api-7b64b4798f-xmp5r       0/1     Running    2 (107s ago)   7m18s
nginx-mesh-metrics-86489d7f94-p7lw2   0/1     Running    1 (2m17s ago)  7m18s
spire-agent-c26tx                     0/1     Init:0/1   0              7m18s
spire-agent-jjjx7                     0/1     Init:0/1   0              7m18s
spire-agent-npczf                     0/1     Init:0/1   0              7m18s
spire-server-0                        0/2     Pending    0              7m18s
```
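Since every error above points at the SPIRE agent and server logs, here's roughly how I pulled those as well (pod names are taken from the `get pods` output above; `--all-containers` avoids having to guess container names):

```
# One of the spire-agent DaemonSet pods (stuck in Init:0/1)
kubectl logs -n nginx-mesh spire-agent-c26tx --all-containers

# The spire-server pod (stuck in Pending, so there may be nothing to show yet)
kubectl logs -n nginx-mesh spire-server-0 --all-containers
```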
jbyers19 commented 1 year ago

Thanks for raising this issue, @malanmurphy. I believe the issue is unrelated to the instance type. In Kubernetes v1.23, EKS replaced the in-tree EBS storage plugin with the Amazon EBS CSI driver, so you now have to install the driver manually in order for the PVC to be provisioned.

If you run `kubectl -n nginx-mesh get pvc` you'll notice that the spire-data-spire-server PersistentVolumeClaim is stuck in a Pending state. To fix this, the CSI driver needs to be installed. Here are the docs on how to do this: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
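If it helps, the quickest path is usually the managed EKS add-on; something roughly like the following (cluster name and account ID are placeholders, and the exact flags may have changed, so please follow the AWS doc above rather than copying this verbatim):

```
# The driver's controller needs an IAM role; that requires an OIDC provider on the cluster
eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve

eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster my-cluster \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

# Install the driver as a managed add-on, pointing it at that role
eksctl create addon \
  --name aws-ebs-csi-driver \
  --cluster my-cluster \
  --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKS_EBS_CSI_DriverRole \
  --force
```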

Due to some account issues I wasn't able to verify this today using the t3.xlarge nodes, but I know we ran into this a while back when Kubernetes v1.23 became available on EKS, and installing the driver resolved it. Hopefully I'll be able to get my account working on Monday and verify this on my end. If you try this in the meantime and it's still failing, could you please run `kubectl -n nginx-mesh describe pvc` and see if there are any errors?

malanmurphy commented 1 year ago

Thanks for the quick response, @jbyers19! You are correct, the PVC is stuck in Pending:

```
$ kc describe pvc -n nginx-mesh
Name:          spire-data-spire-server-0
Namespace:     nginx-mesh
StorageClass:  gp2
Status:        Pending
Volume:
Labels:        app.kubernetes.io/name=spire-server
               app.kubernetes.io/part-of=nginx-service-mesh
Annotations:   volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
               volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       spire-server-0
Events:
  Type    Reason                Age                    From                         Message
  ----    ------                ----                   ----                         -------
  Normal  ExternalProvisioning  2m20s (x162 over 42m)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
```

AWS's new CSI requirements are a deal breaker in my test env, unfortunately, but this reminded me of the cardinal rule I'd forgotten: when in doubt, deploy without persistent storage. :) That worked as expected and everything is up and running. Thanks for the reminder!
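For anyone else who hits this, the workaround is just the same install with persistent storage switched off; roughly (the `mtls.persistentStorage` key is the `persistentStorage: "on"` line from my values file above):

```
helm install nsm nginx-stable/nginx-service-mesh \
  --namespace nginx-mesh --create-namespace \
  -f ./values-nsm.yaml \
  --set mtls.persistentStorage=off \
  --wait
```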

jbyers19 commented 1 year ago

Great! That's a bummer you aren't able to get the driver installed, but I'm glad you were able to find a workaround. I ran into AWS IAM issues and wasn't able to get the driver installed on my cluster either. However, our CI/CD pipeline configures the driver for us in the EKS cluster we test in, so I can confirm NSM deploys with the driver installed.

#246 was created to add a note about this to our docs. If there are no objections, I'll close this issue once that's merged. :)

malanmurphy commented 1 year ago

Thanks @jbyers19. No objections, but while I have you... :)

I'm still seeing this event over and over, even though I've deployed with mtls.persistentStorage=off:

```
$ kubectl get events -n nginx-mesh
LAST SEEN   TYPE     REASON                 OBJECT                                            MESSAGE
2m50s       Normal   ExternalProvisioning   persistentvolumeclaim/spire-data-spire-server-0   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
```

I've seen this event 2.2k times in the past 9 hours. Is it expected to keep trying even though I told NSM not to use persistentStorage?

jbyers19 commented 1 year ago

Let me do some digging and get back to you. :) It's getting later in the day for me though so it won't be until tomorrow.

jbyers19 commented 1 year ago

@malanmurphy, I installed the mesh with persistent storage disabled and didn't see any error events. The spire-data-spire-server-0 PVC wasn't created either. Based on that event, I wonder if the PVC from a previous NSM deployment didn't get cleaned up. If you delete it, those events should stop: `kubectl delete pvc spire-data-spire-server-0 -n nginx-mesh`

Here's the helm command I used, for reference:

```
helm install nsm nginx-stable/nginx-service-mesh --namespace nginx-mesh --create-namespace --set mtls.persistentStorage=off --wait
```

jbyers19 commented 1 year ago

FYI @malanmurphy, the docs have been updated to include a note regarding the CSI driver for EKS. Thanks again for bringing this to our attention! https://docs.nginx.com/nginx-service-mesh/get-started/kubernetes-platform/persistent-storage/

malanmurphy commented 1 year ago

That could be the issue with the PVC, @jbyers19. A helm delete isn't currently removing the CRDs; those have to be removed manually afterwards (I was going to file a docs bug to call that out better but haven't had a chance), so it's possible the same thing is happening with the original PVC. I'll give it a go tomorrow and let you know.

jbyers19 commented 1 year ago

While helm is great at adding things to your cluster, it's pretty bad at getting them back out of it. 😄 It looks like, in addition to CRDs, helm does not delete PVs or PVCs. If you didn't manually delete the nginx-mesh namespace, which helm doesn't delete for you either, then the PVC will still exist, which is why those events kept happening.

The docs could use some improvement, but the need to manually remove the CRDs when uninstalling with helm is called out there.

If you don't have the helm charts downloaded, here's an easier way of deleting the CRDs:

```
kubectl delete crd -l app.kubernetes.io/part-of==nginx-service-mesh
```
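Putting it all together, a full teardown that doesn't leave the PVC or CRDs behind looks roughly like this (release, namespace, and PVC names are the ones from this thread):

```
# Remove the release, then the leftovers helm doesn't clean up
helm uninstall nsm --namespace nginx-mesh

# CRDs are cluster-scoped, so they need an explicit delete
kubectl delete crd -l app.kubernetes.io/part-of==nginx-service-mesh

# Deleting the namespace also removes the leftover spire-data-spire-server-0 PVC
kubectl delete namespace nginx-mesh
```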
malanmurphy commented 1 year ago

Thanks for the docs update on the CSI driver for EKS, @jbyers19, and for the additional details. It may be good to update the Helm uninstall docs with the cleaner CRD removal as well; that's a very clean way to tidy up, IMO. I'm good if you want to close this issue.

jbyers19 commented 1 year ago

Right on. Thanks, @malanmurphy! I'll go ahead and close this issue and make a note to update the docs regarding the CRD removal.