Closed: malanmurphy closed this issue 1 year ago
Thanks for raising this issue, @malanmurphy. I believe the issue is unrelated to the instance type. In v1.23, EKS replaced the in-tree Kubernetes storage plugins with the Amazon EBS CSI driver, so you now have to manually install the driver in order for the PVC to be provisioned.
If you run kubectl -n nginx-mesh get pvc, you'll notice that the spire-data-spire-server PersistentVolumeClaim is stuck in a Pending state. To fix this, the CSI driver needs to be installed. Here are the docs on how to do this: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
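For reference, here's a rough sketch of the add-on based install from those AWS docs. This is only an outline, not verified against this cluster; "my-cluster", the account ID, and the role name are placeholders you'd replace with your own values:

```shell
# Sketch only -- follows the AWS EKS docs linked above.
# "my-cluster", the account ID, and the role name are placeholders.

# 1. The driver needs IAM permissions; create a role for its service account.
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster my-cluster \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

# 2. Install the EBS CSI driver as an EKS managed add-on.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKS_EBS_CSI_DriverRole

# 3. Once the driver pods are running, the Pending PVC should bind.
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
```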
Due to some account issues I wasn't able to verify this today using the t3.xlarge nodes, but I know we ran into this a while back when k8s v1.23 became available on EKS, and installing the driver resolved it. Hopefully I'll be able to get my account working on Monday and verify this on my end. If you try this in the meantime and it's still failing, could you please run kubectl -n nginx-mesh describe pvc and see if there are any errors in there?
Thanks for the quick response @jbyers19! You are correct, the PVC is Pending:
$ kc describe pvc -n nginx-mesh
Name:          spire-data-spire-server-0
Namespace:     nginx-mesh
StorageClass:  gp2
Status:        Pending
Volume:
Labels:        app.kubernetes.io/name=spire-server
               app.kubernetes.io/part-of=nginx-service-mesh
Annotations:   volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
               volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       spire-server-0
Events:
  Type    Reason                Age                    From                         Message
  ----    ------                ----                   ----                         -------
  Normal  ExternalProvisioning  2m20s (x162 over 42m)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
AWS' new CSI reqs are a deal breaker in my test env, unfortunately, but it reminded me that I'd forgotten the cardinal rule: When in doubt, deploy w/o persistent storage. :) That worked as expected and everything is up and running. Thanks for the reminder!
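For anyone else landing here, the no-persistent-storage workaround looks roughly like this. The nginx-meshctl flag name below is my best recollection of the NSM CLI, so verify it against `nginx-meshctl deploy --help` before relying on it:

```shell
# Deploy NSM without a PVC so the EBS CSI driver isn't required.
# Flag name is an assumption -- confirm with `nginx-meshctl deploy --help`.
nginx-meshctl deploy --persistent-storage off
```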
Great! That's a bummer you aren't able to get the driver installed, but I'm glad you were able to find a workaround. I ran into AWS IAM issues and wasn't able to get the driver installed on my cluster either. However, our CI/CD pipeline configures the driver for us in the EKS cluster we test in, so I can confirm NSM deploys with the driver installed.
Thanks @jbyers19. No objections, but while I have you... :)
I'm still seeing this event over and over, even though I've deployed with mtls.persistentStorage=off:
$ kubectl get events -n nginx-mesh
LAST SEEN   TYPE     REASON                 OBJECT                                            MESSAGE
2m50s       Normal   ExternalProvisioning   persistentvolumeclaim/spire-data-spire-server-0   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
I've seen this event 2.2k times in the past 9 hours. Is it expected to keep trying even though I told NSM not to use persistentStorage?
Let me do some digging and get back to you. :) It's getting later in the day for me though so it won't be until tomorrow.
@malanmurphy, I installed the mesh with persistent storage disabled and didn't see any error events. The spire-data-spire-server-0 PVC wasn't created either. Based on that event, I wonder if the PVC from a previous NSM deployment didn't get cleaned up. If you delete it, those events should stop:

kubectl delete pvc spire-data-spire-server-0 -n nginx-mesh
Here's the helm command I used for reference:

helm install nsm nginx-stable/nginx-service-mesh --namespace nginx-mesh --create-namespace --set mtls.persistentStorage=off --wait
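To double-check on your side, the spire PVC simply shouldn't exist after a persistent-storage-off install:

```shell
# With mtls.persistentStorage=off, no spire-data PVC should be created.
kubectl get pvc -n nginx-mesh
# Expect "No resources found in nginx-mesh namespace." -- if
# spire-data-spire-server-0 is still listed, it's left over from an
# earlier install and can be deleted.
```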
FYI @malanmurphy, the docs have been updated to include a note regarding the CSI driver for EKS. Thanks again for bringing this to our attention! https://docs.nginx.com/nginx-service-mesh/get-started/kubernetes-platform/persistent-storage/
That could be the issue with the PVC, @jbyers19. A helm delete isn't currently removing the CRDs; those have to be removed manually after the fact (I was going to file a bug against the docs to better call that out but haven't had a chance), so it's possible the same thing is happening with the original PVC. I'll give it a go tomorrow and let you know.
While helm is great at adding things to your cluster, it's pretty bad at getting them back out of it. 😄 It looks like, in addition to CRDs, helm does not delete PVs or PVCs. If you didn't manually delete the nginx-mesh namespace, which helm doesn't delete for you either, then the PVC will still exist, which is why those events kept happening.
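Putting the pieces from this thread together, a full manual cleanup would look something like the following sketch (release name `nsm` assumed from the install command above):

```shell
# helm removes most resources but leaves CRDs, PVCs, and the namespace behind.
helm uninstall nsm --namespace nginx-mesh

# See what's left:
kubectl get crd -l app.kubernetes.io/part-of=nginx-service-mesh
kubectl get pvc -n nginx-mesh

# Remove the leftovers:
kubectl delete crd -l app.kubernetes.io/part-of=nginx-service-mesh
kubectl delete pvc spire-data-spire-server-0 -n nginx-mesh
kubectl delete namespace nginx-mesh
```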
The docs could use some improvement, but the need to manually remove the CRDs when using helm is called out there.
If you don't have the helm charts downloaded, here's an easier way of deleting the CRDs:
kubectl delete crd -l app.kubernetes.io/part-of==nginx-service-mesh
Thanks for the docs update on the CSI driver for EKS, @jbyers19, and for the additional details. It may be good to update the Helm uninstall docs with that label-based CRD removal as well; it's a much cleaner way to clean up, IMO. I'm good if you want to close this issue.
Right on. Thanks, @malanmurphy! I'll go ahead and close this issue and make a note to update the docs regarding the CRD removal.
Describe the bug
NSM fails to deploy in EKS clusters running t3.xlarge instance types. I've also tested t3a.xlarge with the same results; both families meet the minimum system requirements, but t3a instances run on AMD rather than Intel CPUs. I've gone through the docs and there are no prep steps for EKS beyond persistent storage, which I've verified:
$ kc get storageclass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  20h
I've also changed volumeBindingMode to Immediate (see log details below) to rule that issue out: same result, same errors in logs, etc.

To Reproduce
Tested with both Helm and nginx-meshctl, with the same failed results. Here's my nginx-meshctl command:

$ ./nginx-meshctl deploy --enable-udp=false --deploy-grafana=false --enabled-namespaces=bookinfo --disable-auto-inject --mtls-trust-domain=nginx.mesh --nginx-log-format=json --mtls-mode=strict
--Helm--
$ helm install nsm nginx-stable/nginx-service-mesh --namespace nginx-mesh --create-namespace -f ./values-nsm.yaml --wait
...and Helm values file:
registry:
  server: "docker-registry.nginx.com/nsm"
accessControlMode: "allow"
environment: "kubernetes"
enableUDP: false
deployGrafana: false
nginxErrorLogLevel: "warn"
nginxLogFormat: "json"
nginxLBMethod: "least_time"
autoInjection:
  disable: true
  enabledNamespaces: [bookinfo, sock-shop]
mtls:
  mode: "strict"
  caTTL: "720h"
  svidTTL: "1h"
  trustDomain: "nginx.mesh"
  persistentStorage: "on"
  spireServerKeyManager: "disk"
  caKeyType: "ec-p256"
Expected behavior
NSM v1.7.0 is expected to deploy.
Your environment
node groups version: v1.25.6-eks-48e63af
Additional context
Here are logs and details from everything I could think to capture:
nginx-meshctl api logs
$ kubectl logs pod/nginx-mesh-api-7b64b4798f-k2tx2 -n nginx-mesh
I0324 17:44:49.737437 1 main.go:36] Beginning startup process
W0324 17:44:49.738818 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to { "Addr": "/run/spire/sockets/agent.sock", "ServerName": "localhost", "Attributes": {}, "BalancerAttributes": null, "Type": 0, "Metadata": null }. Err: connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory"
2023/03/24 17:44:49 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.
W0324 17:44:50.739713 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to { "Addr": "/run/spire/sockets/agent.sock", "ServerName": "localhost", "Attributes": {}, "BalancerAttributes": null, "Type": 0, "Metadata": null }. Err: connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory"
2023/03/24 17:44:50 X509SVIDClient cannot connect to the Spire agent: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/spire/sockets/agent.sock: connect: no such file or directory". For more information check the logs of the Spire agents and server.
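The missing agent.sock in the errors above points at the Spire pods themselves never becoming ready; a quick sketch of the commands to confirm that from the cluster side:

```shell
# If agent.sock is missing, spire-server/spire-agent likely never started.
kubectl get pods -n nginx-mesh
# spire-server-0 stuck in Pending should show the unbound-PVC event here:
kubectl describe pod spire-server-0 -n nginx-mesh
```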
helm install errors:
$ helm install nsm nginx-stable/nginx-service-mesh --namespace nginx-mesh --create-namespace --wait --debug
install.go:194: [debug] Original chart version: ""
install.go:211: [debug] CHART PATH: /Users/alan/Library/Caches/helm/repository/nginx-service-mesh-0.7.0.tgz

client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 1 resource(s)
install.go:168: [debug] Clearing discovery cache
wait.go:48: [debug] beginning wait for 7 resources with timeout of 1m0s
client.go:133: [debug] creating 1 resource(s)
client.go:133: [debug] creating 43 resource(s)
wait.go:48: [debug] beginning wait for 43 resources with timeout of 5m0s
ready.go:304: [debug] DaemonSet is not ready: nginx-mesh/spire-agent. 0 out of 2 expected pods are ready