eirikur-grid closed this issue 3 years ago.
@viveklak This sounds a lot like #1502 -- any thoughts here?
Assigned myself. Will take a look. Thanks for reporting!
Chiming in here: I recreated my EKS cluster and am facing the same issue.
Same versions (EKS cluster version 1.19), but using the Node.js Automation API. Updating to 1.20 had no effect.
I downgraded pulumi/kubernetes: same thing. I then upgraded again, and it worked. So I guess it is not "broken"; there is something else at play, rate limiting maybe?
Destroyed it again, same thing; still waiting for the one time it works. However, doing a refresh updates the status, and afterwards updates work again.
Had this happen again today. Saw very high CPU utilization from the 'pulumi-resource-kubernetes' process while waiting for the pulumi up command to finally time out.
I am seeing this on an Azure AKS cluster. It is making our Pulumi pipelines very unstable, as sometimes it works and sometimes it doesn't.
@lkt82 and @roderik - could you confirm if you were seeing this with pulumi-kubernetes 3.4.0?
@eirikur-grid (and others) - are there additional deployments in the namespace with the stuck deployments not controlled by Pulumi? Could you provide an estimate of how many such deployments/pods (not controlled by pulumi/same namespace) are there?
Does the problem reduce/go away if the Pulumi controlled deployments are put in a dedicated namespace?
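For anyone trying that isolation experiment, here is a minimal TypeScript sketch, assuming placeholder resource names and image (not taken from this thread): it creates a dedicated namespace and points a Pulumi-managed Deployment at it, keeping it separate from deployments Pulumi does not control.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Dedicated namespace for Pulumi-managed workloads (name is a placeholder).
const ns = new k8s.core.v1.Namespace("pulumi-apps", {
    metadata: { name: "pulumi-apps" },
});

// Deployment placed into the dedicated namespace; labels/image are placeholders.
const labels = { app: "my-app" };
const deployment = new k8s.apps.v1.Deployment("my-app", {
    metadata: { namespace: ns.metadata.name },
    spec: {
        replicas: 2,
        selector: { matchLabels: labels },
        template: {
            metadata: { labels },
            spec: { containers: [{ name: "my-app", image: "nginx:1.21" }] },
        },
    },
});
```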
@viveklak
Yes, we are using pulumi-kubernetes 3.4.0 with C#.
Here is a little output as well:
[1/2] Waiting for app ReplicaSet be marked available
[1/2] Waiting for app ReplicaSet be marked available (0/2 Pods available)
warning: [MinimumReplicasUnavailable] Deployment does not have minimum availability.
[1/2] Waiting for app ReplicaSet be marked available (1/2 Pods available)
error: 2 errors occurred:
I am also seeing some "Throttling request" messages for the control plane endpoint in the log.
@lkt82 Thanks. Actively looking into this. The throttling request may be a red herring, but if it helps you temporarily, you can add the new skipAwait flag to the Helm chart.
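For reference, a minimal sketch of that workaround in TypeScript, assuming a helm.v3.Chart and placeholder chart/repo values:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch: skipAwait tells the provider not to run its readiness await logic
// for the chart's resources, which sidesteps the hang described above.
const chart = new k8s.helm.v3.Chart("my-chart", {
    chart: "nginx",
    fetchOpts: { repo: "https://charts.bitnami.com/bitnami" },
    skipAwait: true, // the "new skipAwait flag" mentioned in this thread
});
```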
Actually, I do believe there is an impact from throttling requests. It looks like our change to fix #1502 increases the likelihood of throttles occurring (likely because of quorum reads when setting all our watches with a revision number?). Our await logic doesn't handle dropped watches and causes the tight CPU loop that @eirikur-grid saw. Probably what @lkt82 saw too.
We have another bug #1635 that makes partial failures hard to recover from.
In general, https://github.com/pulumi/pulumi-kubernetes/issues/1598 seems increasingly important to fix this the right way. @lblackstone and I will prioritize fixing that in short order. For the moment, we will revert #1596 and cut a hotfix to reduce the likelihood of this.
v3.4.1 is out with #1596 reverted. I am working on #1598 right now. In the meantime, could folks running into this issue try with 3.4.1 and report if things improve? Thanks for your patience.
@eirikur-grid (and others) - are there additional deployments in the namespace with the stuck deployments not controlled by Pulumi? Could you provide an estimate of how many such deployments/pods (not controlled by pulumi/same namespace) are there?
There are two of them; nginx-proxy and traefik-router, with 1 and 2 replicas respectively.
0 for me
@eirikur-grid @roderik thanks! Sounds good. I would expect things to get better with 3.4.1. Could you try it out and let us know?
@viveklak I just upgraded to pulumi-kubernetes 3.4.1 and attempted a deployment to our staging environment. Unfortunately, the issue is not resolved, at least not for me.
@viveklak Let me know if you'd like me to enable verbose logging/tracing of some sorts. I acknowledge that I would probably have to censor that for sensitive information before sending to you.
It seems that I am getting a better result. I will do some more testing in the coming days.
+ pulumi:pulumi:Stack ApplicationPlatform-prod creating I0627 06:49:30.473330 2261 request.go:655] Throttling request took 7.386626923s, request: GET:############/apis/scheduling.k8s.io/v1beta1?timeout=32s
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-aad-pod-identity/mic creating Retry #1; creation failed: no matches for kind "AzurePodIdentityException" in version "aadpodidentity.k8s.io/v1"
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azureidentitybindings.aadpodidentity.k8s.io created
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azurepodidentityexceptions.aadpodidentity.k8s.io created
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azureidentities.aadpodidentity.k8s.io created
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-azureassignedidentities.aadpodidentity.k8s.io created
+ kubernetes:core/v1:ServiceAccount cluster-aad-pod-identity/aad-pod-identity-mic created
+ kubernetes:core/v1:ServiceAccount cluster-aad-pod-identity/aad-pod-identity-nmi created
+ kubernetes:apps/v1:Deployment cluster-aad-pod-identity/aad-pod-identity-mic creating [1/2] Waiting for app ReplicaSet be marked available (1/2 Pods available)
+ kubernetes:apps/v1:Deployment cluster-aad-pod-identity/aad-pod-identity-mic creating Deployment initialization complete
+ kubernetes:apps/v1:Deployment cluster-aad-pod-identity/aad-pod-identity-mic created Deployment initialization complete
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-kube-system/aks-addon-exception creating
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-aad-pod-identity/mic creating
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-aad-pod-identity/mic created
+ kubernetes:apiextensions.k8s.io/v1:CustomResourceDefinition cluster-kube-system/aks-addon-exception created
@eirikur-grid I am curious how much load you are seeing on your API servers in general. Do you see frequent leader re-elections? We don't seem to handle throttled or prematurely closed watches well at the moment, so if you are consistently running into the high-CPU hang situation, that is pretty indicative of a closed watch in my experience. If so, you might have to use the skipAwait annotation for the moment. We are actively working on eliminating the use of low-level watches, which should definitely help with this class of problems (#1639), but that needs some more baking/testing.
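A minimal sketch of that annotation-based workaround, assuming a plain Deployment managed directly by Pulumi (names and image are placeholders):

```typescript
import * as k8s from "@pulumi/kubernetes";

// The pulumi.com/skipAwait annotation disables the provider's await logic
// for this resource, so `pulumi up` won't block on readiness checks.
const appLabels = { app: "my-app" };
const deployment = new k8s.apps.v1.Deployment("my-app", {
    metadata: {
        annotations: { "pulumi.com/skipAwait": "true" },
    },
    spec: {
        replicas: 2,
        selector: { matchLabels: appLabels },
        template: {
            metadata: { labels: appLabels },
            spec: { containers: [{ name: "my-app", image: "nginx:1.21" }] },
        },
    },
});
```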
@lkt82 Thanks. Please keep us posted on your experience.
@viveklak
Do you see frequent leader re-elections?
In the past week, I can see two spurts of LeaderElection events for our staging environment. The latter occurred on Friday at 14:34 UTC, roughly 8 hours before I tested pulumi-kubernetes 3.4.1 for that environment.
I've successfully deployed twice to our production cluster today using v3.4.1. There may be an improvement there over 3.4.0. While this is anecdotal, I have a feeling that I more frequently have issues deploying from home than from the office.
@eirikur-grid curious to see if 3.5.0 seems to unblock you?
I've performed two deployments to our staging cluster and both went smoothly. v3.5.0 is looking 👌 so far.
I can say it works for me as well. Deletion, however, is extremely slow on v3.5.0; it can take 10 minutes or more for the Helm resources to be deleted.
I've recently had issues with pulumi hanging (or timing out) when attempting to deploy changes. Our stack has 8 deployments. Some of them get updated, others fail. The number varies. Usually 2-4 are successfully updated.
On the kubernetes side, it appears as if the deployment has been successful.
Version info:
OS: macOS
Pulumi CLI: 3.5.1
Python: 3.7.3
Python package versions
k8s: 1.17 on AWS EKS
Here's the output after a failed deployment:
While I was waiting for the deployment to complete, I ran kubectl get deploy and kubectl get rs.