Closed zbuchheit closed 7 months ago
Hey @zbuchheit. Thanks for opening the issue. We can't do much without a repro but please ping me in Slack if it becomes urgent.
@t0yv0 You have been paying attention to timeout-related issues. As a shot in the dark, does any obvious fix stand out here?
I appreciate the quick response. I will see if I can get a repro going. I tried digging through the code and the upstream provider, but I am not entirely sure where the 20m is coming from. From what I saw on the upstream provider the default timeout was 60m, but admittedly I could be conflating something here.
I was able to test some scenarios so far. I set the custom timeout on an update to 5s and confirmed that I get:
* waiting for EKS Node Group (zbuchheit-cluster-eksCluster-ece8aec:zbuchheit-nodegroup) version update (79d1faa4-1bd7-3f67-9381-da723ddd202a): timeout while waiting for state to become 'Successful' (last state: 'InProgress', timeout: 5s)
I also set the update timeout to 21m, hoping it would trigger at 20m for some other reason, but I got a 21m timeout as expected.
* waiting for EKS Node Group (zbuchheit-cluster-eksCluster-e4babc5:zbuchheit-nodegroup) version update (8537b8d8-1b9f-3a49-8884-f362bcf9539f): timeout while waiting for state to become 'Successful' (last state: 'InProgress', timeout: 21m0s)
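For reference, these customTimeouts values use Go-style duration strings. A minimal sketch (plain TypeScript; this parser is hypothetical and only illustrates the format — the real parsing happens on the Go side of the bridge) of turning strings like "5s", "21m", or "4h" into milliseconds:

```typescript
// Hypothetical parser for Go-style duration strings such as "5s", "21m",
// "4h", or "20m0s". Illustrates the customTimeouts format only; not the
// bridge's actual implementation.
function parseDurationMs(s: string): number {
  const units: Record<string, number> = { s: 1000, m: 60_000, h: 3_600_000 };
  const re = /(\d+(?:\.\d+)?)(h|m|s)/g;
  let total = 0;
  let consumed = 0;
  let match: RegExpExecArray | null;
  while ((match = re.exec(s)) !== null) {
    total += parseFloat(match[1]) * units[match[2]];
    consumed += match[0].length;
  }
  if (consumed !== s.length || consumed === 0) {
    throw new Error(`invalid duration: ${s}`);
  }
  return total;
}

console.log(parseDurationMs("5s"));    // 5000
console.log(parseDurationMs("21m"));   // 1260000
console.log(parseDurationMs("4h"));    // 14400000
console.log(parseDurationMs("20m0s")); // 1200000
```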
The user who reported the issue is using MLCs (multi-language components), so I am not sure if there is something there that could point to the source.
I did confirm from their logs that their resource has customTimeouts=update:"4h" set.

From the logs:
2024-02-16 18:11:09 INFO --> aws@v6.21.0
But the logs don't list a pulumi-eks version, which could be interesting here as well.
It could be we are still affected by https://github.com/pulumi/pulumi-eks/issues/393 where the custom timeout doesn't make it to the AWS provider at all.
Or we could be affected by https://github.com/pulumi/pulumi-terraform-bridge/issues/1651 where AWS would ignore the custom timeout for update.
When reproducing it would be great to match provider versions exactly with the customer's scenario.
I don't believe they are using pulumi-eks at all, just the aws.eks.NodeGroup resource from the pulumi-aws package. It is inside an MLC, though, similar to pulumi-eks.
Ah, interesting: on 6.18.0 it worked, but on 6.21.0 I have an update timeout set to 21m and it hits a timeout at 20m. So it looks like a possible regression between 6.18.0 and 6.21.0.
@zbuchheit Are you able to share the code you used to find the regression?
If it helps, this is our code for an aws.eks.NodeGroup resource within an MLC component that has the same issue. The custom timeout is set directly on the resource, not passed in to the component or set on a parent.
private createNodeGroup = (): aws.eks.NodeGroup => {
  return new aws.eks.NodeGroup(
    this.args.nodeGroupName,
    {
      nodeGroupNamePrefix: `${this.args.nodeGroupName}-`,
      clusterName: this.args.clusterName,
      subnetIds: this.args.privateSubnetIds,
      nodeRoleArn: this.nodeGroupRole.arn,
      version: this.args.kubernetesVersion,
      amiType: "BOTTLEROCKET_x86_64",
      releaseVersion: this.args.bottlerocketVersion,
      diskSize: this.args.diskSizeGiB,
      instanceTypes: this.args.instanceTypes,
      capacityType: "ON_DEMAND",
      scalingConfig: {
        desiredSize: this.args.desiredSize,
        maxSize: this.args.maxSize,
        minSize: this.args.minSize,
      },
      updateConfig: {
        maxUnavailable: this.args.maxUnavailable,
      },
      tags: {
        ...this.args.tags,
      },
    },
    {
      parent: this,
      dependsOn: this.nodeGroupRole,
      ignoreChanges: ["scalingConfig.desiredSize"],
      customTimeouts: { update: "90m" },
    }
  );
};
With 5.41.0 of @pulumi/aws in the provider's package.json it worked as expected. We needed to add the customTimeouts: { update: "90m" } option to support long-running rollouts when upgrading the releaseVersion, and were running happily with that for a while. Since we bumped to 6.23.0 last week, we see this same behaviour of a hidden 20m timeout, despite the customTimeout.
Thanks so much for the details. I can reproduce and bisect.
| Version | Status |
|---------+--------|
| master  | bad    |
| 6.25.0  | ?      |
| 6.24.2  | ?      |
| 6.24.0  | ?      |
| 6.23.0  | ?      |
| 6.22.2  | ?      |
| 6.22.1  | ?      |
| 6.22.0  | bad    |
| 6.21.0  | ?      |
| 6.20.1  | ?      |
| 6.19.0  | bad    |
| 6.18.2  | good   |
| 6.18.1  | good   |
| 6.18.0  | good   |
This regression was introduced in https://github.com/pulumi/pulumi-aws/pull/3333, which upgraded pulumi-terraform-bridge from v3.71.0 to v3.72.0.
More specifically, this PR is the source of the regression: https://github.com/pulumi/pulumi-terraform-bridge/pull/1648
Reverting is coupled to other changes, but I will attempt to fix forward and will post an update here shortly.
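The failure mode can be pictured as a precedence problem: the user's customTimeouts value should override the provider's schema default, which in turn should override any SDK fallback. A hypothetical sketch of that precedence (the function name and the 20m fallback constant are assumptions for illustration, not the bridge's actual code):

```typescript
// Hypothetical illustration of update-timeout precedence. The regression
// effectively behaved as if customMs were always undefined, so the fallback
// (20m here, matching the 20m0s seen in the error message) applied even when
// the user set customTimeouts: { update: "4h" }.
const SDK_FALLBACK_MS = 20 * 60 * 1000;

function effectiveUpdateTimeoutMs(
  customMs?: number,       // from the resource's customTimeouts option
  schemaDefaultMs?: number // from the upstream provider's resource schema
): number {
  return customMs ?? schemaDefaultMs ?? SDK_FALLBACK_MS;
}

// Expected behaviour: the user's 4h wins over a 1h schema default.
console.log(effectiveUpdateTimeoutMs(4 * 3_600_000, 3_600_000)); // 14400000
// Buggy behaviour is as if the custom value were dropped entirely:
console.log(effectiveUpdateTimeoutMs(undefined, undefined)); // 1200000
```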
What happened?
When using aws.eks.NodeGroup with a customTimeout on update set to 4h, I hit a timeout error:
waiting for EKS Node Group ... timeout while waiting for state to become 'Successful' (last state: 'InProgress', timeout: 20m0s)
Example
I'm not sure a repro is easily produced here, but I will attempt to provide one.
Output of pulumi about
N/A
Additional context
It is confusing where this 20m0s comes from, as I believe the upstream provider sets this timeout to 1h by default.
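When diagnosing reports like this, the timeout the engine actually applied can be read straight out of the error message. A small sketch (a hypothetical diagnostic helper, not part of any Pulumi SDK) that extracts it:

```typescript
// Pull the applied timeout (e.g. "20m0s") out of a wait-error message.
// Hypothetical helper for triaging timeout reports.
function appliedTimeout(errorMessage: string): string | undefined {
  const m = /timeout:\s*([0-9hms.]+)/.exec(errorMessage);
  return m?.[1];
}

const msg =
  "waiting for EKS Node Group ... timeout while waiting for state to " +
  "become 'Successful' (last state: 'InProgress', timeout: 20m0s)";
console.log(appliedTimeout(msg)); // "20m0s"
```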
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).