pulumi / pulumi-aws

An Amazon Web Services (AWS) Pulumi resource package, providing multi-language access to AWS
Apache License 2.0

NodeGroup set with customTimeout on update to 4h hits a timeout error of 20m #3442

Closed zbuchheit closed 7 months ago

zbuchheit commented 8 months ago

What happened?

When using `aws.eks.NodeGroup` with a custom `update` timeout set to `4h`, I hit a timeout error after 20 minutes:

```
waiting for EKS Node Group ... timeout while waiting for state to become 'Successful' (last state: 'InProgress', timeout: 20m0s)
```

Example

I'm not sure a repro is easily produced here, but I will attempt to provide one.
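A minimal sketch of what a repro might look like (names, ARNs, and subnet IDs below are illustrative placeholders, not from the affected user's program):

```typescript
import * as aws from "@pulumi/aws";

// Hypothetical minimal program: an EKS node group whose update timeout
// is raised to 4h via the customTimeouts resource option. Per this
// issue, updates still fail at 20m despite this setting.
const nodeGroup = new aws.eks.NodeGroup(
  "repro-nodegroup",
  {
    clusterName: "repro-cluster", // assumed pre-existing cluster
    nodeRoleArn: "arn:aws:iam::123456789012:role/repro-node-role",
    subnetIds: ["subnet-aaaa", "subnet-bbbb"],
    scalingConfig: { desiredSize: 1, maxSize: 2, minSize: 1 },
  },
  { customTimeouts: { update: "4h" } }
);
```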

Output of pulumi about

N/A

Additional context

It is unclear where this 20m0s comes from, as I believe the upstream provider sets this timeout to 1h by default.
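The `20m0s` spelling is how Go renders a `time.Duration`, which is what the upstream Terraform provider reports in these errors. As a hypothetical aid (not part of any Pulumi API) for comparing the configured value against the reported one, a small parser might look like:

```typescript
// Hypothetical helper: parse Go-style duration strings such as "4h",
// "90m", or "20m0s" (the format used in the provider's error messages)
// into a total number of seconds.
function parseGoDuration(s: string): number {
  const re = /(\d+(?:\.\d+)?)(h|m|s)/g;
  const unitSeconds: Record<string, number> = { h: 3600, m: 60, s: 1 };
  let total = 0;
  let matched = false;
  let match: RegExpExecArray | null;
  while ((match = re.exec(s)) !== null) {
    total += parseFloat(match[1]) * unitSeconds[match[2]];
    matched = true;
  }
  if (!matched) throw new Error(`unrecognized duration: ${s}`);
  return total;
}

// The configured timeout ("4h" = 14400s) is far larger than the one
// the provider reported ("20m0s" = 1200s), so the setting is clearly
// not reaching the upstream provider.
console.log(parseGoDuration("4h"));    // 14400
console.log(parseGoDuration("20m0s")); // 1200
```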

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

iwahbe commented 8 months ago

Hey @zbuchheit. Thanks for opening the issue. We can't do much without a repro, but please ping me on Slack if it becomes urgent.

@t0yv0 You have been paying attention to timeout-related issues. As a shot in the dark, does any obvious fix stand out here?

zbuchheit commented 8 months ago

I appreciate the quick response. I will see if I can get a repro going. I tried digging through the code and the upstream provider, but I am not entirely sure where the 20m is coming from. From what I saw on the upstream provider the default timeout was 60m, but admittedly I could be conflating something here.

zbuchheit commented 7 months ago

I was able to test some scenarios so far. I set the custom `update` timeout to `5s` and confirmed that it is respected:

```
waiting for EKS Node Group (zbuchheit-cluster-eksCluster-ece8aec:zbuchheit-nodegroup) version update (79d1faa4-1bd7-3f67-9381-da723ddd202a): timeout while waiting for state to become 'Successful' (last state: 'InProgress', timeout: 5s)
```

I also set the `update` timeout to `21m`, hoping it would trigger at 20m for some other reason, but got a 21m timeout as expected:

```
waiting for EKS Node Group (zbuchheit-cluster-eksCluster-e4babc5:zbuchheit-nodegroup) version update (8537b8d8-1b9f-3a49-8884-f362bcf9539f): timeout while waiting for state to become 'Successful' (last state: 'InProgress', timeout: 21m0s)
```

The user that reported the issue is using MLCs, so I am not sure if there is something there that could spark an idea of the source. I did confirm from their logs that their resource has `customTimeouts=update:"4h"` set.

t0yv0 commented 7 months ago

From the logs:

 2024-02-16 18:11:09 INFO     --> aws@v6.21.0 

But it doesn't list the pulumi-eks version, which could be interesting here as well.

It could be we are still affected by https://github.com/pulumi/pulumi-eks/issues/393 where the custom timeout doesn't make it to the AWS provider at all.

Or we could be affected by https://github.com/pulumi/pulumi-terraform-bridge/issues/1651 where AWS would ignore the custom timeout for update.

t0yv0 commented 7 months ago

When reproducing it would be great to match provider versions exactly with the customer's scenario.
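One way to match versions exactly is to pin the provider package in the program's `package.json` (6.21.0 being the version shown in the logs above):

```json
{
  "dependencies": {
    "@pulumi/aws": "6.21.0"
  }
}
```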

zbuchheit commented 7 months ago

I don't believe they are using pulumi-eks at all, just the `aws.eks.NodeGroup` resource from the pulumi-aws package. It is inside an MLC, though, similar to pulumi-eks.

zbuchheit commented 7 months ago

Ah, interesting: on 6.18.0 it worked, but on 6.21.0 I have an update timeout set to 21m and it hits a timeout at 20m. So it looks like a possible regression between 6.18.0 and 6.21.0.

iwahbe commented 7 months ago

@zbuchheit Are you able to share the code you used to find the regression?

theplatformer commented 7 months ago

If it helps, this is our code for an `aws.eks.NodeGroup` resource within an MLC component that has the same issue. The custom timeout is set directly on the resource, not passed into the component or set on a parent.

```typescript
private createNodeGroup = (): aws.eks.NodeGroup => {
    return new aws.eks.NodeGroup(
      this.args.nodeGroupName,
      {
        nodeGroupNamePrefix: `${this.args.nodeGroupName}-`,
        clusterName: this.args.clusterName,
        subnetIds: this.args.privateSubnetIds,

        nodeRoleArn: this.nodeGroupRole.arn,

        version: this.args.kubernetesVersion,

        amiType: "BOTTLEROCKET_x86_64",
        releaseVersion: this.args.bottlerocketVersion,

        diskSize: this.args.diskSizeGiB,
        instanceTypes: this.args.instanceTypes,

        capacityType: "ON_DEMAND",
        scalingConfig: {
          desiredSize: this.args.desiredSize,
          maxSize: this.args.maxSize,
          minSize: this.args.minSize,
        },

        updateConfig: {
          maxUnavailable: this.args.maxUnavailable,
        },

        tags: {
          ...this.args.tags,
        },
      },
      {
        parent: this,
        dependsOn: this.nodeGroupRole,
        ignoreChanges: ["scalingConfig.desiredSize"],
        customTimeouts: { update: "90m" },
      }
    );
  };
```

With 5.41.0 of @pulumi/aws in the provider's package.json it worked as expected. We needed to add the `customTimeouts: { update: "90m" }` option to support long-running rollouts when upgrading the `releaseVersion`, and we ran happily with that for a while.

Since we bumped to 6.23.0 last week, we have seen this same behaviour of a hidden 20m timeout, despite the custom timeout.

t0yv0 commented 7 months ago

Thanks so much for the details. I can reproduce and bisect.

  | version | result |
  |---------+--------|
  | master  | bad    |
  | 6.25.0  | ?      |
  | 6.24.2  | ?      |
  | 6.24.0  | ?      |
  | 6.23.0  | ?      |
  | 6.22.2  | ?      |
  | 6.22.1  | ?      |
  | 6.22.0  | bad    |
  | 6.21.0  | ?      |
  | 6.20.1  | ?      |
  | 6.19.0  | bad    |
  | 6.18.2  | good   |
  | 6.18.1  | good   |
  | 6.18.0  | good   |

This regression was introduced in https://github.com/pulumi/pulumi-aws/pull/3333, which upgraded pulumi-terraform-bridge from v3.71.0 to v3.72.0.

More specifically, this PR is the source of the regression: https://github.com/pulumi/pulumi-terraform-bridge/pull/1648

Reverting is coupled to other changes, but I will attempt to fix forward and will have an update here shortly.