pulumi / pulumi-awsx

AWS infrastructure best practices in component form!
https://www.pulumi.com/docs/guides/crosswalk/aws/
Apache License 2.0
222 stars 104 forks

Fargate service updating failing with timeout #354

Open Ilchuk-Mihail opened 5 years ago

Ilchuk-Mihail commented 5 years ago

I am using the following code to deploy a container (certificate and Route 53 setup omitted for simplicity):

const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });
const tg = nlb.createTargetGroup("some-tg", { port: 3000 });
const listener = tg.createListener("some-http-listener", { port: 80 });
const listener2 = tg.createListener("https-listener", {
    certificateArn: certificate.arn,
    loadBalancerArn: nlb.arn,
    port: 443,
    protocol: "TLS",
    sslPolicy: "ELBSecurityPolicy-2016-08",
});

let service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});

It deploys correctly, but if I run it again, it gets stuck updating the task definition and throws a timeout error:

     ├─ awsx:x:ecs:FargateTaskDefinition  
 +-  │  └─ aws:ecs:TaskDefinition         replaced                [diff: ~containerDefinitions]
     └─ awsx:x:ecs:FargateService         
 ~      └─ aws:ecs:Service                **updating failed**     [diff: ~taskDefinition]; 1 error

 error: Plan apply failed: 1 error occurred:
        * updating urn:pulumi:dev::drjones::awsx:x:ecs:FargateService$aws:ecs/service:Service::service: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)

It always crashes while updating the task definition. Afterwards I have one extra running task rather than a replacement of the existing one, yet the changes I made in the app code are successfully applied. So I can see my container changes, but the script fails and I end up with two tasks instead of one.

After some time this second task is stopped on AWS, but on the next pulumi up run the script tries to delete the task from the previous deployment.

It's a weird situation that affects consistency and makes it impossible to update correctly. In any case, it looks like it processes for up to 10 minutes and then fails.

Blitz2145 commented 5 years ago

Also ran into this with the following similar error message:

pulumi:pulumi:Stack (blah-api-qa):
    error: update failed

  aws:ecs:Service (qa-blah):
    error: Plan apply failed: 1 error occurred:
        * updating urn:pulumi:qa::blah-api::awsx:x:ecs:FargateService$aws:ecs/service:Service::qa-blah: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)
    error: post-step event returned an error: failed to save snapshot: [403] The provided update token has expired.

Next pulumi up run tries to delete task def. from previous deployment. And fails with the following:

error: the current deployment has 1 resource(s) with pending operations:
  * urn:pulumi:qa::blah-api::awsx:x:ecs:FargateService$aws:ecs/service:Service::qa-blah, interrupted while updating

These resources are in an unknown state because the Pulumi CLI was interrupted while
waiting for changes to these resources to complete. You should confirm whether or not the
operations listed completed successfully by checking the state of the appropriate provider.
For example, if you are using AWS, you can confirm using the AWS Console.

Once you have confirmed the status of the interrupted operations, you can repair your stack
using 'pulumi stack export' to export your stack to a file. For each operation that succeeded,
remove that operation from the "pending_operations" section of the file. Once this is complete,
use 'pulumi stack import' to import the repaired stack.
Blitz2145 commented 5 years ago

I think this particular timeout occurs when the fargate service cannot reach steady state, because the new task fails to startup, at least that was happening here because the container was erroring on boot.

evereq commented 5 years ago

Looks to be similar to https://github.com/pulumi/pulumi-awsx/issues/300.

We also constantly run into such issues (in https://github.com/ever-co/gauzy-pulumi), and I wonder if there is any way to increase the timeout?

kingwill101 commented 4 years ago

I've been having a similar issue lately.

Updating (ef):

     Type                                 Name      Status                  Info
     pulumi:pulumi:Stack                  infra-ef  **failed**              1 error
     ├─ awsx:x:ecs:FargateTaskDefinition  app-svc2                          1 warning
     └─ awsx:x:ecs:FargateService         app-svc2
 ~      └─ aws:ecs:Service                app-svc2  **updating failed**     1 error

Diagnostics:
  awsx:x:ecs:FargateTaskDefinition (app-svc2):

  aws:ecs:Service (app-svc2):
    error: Plan apply failed: 1 error occurred:
        * updating urn:pulumi:ef::infra::awsx:x:ecs:FargateService$aws:ecs/service:Service::app-svc2: timeout while waiting for state to become 'true' (last state: 'false', timeout: 10m0s)

  pulumi:pulumi:Stack (infra-ef):
    error: update failed

Resources:
    37 unchanged

Duration: 10m45s

Is there a solution to this issue? Running it again doesn't always work.

nachomazzara commented 4 years ago

Still having this issue!

lukehoban commented 4 years ago

We likely should increase the timeout defaults here. The default settings for other properties make it reasonably common that updates to the Fargate Service take more than 10 minutes due both to Fargate launch times and defaults for draining the old service during the update. In fact, in my experience it is very common for this process to take just slightly over 10 minutes in many cases - which makes the current 10 minute default timeout particularly frustrating.

duro commented 4 years ago

This definitely needs to be addressed. I am getting this failure 4 out of 5 times. Every time, the cluster actually does reach the correct state; it's only Pulumi that fails. I even tried adding this to the resource options:

customTimeouts: { update: '20m' }

But it has no effect.

lukehoban commented 4 years ago

The simplest workaround right now is to pass waitForSteadyState: false to the FargateService. This tells Pulumi not to wait for ECS to mark the Service as "ready" in the newly deployed state. It does mean that other operations will continue even before the new service is serving traffic, but it avoids hitting this timeout.
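As a minimal sketch, here is what that workaround could look like applied to the setup from the original report (this assumes the classic awsx API used in this thread, and that a Dockerfile exists at the `../` path):

```typescript
import * as awsx from "@pulumi/awsx";

// Load balancer / target group / listener as in the original report.
const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });
const tg = nlb.createTargetGroup("some-tg", { port: 3000 });
const listener = tg.createListener("some-http-listener", { port: 80 });

const service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    // Don't block the deployment waiting for ECS to reach a steady state,
    // which avoids the hardcoded 10m timeout described above.
    waitForSteadyState: false,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});
```

The tradeoff is that `pulumi up` will report success as soon as the update is submitted, so a task that crashes on boot will no longer surface as a deployment failure.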

The 10m timeout here is hardcoded currently, and will need to be fixed upstream.

Unfortunately, it is very common for FargateServices backed by a load balancer to take just slightly over 10 minutes to update, due to the default health checks and draining for NLB + Fargate.

For example, I reproduced the issue with the following timing of events from ECS, showing that the update did actually complete in 10m47s even though the Pulumi deployment timed out at 10m. Note that there are two almost exactly 5-minute delays: one after starting the new task, where the NLB is waiting to mark the task as healthy, and another after the new task is ready, while the old task is allowed to drain.

2020-06-25 17:41:09 -0700 service some-app-fef01c4 has reached a steady state.
2020-06-25 17:40:51 -0700 service some-app-fef01c4 has stopped 1 running tasks: task 5d2e29df-80d7-4456-ac79-59d134cd9c90.
2020-06-25 17:35:46 -0700 service some-app-fef01c4 has begun draining connections on 1 tasks.
2020-06-25 17:35:46 -0700 service some-app-fef01c4 deregistered 1 targets in target-group some-tg-94819bd
2020-06-25 17:31:07 -0700 service some-app-fef01c4 registered 1 targets in target-group some-tg-94819bd
2020-06-25 17:30:22 -0700 service some-app-fef01c4 has started 1 tasks: task be8bcdec-e2d6-4fb4-a0d2-9b4d9ab80316.

Because of this, other solutions are to pass deregistrationDelay: 0 (or a number smaller than the default 300) on the Target, and/or to set healthCheck on the Target and healthCheckGracePeriodSeconds on the FargateService so it can register as healthy more quickly (if possible for your container). These are more complex, but have the benefit of actually speeding up the update.
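A sketch of that faster-draining setup, again based on the original report's resources (the health check values here are illustrative assumptions, not recommendations, and NLB health checks only accept a limited set of intervals):

```typescript
import * as awsx from "@pulumi/awsx";

const nlb = new awsx.elasticloadbalancingv2.NetworkLoadBalancer("some-nlb", { external: true });

const tg = nlb.createTargetGroup("some-tg", {
    port: 3000,
    // Drop the default 300s connection-draining delay so old tasks are
    // deregistered immediately instead of waiting 5 minutes.
    deregistrationDelay: 0,
    // Tighter health checks (illustrative values) so new tasks are marked
    // healthy sooner.
    healthCheck: {
        interval: 10,
        healthyThreshold: 2,
        unhealthyThreshold: 2,
    },
});
const listener = tg.createListener("some-http-listener", { port: 80 });

const service = new awsx.ecs.FargateService("some-app", {
    desiredCount: 1,
    // Give new tasks a short grace period before ECS acts on health checks.
    healthCheckGracePeriodSeconds: 30,
    taskDefinitionArgs: {
        containers: {
            myapp: {
                image: awsx.ecs.Image.fromPath("app", "../"),
                memory: 512,
                portMappings: [listener],
            },
        },
    },
});
```

With both 5-minute delays removed, the update in the timeline above would fit well inside the 10-minute timeout.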

lukehoban commented 4 years ago

This should be fixed with https://github.com/pulumi/terraform-provider-aws/pull/59 which was released as part of @pulumi/aws 2.11.0. Upgrading to that version should avoid hitting any timeouts here under normal conditions.

duro commented 4 years ago

@lukehoban just a heads up: I upgraded, and I am still getting the error. The only difference is that the timeout duration is longer:

timeout while waiting for state to become 'true' (last state: 'false', timeout: 15m0s)

Here is my resource def:

new awsx.ecs.FargateService(
  `backend`,
  {
    cluster: cluster,
    securityGroups: [serviceSG.id],
    taskDefinitionArgs: {
      taskRole: apiRole,
      containers: {
        api: {
          image: image,
          memory: 1024,
          portMappings: [listener],
          command: ['yarn', 'start:prod'],
          environment: [
            {
              name: 'APP_ENV',
              value: mapAppEnv(ENV),
            },
          ],
        },
      },
    },
    desiredCount: 1,
  },
  { dependsOn: [serviceSG, runMigrations] },
)
lukehoban commented 4 years ago

@duro I assume this is a case where the Fargate Service does ultimately succeed in its update (even though Pulumi times out)? Could you share a log from your ECS console of the events that happened during the update? Greater than 15 minutes for an ECS Service update is surprising under any conditions I'm aware of, so I'd love to better understand so we can identify the right default here.

duro commented 4 years ago

@lukehoban I'm a little unclear on which logs would be most valuable to get the right picture.

lukehoban commented 4 years ago

which logs would be most valuable to get the right picture

From the ECS console, on the Service page, the "Events" tab. Just the section of logs from when the Service started updating to when it reached a new "steady state" after the update. This should be logs that look like those copied into https://github.com/pulumi/pulumi-awsx/issues/354#issuecomment-649981532.

(screenshot of the ECS Service "Events" tab)

duro commented 4 years ago

@lukehoban so now I'm having a hard time replicating this. Deploys seem to be completing inside the 15 minutes. I will post back here once (if) I see the issue again.

duro commented 4 years ago

@lukehoban Ok, it's returned as an issue. Here are the logs from the Fargate service events tab:

2020-06-30 12:54:52 -0700 service backend-4095c50 has reached a steady state.
2020-06-30 12:54:30 -0700 service backend-4095c50 has stopped 1 running tasks: task b482b992-b073-4575-a504-941a5fc95b73.
2020-06-30 12:52:27 -0700 service backend-4095c50 has begun draining connections on 1 tasks.
2020-06-30 12:52:27 -0700 service backend-4095c50 deregistered 1 targets in target-group slynx-api-50ed6e4
2020-06-30 12:51:44 -0700 service backend-4095c50 registered 1 targets in target-group slynx-api-50ed6e4
2020-06-30 12:49:55 -0700 service backend-4095c50 has started 1 tasks: task 7784ffbd-007a-40e7-92ea-36ef074c1ba8.

The odd thing I'm seeing is that the time from start to steady state appears to be less than 10 minutes.

duro commented 4 years ago

@lukehoban have you had a chance to look this over? Is there anything else I can send you to help investigate? I am still getting this error consistently.

lukehoban commented 4 years ago

@duro Sorry for the delay. The snippet you shared seems to show a case where the service fully completed the update in under 5 minutes. Are you sure that was an update that timed out after 15 minutes in Pulumi?

We can definitely consider bumping this higher than 15 minutes by default, but notably there is a user experience tradeoff where this may mean that certain failures for the service to become healthy take a longer time to cause the Service update to fail. So I'd love to better understand what leads to these cases where the update takes more than 15 minutes. If you have any more details you can share that would be helpful.

duro commented 4 years ago

@lukehoban I am certain that in the case I sent you those were the logs from a run which failed due to timeout.

That said, 4 out of the last 5 runs have succeeded, but we still occasionally get these timeout related errors.

Is there anything in Pulumi that could be factoring dependencies of the service into this time?

For example, our Fargate service has a dependency on a custom resource we created that runs DB migrations via a Fargate "task". Could it be that the time that task takes is also being counted by Pulumi as part of the 15 minutes?

jgrocott commented 4 years ago

@lukehoban I believe I'm encountering this same issue; however, for me it was triggered by adding a taskRole to my taskDefinitionArgs.

this.fargateService = new awsx.ecs.FargateService(`${name}-app`, {
    cluster: this.cluster,
    assignPublicIp: false,
    subnets: args.appSubnetIds,
    securityGroups: appSecGroup,
    desiredCount: (args.appResources || {}).desiredCount,

    taskDefinitionArgs: {
        taskRole: args.taskRole, // <-- This for me is the culprit.
        containers: {
            [name]: {
                ...args.appResources, // cpu, memory, etc.
                image: args.appImage,
                portMappings: [target],
                environment: args.environment,
            },
        },
    },
}, { parent: this.cluster });

If this is access related, is there a preferred method for extending the existing role that Pulumi assigns by default? In my particular instance, I've attempted to assign a custom role to allow SES access.

jgrocott commented 4 years ago

I managed to resolve my issue. I had the wrong service associated with the taskRole policy. It was set to ecs.amazonaws.com instead of ecs-tasks.amazonaws.com.

jacbar commented 3 years ago

Hey. I'm seeing very similar behaviour in my Pulumi task runner from Azure DevOps.

The ECS task takes more than 10 minutes to reach steady state, but Pulumi is timing out after 10 minutes.

ECS Events (~11 mins)

2020-12-14 16:36:44 +0100 service app-service has reached a steady state.
2020-12-14 16:36:44 +0100 service app-service (deployment ecs-svc/) deployment completed.
2020-12-14 16:36:25 +0100 service app-service has stopped 1 running tasks: task 111
2020-12-14 16:31:22 +0100 service app-service has begun draining connections on 1 tasks.
2020-12-14 16:31:22 +0100 service app-service deregistered 1 targets in target-group engine80-286ce9e
2020-12-14 16:26:49 +0100 service app-service registered 1 targets in target-group engine80-286ce9e
2020-12-14 16:25:39 +0100 service app-service has started 1 tasks: task 222

Azure DevOps Pulumi version:

Starting: Run pulumi  Service
==============================================================================
Task         : Pulumi Azure Pipelines Task
Description  : Azure Pipelines task extension for running Pulumi apps.
Version      : 1.0.7
Author       : Pulumi
Help         : Join us on Slack at https://slack.pulumi.com.
==============================================================================
Downloading: https://get.pulumi.com/releases/sdk/pulumi-v2.15.6-linux-x64.tar.gz
Extracting archive
/bin/tar xzC /home/vsts/work/_temp/123 -f /home/vsts/work/_temp/123

Prepending PATH environment variable with directory: /home/vsts/work/_temp/123/pulumi
/home/vsts/work/_temp/123/pulumi/pulumi version
v2.15.6

Azure DevOps outcome:

  aws:ecs:Service (app-service):
    error: 1 error occurred:
        * updating urn:pulumi:dev::stack-id::awsx:x:ecs:FargateService$aws:ecs/service:Service::app-service: 1 error occurred:
        * error waiting for service (arn:arn/app-service) to reach a steady state: ResourceNotReady: exceeded wait attempts

  pulumi:pulumi:Stack (stack-id):
    error: update failed

Resources:
    + 2 created
    ~ 1 updated
    +-2 replaced
    5 changes. 59 unchanged

Duration: 10m15s

Pulumi nodejs packages version:

    "@pulumi/aws": "3.19.2",
    "@pulumi/awsx": "0.22.0",
    "@pulumi/pulumi": "2.15.4",

Azure DevOps pipelines using Pulumi v2.15.4 were working fine last week.

lukehoban commented 3 years ago

Azure DevOps pipelines using Pulumi v2.15.4 were working fine last week.

Just to make sure: was there something that changed?

Duration: 10m15s

This definitely doesn't look to be expected - I will take another look into this.

jacbar commented 3 years ago

Hey. Yes, the Pulumi version has changed. Last week pipelines were using v2.15.4; this week they are using v2.15.6. But I forced the Azure DevOps pipeline to use v2.15.4 and it's still timing out.

volkanunsal commented 3 years ago

Seeing the same issue with this error:

 error waiting for service to reach a steady state: ResourceNotReady: exceeded wait attempts
austinbutler commented 3 years ago

@lukehoban I'm on Pulumi v2.23.2 and getting this error just after 10m with a normal pulumi/aws (not awsx) ECS deploy (non-fargate). Was the timeout change to 15m only for awsx?

* error waiting for service (...) to reach a steady state: ResourceNotReady: exceeded wait attempts

Duration: 10m4s
leezen commented 3 years ago

@austinbutler This has actually changed upstream. Previously, there was a custom waiter and now it relies on the AWS SDK, which hardcodes a 10 minute timeout (see https://github.com/hashicorp/terraform-provider-aws/pull/3485).

austinbutler commented 3 years ago

Thanks @leezen. Are there any plans on Pulumi's end to support a custom waiter?

leezen commented 3 years ago

I think ideally, we'd prefer a way to have this work cleanly instead of reverting the upstream waiter logic. I've filed https://github.com/aws/aws-sdk-go/issues/3844 to help with that. Separately, I think we could consider updating some of the defaults in awsx itself to make this less likely to happen.

leezen commented 3 years ago

I totally missed there are WithContext methods, so I think we may want to consider using that in order to update the timeout.

sivanovnr commented 1 year ago

Hey, I came here to report that this behavior still exists.

We are just trying to implement Pulumi and noticed that our ECS service constantly "breaks" our pipeline by timing out after 45 minutes.

This happens on a creation of a new ECS service:

 +  aws:ecs:Service test-service-pulumi creating (2714s) error: 1 error occurred:
 +  aws:ecs:Service test-service-pulumi **creating failed** error: 1 error occurred:
@ updating....
    pulumi:pulumi:Stack pulumi-test-service-test-service-qa running error: update failed
    pulumi:pulumi:Stack pulumi-test-service-test-service-qa **failed** 1 error; 1 warning

Diagnostics:
  aws:ecs:Service (test-service-pulumi):
    error: 1 error occurred:
        * creating urn:pulumi:test-service-qa::pulumi-test-service::awsx:ecs:FargateService$aws:ecs/service:Service::test-service-pulumi: 1 error occurred:

        * error waiting for ECS service (arn:aws:ecs:eu-west-1:111222333444:service/qa-ecs-cluster/test-service-pulumi-f5a0d92) to reach steady state after creation: ResourceNotReady: exceeded wait attempts

Pulumi about:

Dependencies:
NAME            VERSION
@pulumi/awsx    1.0.1
@pulumi/pulumi  3.50.2
@types/node     14.18.35
@pulumi/aws     5.24.0

We will most likely apply the waitForSteadyState: false workaround.

corymhall commented 3 weeks ago

You should now be able to update the timeout to a custom value using the transform option. We have another issue (#1112) that is tracking better handling for when errors occur, but I think this issue can be closed. If anyone is still experiencing this issue and the transforms workaround does not work, let me know.

new awsx.ecs.FargateService('service', {...}, {
  transforms: [
      (args: pulumi.ResourceTransformArgs): pulumi.ResourceTransformResult => {
        if (args.type === 'aws:ecs/service:Service') {
          args.opts = pulumi.mergeOptions(args.opts, {customTimeouts: { create: '30m' }})
        }
        return {
          props: args.props,
          opts: args.opts,
        };
      },
  ],
});