pipe-cd / pipecd

The One CD for All {applications, platforms, operations}
https://pipecd.dev
Apache License 2.0

[ECS] 503 error occurred while PrimaryRollout because it deletes the canary too early #4710

Open t-kikuc opened 11 months ago

t-kikuc commented 11 months ago

What happened:

The ECS_PRIMARY_ROLLOUT stage deleted all old task sets, including the canary task set. As a result, the target group that was receiving traffic had no targets left to route to, and requests failed with 503 Service Temporarily Unavailable until the following ECS_TRAFFIC_ROUTING stage finished. This was observed in a Blue/Green deployment.

What you expected to happen:

The ECS_PRIMARY_ROLLOUT stage should delete only the old PRIMARY task sets and keep the canary task set alive.

We need to fix the code below: https://github.com/pipe-cd/pipecd/blob/301e3673f448b6a4d2e86921827b84c937d09002/pkg/app/piped/executor/ecs/ecs.go#L242-L249
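
To make the expected behavior concrete, here is a minimal sketch of the selection step: from the previous task sets, pick everything except the canary for deletion. The `taskSet` type, the `oldPrimaryTaskSets` function, and the IDs are hypothetical and do not come from the PipeCD code base. How the executor actually learns the canary task set's ID is also an assumption here (it is simply passed in).

```go
package main

import "fmt"

// taskSet is a hypothetical, minimal stand-in for an ECS task set.
type taskSet struct {
	ID string
}

// oldPrimaryTaskSets returns the previous task sets that may be deleted
// during primary rollout: everything except the canary task set.
// canaryID is assumed to be known by the caller; pass "" when no canary exists.
func oldPrimaryTaskSets(prev []taskSet, canaryID string) []taskSet {
	toDelete := make([]taskSet, 0, len(prev))
	for _, ts := range prev {
		if canaryID != "" && ts.ID == canaryID {
			continue // keep the canary; it may still be receiving traffic
		}
		toDelete = append(toDelete, ts)
	}
	return toDelete
}

func main() {
	// Made-up IDs for illustration only.
	prev := []taskSet{{ID: "old-primary"}, {ID: "canary"}}
	for _, ts := range oldPrimaryTaskSets(prev, "canary") {
		fmt.Println("delete:", ts.ID)
	}
}
```

With this filter the canary task set survives ECS_PRIMARY_ROLLOUT and is left for ECS_CANARY_CLEAN to remove.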

How to reproduce it:

This happens whenever ECS_PRIMARY_ROLLOUT is used in an ECS deployment. The 503 errors appear in the Blue/Green case.

Environment:

apiVersion: pipecd.dev/v1beta1
kind: ECSApp
spec:
  name: ecs-elb-bg-issue
  input:
    serviceDefinitionFile: servicedef.yaml
    taskDefinitionFile: taskdef.yaml
    targetGroups:
      primary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group/326xxxxxxxxxxxxx
        containerName: web
        containerPort: 80
      canary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group2/a0bxxxxxxxxxxxxx
        containerName: web
        containerPort: 80
  pipeline:
    stages:
      - name: ECS_CANARY_ROLLOUT
        with:
          scale: 100
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          canary: 100
      - name: WAIT_APPROVAL
      - name: ECS_PRIMARY_ROLLOUT
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          primary: 100
      - name: WAIT_APPROVAL
      - name: ECS_CANARY_CLEAN
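
The window in which 503s occur can be illustrated by replaying the stage order above against a toy model. This is only a sketch under simplifying assumptions: the `stage` type and `outageStages` function are hypothetical, the model tracks nothing but whether the canary target group has a task set and how much traffic it receives, and it does not call AWS or PipeCD.

```go
package main

import "fmt"

// stage is a simplified stand-in for one pipeline stage (not a PipeCD type).
type stage struct {
	name          string
	canaryPercent int // only read for ECS_TRAFFIC_ROUTING: traffic share sent to the canary target group
}

// outageStages replays the stages and returns the 0-based indices of stages
// after which the canary target group receives traffic but has no task set.
// deleteCanaryOnPrimaryRollout=true models the behavior reported in this issue.
func outageStages(stages []stage, deleteCanaryOnPrimaryRollout bool) []int {
	canaryTasks := false // whether a canary task set currently exists
	canaryTraffic := 0   // percent of traffic routed to the canary target group

	var outages []int
	for i, s := range stages {
		switch s.name {
		case "ECS_CANARY_ROLLOUT":
			canaryTasks = true
		case "ECS_TRAFFIC_ROUTING":
			canaryTraffic = s.canaryPercent
		case "ECS_PRIMARY_ROLLOUT":
			if deleteCanaryOnPrimaryRollout {
				canaryTasks = false
			}
		case "ECS_CANARY_CLEAN":
			canaryTasks = false
		}
		if canaryTraffic > 0 && !canaryTasks {
			outages = append(outages, i)
		}
	}
	return outages
}

// issuePipeline mirrors the stage order of the configuration in this report.
func issuePipeline() []stage {
	return []stage{
		{name: "ECS_CANARY_ROLLOUT"},
		{name: "WAIT_APPROVAL"},
		{name: "ECS_TRAFFIC_ROUTING", canaryPercent: 100},
		{name: "WAIT_APPROVAL"},
		{name: "ECS_PRIMARY_ROLLOUT"},
		{name: "WAIT_APPROVAL"},
		{name: "ECS_TRAFFIC_ROUTING", canaryPercent: 0}, // primary: 100
		{name: "WAIT_APPROVAL"},
		{name: "ECS_CANARY_CLEAN"},
	}
}

func main() {
	fmt.Println("reported behavior:", outageStages(issuePipeline(), true))
	fmt.Println("expected behavior:", outageStages(issuePipeline(), false))
}
```

In this model the reported behavior leaves the canary target group without a task set from ECS_PRIMARY_ROLLOUT (index 4) through the following WAIT_APPROVAL (index 5), until traffic is routed back to primary. Keeping the canary alive removes that window.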

Diagrams:

Current: [image: current]

Desired: [image: desired]

t-kikuc commented 11 months ago

In the ECS_CANARY_CLEAN stage after ECS_PRIMARY_ROLLOUT, deleting the canary task set failed because it had already been removed.

The logs in Control Plane:

Failed to clean CANARY task set : failed to delete ECS task set : operation error ECS: DeleteTaskSet, https response error StatusCode: 400, RequestID: , TaskSetNotFoundException: Unable to find task set with id on service ecs-bg-broken-service-1.

t-kikuc commented 10 months ago

It would be easy to change ECS_PRIMARY_ROLLOUT so that it does not delete the canary task set, but it would be much easier to change it so that it does not delete any old task sets at all.

I wonder which is better...