spinnaker-plugins / aws-lambda-deployment-plugin-spinnaker

Spinnaker plugin to support deployment of AWS Lambda functions via Spinnaker pipelines
Apache License 2.0

LambdaUpdateCodeStage may not effectively publish the function after updating the code #89

Closed nimakaviani closed 2 years ago

nimakaviani commented 3 years ago

LambdaUpdateCodeStage may not effectively publish the function after updating the code. This could result in traffic getting routed to the older version of the function.

The expected behavior is that, after the function's code is updated, traffic routed to the function hits the latest version, so responses reflect the new code.

{
  "appConfig": {},
  "expectedArtifacts": [],
  "keepWaitingPipelines": false,
  "lastModifiedBy": "admin",
  "limitConcurrent": true,
  "parameterConfig": [],
  "spinLibJsonnetVersion": "0.0.0",
  "stages": [
    {
      "account": "my-aws-account",
      "aliases": [
        "v4"
      ],
      "batchsize": 10,
      "cloudProvider": "aws",
      "deadLetterConfig": {
        "targetArn": ""
      },
      "detailName": "",
      "enableLambdaAtEdge": false,
      "envVariables": {
        "TEST_VAR": "my_value"
      },
      "functionName": "lambda-mystack-my-function2",
      "functionUid": "my-function2",
      "handler": "main",
      "memorySize": 128,
      "name": "Deploy Spinnaker Deploy Lambda",
      "publish": true,
      "refId": "6",
      "region": "us-east-1",
      "requisiteStageRefIds": [],
      "reservedConcurrentExecutions": 10,
      "role": "arn:aws:iam::xxx",
      "runtime": "go1.x",
      "s3bucket": "lambda-bucket-east",
      "s3key": "func1.zip",
      "securityGroupIds": [
        "sg-zzz"
      ],
      "stackName": "mystack",
      "subnetIds": [
        "subnet-xxx,
        "subnet-xxxx"
      ],
      "tags": {
        "adsk:moniker": "lambda",
        "purpose": "Lambda Deployment",
        "squad": "Fauda"
      },
      "timeout": 90,
      "tracingConfig": {
        "mode": "PassThrough"
      },
      "triggerArns": [
        "arn:aws:kinesis:us-xxx"
      ],
      "type": "Aws.LambdaDeploymentStage",
      "vpcId": "vpc-xxx"
    },
    {
      "account": "my-aws-account",
      "aliasName": "v4",
      "deploymentStrategy": "$BLUEGREEN",
      "destroyOnFail": true,
      "functionName": "lambda-mystack-my-function2",
      "healthCheckType": "$LAMBDA",
      "name": "AWS Lambda Routing",
      "outputArtifact": {
        "account": "embedded-artifact",
        "artifact": {
          "artifactAccount": "embedded-artifact",
          "id": "fce2deaf-6cdc-4813-95bb-938ddf5f1581",
          "name": "result",
          "reference": "IlN1cCBqYWNrISI=",
          "type": "embedded/base64"
        }
      },
      "payloadArtifact": {
        "account": "embedded-artifact",
        "artifact": {
          "artifactAccount": "embedded-artifact",
          "id": "8ea866d2-f2a2-4d63-8be9-a402cf2a375f",
          "name": "payload",
          "reference": "ewogICJOYW1lIjogImphY2siCn0=",
          "type": "embedded/base64"
        }
      },
      "refId": "7",
      "region": "us-east-1",
      "requisiteStageRefIds": [
        "9"
      ],
      "timeout": 30,
      "type": "Aws.LambdaTrafficRoutingStage"
    },
    {
      "account": "my-aws-account",
      "functionName": "lambda-mystack-my-function2",
      "name": "AWS Lambda Delete",
      "refId": "8",
      "region": "us-east-1",
      "requisiteStageRefIds": [
        "7"
      ],
      "type": "Aws.LambdaDeleteStage",
      "version": "$ALL"
    },
    {
      "account": "my-aws-account",
      "functionName": "lambda-mystack-my-function2",
      "name": "AWS Lambda Update Code",
      "publish": true,
      "refId": "9",
      "region": "us-east-1",
      "requisiteStageRefIds": [
        "6"
      ],
      "s3bucket": "lambda-bucket-east",
      "s3key": "func2.zip",
      "type": "Aws.LambdaUpdateCodeStage"
    }
  ],
  "triggers": [],
  "updateTs": "1627446556000"
}

I believe the error happens because, unlike the LambdaDeploy stage, where we publish the deployed revision of the function and wait for the published revision to appear in the cache, when we update the code we do not respect whether or not the publish flag is checked, and we do not wait for the new revision to be published before routing traffic to it. This should be an easy fix if the problem is indeed that the latest revision of the function is not properly published.
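
For reference, a minimal sketch of the "publish and wait" behavior described above, written directly against the AWS SDK for Java v1 rather than the plugin's own task classes; the function name, bucket, key, and region come from the pipeline JSON above, while the 5-second polling loop and the LastUpdateStatus check (which needs a reasonably recent SDK) are assumptions, not the plugin's actual logic:

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.GetFunctionConfigurationRequest;
import com.amazonaws.services.lambda.model.GetFunctionConfigurationResult;
import com.amazonaws.services.lambda.model.UpdateFunctionCodeRequest;
import com.amazonaws.services.lambda.model.UpdateFunctionCodeResult;

public class UpdateCodeAndPublishSketch {
    public static void main(String[] args) throws InterruptedException {
        AWSLambda lambda = AWSLambdaClientBuilder.standard()
                .withRegion("us-east-1")
                .build();

        // Update the code and let Lambda publish a new version in the same call,
        // mirroring the stage's "publish": true flag.
        UpdateFunctionCodeResult updated = lambda.updateFunctionCode(
                new UpdateFunctionCodeRequest()
                        .withFunctionName("lambda-mystack-my-function2")
                        .withS3Bucket("lambda-bucket-east")
                        .withS3Key("func2.zip")
                        .withPublish(true));
        String newVersion = updated.getVersion();

        // Poll until the newly published version reports a successful update,
        // so a later traffic-routing step never points the alias at stale code.
        // (Interval and status check are illustrative assumptions.)
        while (true) {
            GetFunctionConfigurationResult cfg = lambda.getFunctionConfiguration(
                    new GetFunctionConfigurationRequest()
                            .withFunctionName("lambda-mystack-my-function2")
                            .withQualifier(newVersion));
            if ("Successful".equals(cfg.getLastUpdateStatus())) {
                break;
            }
            Thread.sleep(5000L);
        }
        System.out.println("Published and active version: " + newVersion);
    }
}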

dkirillov commented 3 years ago

I've tried to reproduce this today and found a bit more detail on the issue. In particular, it only affects newly deployed lambda functions - if a lambda function has already been deployed and has published a version at least once, subsequent runs don't seem to be affected by this bug.

The stages used are "AWS Lambda Deployment", "AWS Lambda Update Code", and "AWS Lambda Route". The "AWS Lambda Route" stage is used primarily to send a payload and check that it receives the expected output - if it doesn't receive the expected output, it fails.

The first set of 3 runs looked like the following (screenshots: Screen_Shot_2021-08-16_at_10_54_20_AM, Screen_Shot_2021-08-16_at_10_49_40_AM, Screen_Shot_2021-08-16_at_10_56_23_AM). The versions also seem to be a bit off. The first time the pipeline ran, only 1 version was created. The second time, 3 versions were created. And the third time, 2 versions. It makes sense that there would be 6 versions in total ((1 deploy + 1 update) * 3 runs = 6 versions), however the runs in which they were created don't make sense.

The first time the pipeline ran, it crashed; subsequent runs with the same parameters succeeded. This got me thinking that the lambda needs to be newly created for the bug to show up.

The error in the route stage was as expected: the event was sent to an old, not-updated version of the lambda, which caused it to crash.
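
For what it's worth, one way to confirm which version actually handled a request is to invoke the alias directly and read the ExecutedVersion that Lambda returns; a rough sketch with the AWS SDK for Java v1 (the alias name and payload below are taken from the pipeline above and are only illustrative):

import java.nio.charset.StandardCharsets;

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.lambda.model.InvokeResult;

public class CheckExecutedVersion {
    public static void main(String[] args) {
        AWSLambda lambda = AWSLambdaClientBuilder.standard()
                .withRegion("us-east-1")
                .build();

        // Invoke through the alias the routing stage targets and ask Lambda
        // which concrete version served the request.
        InvokeResult result = lambda.invoke(new InvokeRequest()
                .withFunctionName("lambda-mystack-my-function2")
                .withQualifier("v4")
                .withPayload("{\"Name\": \"jack\"}"));

        System.out.println("Executed version: " + result.getExecutedVersion());
        System.out.println("Response: "
                + StandardCharsets.UTF_8.decode(result.getPayload()));
    }
}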

I deleted the lambda, by hand, and did another set of 3 runs.

The second set of 3 runs turned out the same (screenshots: Screen_Shot_2021-08-16_at_11_14_31_AM, Screen_Shot_2021-08-16_at_11_01_46_AM, Screen_Shot_2021-08-16_at_11_08_13_AM, Screen_Shot_2021-08-16_at_11_14_53_AM). Again, in the first run, the one that created the new lambda, the route stage failed because the code had not been updated, but in subsequent runs it passed because the code was updated.

The groups in which the versions were created were slightly different this time: run 1 produced 1 version (7), run 2 produced 2 versions (8, 9), and run 3 produced 3 versions (10, 11, 12). The total of 6 versions makes sense, but again their grouping across runs doesn't.

Another interesting thing about versions: the second set started with version 7. My guess is that the version number is left over in Spinnaker's cache. But this is a separate issue from the original one.
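
To separate what Spinnaker's cache reports from what actually exists in the account, it can help to list the function's versions straight from the Lambda API; a small sketch with the AWS SDK for Java v1 (region and function name taken from the pipeline above, pagination omitted for brevity):

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.FunctionConfiguration;
import com.amazonaws.services.lambda.model.ListVersionsByFunctionRequest;
import com.amazonaws.services.lambda.model.ListVersionsByFunctionResult;

public class ListLambdaVersions {
    public static void main(String[] args) {
        AWSLambda lambda = AWSLambdaClientBuilder.standard()
                .withRegion("us-east-1")
                .build();

        // Ask Lambda directly which versions exist for the function; comparing
        // this against what the pipeline/cache shows helps spot stale entries.
        ListVersionsByFunctionResult versions = lambda.listVersionsByFunction(
                new ListVersionsByFunctionRequest()
                        .withFunctionName("lambda-mystack-my-function2"));
        for (FunctionConfiguration fc : versions.getVersions()) {
            System.out.println(fc.getVersion() + "  last modified " + fc.getLastModified());
        }
    }
}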


I captured this in a pipeline that can be re-run on a schedule (screenshot: Screen_Shot_2021-08-16_at_11_37_29_AM).

The route stage is set to ignore failure so that the subsequent delete stage still runs. After the delete stage there is a precondition stage that checks the route stage's result and fails the pipeline if the route stage failed.

The JSON representation of the pipeline above (modify values to fit your environment): https://gist.github.com/dkirillov/609bf63587989bff6464be2fdda04363 (too long to paste in here)