newrelic / newrelic-lambda-layers

Source code and utilities to build and publish New Relic's public AWS Lambda layers.
https://newrelic.com/products/serverless-aws-lambda
Apache License 2.0

When New Relic is down, the Lambda (Node.js) times out and responds slowly #202

Closed fmuracciole closed 4 months ago

fmuracciole commented 7 months ago

In December 2023, the EU New Relic servers had a severe outage, causing our Lambdas to time out and hit the AWS account's maximum concurrency limit. Our production applications became very slow and then went down. We had to remove the New Relic layers from each Lambda to resume normal activity.

Description

We are in eu-west-1, using the layer arn:aws:lambda:eu-west-1:451483290750:layer:NewRelicNodeJS18XARM64:57 in a Serverless configuration.

Steps to Reproduce

Deploy a hello-world Lambda in a VPC, with a quick setup using "serverless-newrelic-lambda-layers": "5.0.0". The VPC's security group must have no rules (inbound or outbound); the aim is to reproduce an outage of the New Relic servers.

Relevant excerpt from serverless.yml:

service: hello-world

plugins:
  - serverless-webpack
  - serverless-newrelic-lambda-layers

custom:
  newRelic:
    nrRegion: eu
    linkedAccount: xxxx
    accountId: xxxx
    apiKey: xxx
    enableIntegration: false
    enableExtension: true
    enableFunctionLogs: true
    debug: true
    logLevel: info
    enableExtensionLogs: false

provider:
  environment:
    REGION: eu-west-1
    NEW_RELIC_LICENSE_KEY: XXXXX

functions:
  TestNewRelic:
    handler: src/handler/test.handler
    role: !GetAtt LambdaVpcRole.Arn
    vpc:
      subnetIds: XXXXX
      securityGroupIds: 
      - { Ref: "SecurityGroupVpc" }
    events:
      - http:
          method: GET
          path: /testHW

The Lambda code, test.ts:

export const handler = async () => {
    console.log('Some business stuff');

    return { statusCode: 200, body: 'Hello World' };
};

Expected Behaviour

This code should complete within 50 ms as seen from the client, and in about 5 ms inside the Lambda, like this:

2024-01-31T15:59:54.637+01:00 START RequestId: 000f0416-4591-4461-b12c-91cd8184aaea Version: $LATEST
2024-01-31T15:59:54.638+01:00 2024-01-31T14:59:54.638Z 000f0416-4591-4461-b12c-91cd8184aaea INFO Some business stuff
2024-01-31T15:59:54.642+01:00 END RequestId: 000f0416-4591-4461-b12c-91cd8184aaea
2024-01-31T15:59:54.642+01:00 REPORT RequestId: 000f0416-4591-4461-b12c-91cd8184aaea Duration: 4.81 ms Billed Duration: 5 ms Memory Size: 1024 MB Max Memory Used: 125 MB

To get these results, I added the outbound rule All TCP 0.0.0.0/0 to the VPC security group.

Relevant Logs / Console output

After removing the outbound rule, the client gets the response in 1.2 s, but the Lambda keeps running until the timeout plus a 2 s penalty, and then fails.

2024-01-31T16:05:23.899+01:00 | INIT_START Runtime Version: nodejs:18.v20 Runtime Version ARN: arn:aws:lambda:eu-west-1::runtime:8c909ba80c0363d759e5fc7f9aa6b9fd23ae563b082256816c4ccdcb11de3748
2024-01-31T16:05:23.977+01:00 | [NR_EXT] New Relic Lambda Extension starting up
2024-01-31T16:05:23.985+01:00 | LOGS Name: newrelic-lambda-extension State: Subscribed Types: [Platform, Function]
2024-01-31T16:05:24.627+01:00 | EXTENSION Name: newrelic-lambda-extension State: Ready Events: [INVOKE, SHUTDOWN]
2024-01-31T16:05:24.630+01:00 | START RequestId: b6200688-caa5-4623-86e4-9d7549b5f8b9 Version: $LATEST
2024-01-31T16:05:24.640+01:00 | 2024-01-31T15:05:24.640Z b6200688-caa5-4623-86e4-9d7549b5f8b9 INFO Some business stuff
2024-01-31T16:05:32.637+01:00 | 2024-01-31T15:05:32.637Z b6200688-caa5-4623-86e4-9d7549b5f8b9 Task timed out after 8.01 seconds
2024-01-31T16:05:32.637+01:00 | END RequestId: b6200688-caa5-4623-86e4-9d7549b5f8b9
2024-01-31T16:05:32.637+01:00 | REPORT RequestId: b6200688-caa5-4623-86e4-9d7549b5f8b9 Duration: 8007.69 ms Billed Duration: 6000 ms Memory Size: 1024 MB Max Memory Used: 108 MB Init Duration: 728.19 ms

Your environment

AWS, NodeJS 18.x, Typescript, layers 5.0.0, automatic wrapper

Additional context

Thanks in advance.

mrickard commented 7 months ago

@fmuracciole Thank you for the report and repro. The Node Agent doesn't make any connection to NR in serverless mode. The Lambda Extension, however, does deliver telemetry and logs to NR, and it runs as a sidecar process in the Lambda execution environment, so I'm going to raise this with the team working on the Extension.

As a short-term remediation, you could set NEW_RELIC_LAMBDA_EXTENSION_ENABLED to false (or enableExtension to false if you're using serverless-newrelic-lambda-layers) and rely on telemetry being delivered via the log ingestion Lambda function, which operates independently from the function being instrumented.
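For the setup in the report, that remediation could look roughly like the fragment below. This is a minimal sketch, assuming the rest of the custom.newRelic block stays as posted; only enableExtension changes. The commented-out variant is for setups that don't use the plugin, and its placement under provider.environment is an assumption based on the serverless.yml above.

```yaml
custom:
  newRelic:
    # Disable the Lambda Extension sidecar, so the function no longer
    # depends on reaching New Relic during an invocation.
    enableExtension: false

# Without serverless-newrelic-lambda-layers, the equivalent setting is
# the environment variable named above (placement is an assumption):
# provider:
#   environment:
#     NEW_RELIC_LAMBDA_EXTENSION_ENABLED: "false"
```

With the extension disabled, telemetry delivery relies on the log ingestion Lambda function mentioned above.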

fmuracciole commented 7 months ago

Thanks for your reply, we'll wait for the Extension team's response!

chaudharysaket commented 5 months ago

@fmuracciole The log line "Task timed out after 8.01 seconds" indicates that the Lambda task timeout is set to ~8 s in the configuration. Please set the NEW_RELIC_DATA_COLLECTION_TIMEOUT environment variable to, say, 4s (reference code). Thank you!
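Applied to the serverless.yml from the report, the suggestion might look like this. It is a sketch only: the 4s value is the example from this comment, and placing the variable under provider.environment is an assumption based on where the report sets NEW_RELIC_LICENSE_KEY.

```yaml
provider:
  environment:
    REGION: eu-west-1
    NEW_RELIC_LICENSE_KEY: XXXXX
    # Example value from the comment above; presumably it should stay
    # well below the function's own configured timeout.
    NEW_RELIC_DATA_COLLECTION_TIMEOUT: 4s
```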

chaudharysaket commented 4 months ago

We added a timeout to the HTTP calls in the extension. When there is a network issue in the Lambda, those HTTP calls now fail fast.