newrelic / newrelic-lambda-layers

Source code and utilities to build and publish New Relic's public AWS Lambda layers.
https://newrelic.com/products/serverless-aws-lambda
Apache License 2.0

Layer is a resource hog, requiring a large amount of memory for acceptable cold start #174

Closed nthornton2010 closed 11 months ago

nthornton2010 commented 1 year ago

Description

The Lambda layer drastically slows down cold starts for the Lambdas it's added to unless memory is increased substantially.

Steps to Reproduce

To test, I created a Node.js Lambda function whose handler body is shown below. The Lambda was triggered by API Gateway. As you can see, nothing is going on in the function.

export const exampleMethod = (event, context) => {
   return { statusCode: 200 };
};

Here are some numbers using the above (these are all cold starts):

Without the layer installed, a cold start finished in 2 seconds at 128 MB of memory and in 1.7 seconds at 10240 MB.

Based on the above, anything that requires some level of performance will need either a memory allocation somewhere between 2 GB and 4 GB or provisioned concurrency. Both can get expensive over time, especially when the function used to require only 128 MB.

Expected Behaviour

Something much faster without large amounts of resources.

If there is nothing that can be done to improve this, then your documentation should be updated to list increasing the memory allocation as part of the integration steps. The documentation for the serverless plugin should also be updated (I know that's a separate repository).

Your Environment

    newRelic:
        accountId: ####
        trustedAccountKey: ####
        apiKey: ${ssm:/aws/reference/secretsmanager/NewRelicAPIKey}
        enableExtension: true
        enableIntegration: true
        enableFunctionLogs: true
mrickard commented 1 year ago

I'm sorry to hear that you're having difficulty with instrumentation. These cold start times seem much higher than typical. Actual memory requirements and cold start overhead can be affected by the number of dependencies being instrumented and by how an application is architected. I'd recommend filing a ticket with Support.

nthornton2010 commented 1 year ago

@mrickard We're seeing this across the board in every Lambda function we have, ranging from functions with little to no dependencies (extremely small and simple) to functions with a lot of dependencies. The numbers are all the same. As you can see, my Lambda function with nothing in it (though dependencies were imported) finished a cold start in 2 seconds at 128 MB and now takes over 30 seconds. The only difference between the two is that the NR layer was added.

Unless you think the support ticket will help get this issue looked into and fixed by your development team faster, I won't be opening one. The setup for the layer is very simple and has been done correctly and I feel very strongly that there are clear performance issues on your end.

I'm not the only one seeing these issues; just from Googling, I've found others complaining about it, though your support reps keep telling people to disable either part of the integration or the entire integration and switch back to CloudWatch ingestion. If the recommendation is to not integrate because the layer is too slow, then this points to a problem with your layer, not with how others are integrating it.

Some references:
https://forum.newrelic.com/s/hubtopic/aAX8W0000008cI4WAI/new-relic-lambda-extension-taking-too-much-time-to-setup
https://forum.newrelic.com/s/hubtopic/aAX8W0000008dP4WAI/recent-degraded-performance-when-using-nr-lambda-layer-and-apollo-plugin

We just upgraded one of our services from Node 12 to Node 18, and with it, the layer was also upgraded. The Node 12 functions with the layer (running at 1024 MB) completed in 500ms. After upgrading to Node 18 and upgrading the layer, the same functions now take about 4 seconds to complete. This is just one of many examples where we're seeing these performance issues.

mrickard commented 1 year ago

@nthornton2010 Opening a ticket with support is the recommended way to address problems like the ones you're seeing.

nthornton2010 commented 1 year ago

I've opened a support ticket. Case # is 00139256.

scanham commented 1 year ago

@mrickard We've seen a similar issue in our Lambdas using NewRelicNodeJS14X version >= 109. If I switch to version 108, the Lambda cold starts quickly, as it has for months. If I bump it to 109 (or anything higher), it times out at 6 seconds. If I increase the timeout and the Lambda/layer is able to fully initialize, then subsequent executions are fast, but it's not ideal for future cold starts from concurrent executions to be this slow. I'll open a support ticket as well if that is the recommended path here; I just wanted to add some data from my testing indicating that whatever changed was introduced in version 109.

nthornton2010 commented 1 year ago

@scanham My ticket is still active (to be fair, I was on PTO last week). I've enabled trace-level logging in one of my functions and will provide that to their engineer tomorrow. Up to this point, though, I get the feeling their engineer is going to try to blame something in our service rather than own up to a clear defect in their Lambda layer. Some of the questions I've been asked are about why we're doing what we're doing or why we're using certain packages. At the end of the day, New Relic shouldn't care why my logic is the way it is or why I'm using some package. Their job is to collect data from it. At some point they stopped doing this well, and now they're trying to place the blame on their customers.

I was able to track the issue down to having a lot of imported modules in some of our files (this shouldn't be a problem; they should account for it), but they said that's not a clear representation of how my service works... I deleted ALL logic from the Lambda function and left only the imports, and it was still slow. After deleting the imports as well, the layer started quickly. This should have been a great recreation of the bug, but they said it's not the same thing 😕 - they want to know how the logic works (when they shouldn't need to).

They also focused on 1 deprecated package that we're still using in the particular example function, ignoring the fact that I said ALL of the functions in all of our Lambda services in our organization are suffering from this and nearly all of our services no longer use this deprecated package. The focus on this deprecated package just shows that they're going to try to say their layer isn't broken 🤦‍♂️ and that it's a me problem.

mrickard commented 1 year ago

@scanham Thank you for the layer version information. Those layers install the same agent version, but 109 installs a largely refactored wrapper, shown in this commit: https://github.com/newrelic/newrelic-lambda-layers/commit/6bb80238c560e33c5d0b9e663702c210566720a5

@nthornton2010 As a quick check, do you notice a performance difference with your Lambda function if you pin to an earlier version? For your Node 18 functions in us-east-1, the last version before the wrapper was refactored is arn:aws:lambda:us-east-1:451483290750:layer:NewRelicNodeJS18X:30.

mrickard commented 1 year ago

@nthornton2010 @scanham @haooliveira84

We've investigated, and the cause of the slowness is a conflict between the behavior of the Node wrapper helper and ESM functions that require a number of imports. In order to wrap functions without a code change, the wrapper awaits the completion of a dynamic import, and that await is where we're seeing the impact on performance. The impact may be negligible for functions using few dependencies, but functions with many imports are likely to see a noticeable slowdown.
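
To illustrate the mechanism, here is a simplified, hypothetical sketch of that kind of dynamic wrapping (not the layer's actual source; NEW_RELIC_LAMBDA_HANDLER is the environment variable the layer uses to locate the original handler):

// wrapper.mjs - simplified sketch of dynamic wrapping, for illustration only
const original = process.env.NEW_RELIC_LAMBDA_HANDLER || 'index.handler';
const dotIndex = original.lastIndexOf('.');
const modulePath = original.slice(0, dotIndex);   // e.g. "index"
const exportName = original.slice(dotIndex + 1);  // e.g. "exampleMethod"

export async function handler(event, context) {
    // The user's module is only known at runtime, so it is loaded with a dynamic
    // import. Every top-level import in that module (and its dependency tree) is
    // resolved inside this await, which is where the cold-start time goes for
    // import-heavy functions.
    const userModule = await import(`./${modulePath}.mjs`);
    return userModule[exportName](event, context);
}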

We've tested a workaround for this scenario, which involves bypassing the wrapper and instrumenting the handler function manually. There's a PR on this repo adding documentation for it to the README.
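
For reference, manual wrapping with the layer-installed agent looks roughly like this (a sketch assuming a CommonJS handler; the function's handler setting points at this file directly instead of at newrelic-lambda-wrapper.handler):

// handler.js - sketch of manual instrumentation that bypasses the dynamic wrapper
'use strict';
const newrelic = require('newrelic'); // agent provided by the Lambda layer

module.exports.exampleMethod = newrelic.setLambdaHandler(async (event, context) => {
    // original handler logic stays unchanged
    return { statusCode: 200 };
});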

For anyone deploying with Serverless, we've logged a ticket to add an option to bypass dynamic wrapping. This should be available for the serverless plugin in one of its next few releases.

nthornton2010 commented 1 year ago

@mrickard I replied to my support ticket just now but also wanted to share it here too. While the sample project I shared on my ticket was using ESM, I've had reports from other teams within my organization that they also experience these issues in projects not using ESM. I have not confirmed this myself and don't have the time to do this at the moment. But based on https://github.com/newrelic/serverless-newrelic-lambda-layers/issues/376, it sounds like others not using ESM are experiencing issues as well.

I completely agree that the number of imports has something to do with it; that's what I found when recreating this issue locally for the sample project. But I'm not sold on it being an ESM-only issue. That said, we're in the process of transitioning our remaining non-ESM projects to ESM as part of upgrading Node versions, so as far as our organization goes, we're really only concerned about the ESM solution. I know many other people are not using ESM projects, though.

Please reply here when the option to bypass dynamic wrapping is added to the Serverless plugin. I'd like to keep using the plugin long-term.

FWIW - the serverless plugin also doesn't do anything to add the NEW_RELIC_USE_ESM environment variable; we had to set it manually in our serverless.yml file. That should probably be a separate issue, but I wanted to mention it since ESM-related things seem to be getting worked on.
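
For anyone else hitting this, it's just an ordinary environment variable entry; roughly what we add by hand looks like the following (whether it goes under provider or per function depends on your setup):

provider:
  environment:
    NEW_RELIC_USE_ESM: 'true'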

aec4444 commented 1 year ago

We are experiencing the same issue with our Node 18 Lambdas using layer arn:aws:lambda:us-east-1:451483290750:layer:NewRelicNodeJS18X:42. We typically allocate 512 MB, and we're seeing 4s cold start times. We haven't seen this in the past.

mrickard commented 1 year ago

@aec4444 Are you using ESM? For ESM functions, we're recommending manual wrapping, if you're encountering slow cold start times.

aec4444 commented 1 year ago

@aec4444 Are you using ESM? For ESM functions, we're recommending manual wrapping, if you're encountering slow cold start times.

We are not using ESM, but we are converting our Lambdas to Node 18. However, I've seen this issue with Node 16 lately too. I don't have exact timing for when things got slower, but I've been asked by our engineering teams to dig in and see why. With a microservices structure, a 4s cold start (that's what we are seeing) across 3-4 hops is a killer.

aec4444 commented 1 year ago

I did some research this morning. Very basic Lambda, NOT using ESM.

I set up REST API endpoints for each of the Lambdas (just so they were easier to invoke, no other purpose) and used Postman to invoke them.

I got the following results for "hello world" like lambdas with the latest NR Layer for the given Node Runtime.

Node 18 - 776ms without NR, 1652ms with NR
Node 16 - 691ms without NR, 1491ms with NR
Node 14 - 765ms without NR, 1479ms with NR

We are using Node 18, so you can see an added cost of almost 900ms. Since we have a microservice-type setup, and some requests may take 3 hops, we pay this price 3 times.

We can look into keeping things warm, but that can mean a pretty large expense.

I do not remember this being a problem a few months ago. Maybe something in the layer has caused this? I will run tests with older versions of the layer.

nthornton2010 commented 1 year ago

@aec4444 depending on how often your Lambda functions are hit, you may want to consider trying to increase the amount of memory you have allocated to your Lambdas instead of keeping things warm. You'll have to figure out which is cheaper in the long run (higher memory vs. provisioned concurrency).

aec4444 commented 1 year ago

@aec4444 depending on how often your Lambda functions are hit, you may want to consider trying to increase the amount of memory you have allocated to your Lambdas instead of keeping things warm. You'll have to figure out which is cheaper in the long run (higher memory vs. provisioned concurrency).

Thanks, we can do this. We already have our Lambda at 512 MB, which seems very high for these simple Lambdas, but I can try bumping it and performing the test again.

nthornton2010 commented 1 year ago

@aec4444 depending on how often your Lambda functions are hit, you may want to consider trying to increase the amount of memory you have allocated to your Lambdas instead of keeping things warm. You'll have to figure out which is cheaper in the long run (higher memory vs. provisioned concurrency).

Thanks, we can do this. We already have our lambda at 512mb which seems very high for these simple lambdas, but I can try bumping and performing the test again

We had Lambdas at 128 MB that we had to bump up to 2 GB to get acceptable times :unamused: - we did see that 3 GB gave even better times (nearly the same as NR not being installed), and after that the cold start time didn't really change, but 2 GB was acceptable for what we needed.

aec4444 commented 1 year ago

Here are some tests after bumping from 512 MB to 2048 MB:

Node 18 - 697ms without NR, 1191ms with NR (repeat timing 41/59)
Node 16 - 805ms without NR, 1041ms with NR (45/59)
Node 14 - 603ms without NR, 1056ms with NR (48/67)

So you can see a pretty good reduction. The repeat times are just me running them again to see how they perform warm.

I just do not remember this being such a major problem with the layer 3-4 months ago.

aec4444 commented 1 year ago

Here is what confuses me most, though: my Lambdas report, for instance, 512 MB allocated / 115 MB used, or, after I bumped it, 2048 MB allocated / 117 MB used. So is the extra memory actually being used? (Obviously something is, but maybe that log only counts the invocation and not the cold start.)

nthornton2010 commented 1 year ago

Here is what confuses me most though my lambdas report, for instance 512MB allocated, 115MB used or in the case after I bumped it, 2048MB allocated, 117MB used. So is the extra memory being used? (obviously, but maybe that log is only counting the invocation not the cold start)

According to AWS, the memory allocation directly controls how much CPU the function gets. More memory means more CPU (and possibly other resources), so it isn't the memory itself that's being used; it's the other resources that aren't shown in that report.

https://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html

aec4444 commented 1 year ago

Here is what confuses me most though my lambdas report, for instance 512MB allocated, 115MB used or in the case after I bumped it, 2048MB allocated, 117MB used. So is the extra memory being used? (obviously, but maybe that log is only counting the invocation not the cold start)

According to AWS, the memory allocation value directly correlates with CPU performance. More memory results in a better CPU (and possibly other things), so the memory isn't actually being used, the other resources that are not shown are.

https://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html

Makes sense. I knew more memory meant more CPU; I just didn't think NR needed more CPU to start up. Interesting.

aec4444 commented 1 year ago

I ran another experiment, this time just on Node18.

Instead of using arn:aws:lambda:us-east-1:451483290750:layer:NewRelicNodeJS18X:42, I used arn:aws:lambda:us-east-1:451483290750:layer:NewRelicNodeJS18X:1.

I lowered memory back down to 512 MB.

Node 18 - 669ms without NR, 1061ms with NR.

So the Node 18 layer at version 42, by my very unscientific test, takes 600ms longer than version 1 of the same layer.

mrickard commented 1 year ago

@aec4444 Would you be able to share your test functions for a repro case?

aec4444 commented 1 year ago

import { logger } from '../../logger/logger';
logger.log({
    message: 'Global Space',
    meta: {
        status: 'starting'
    }
});

// externals
import { Handler, APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

logger.log({
    message: 'Global Space',
    meta: {
        status: 'aws-lambda imported'
    }
});

// config
import { ConfigurationManager } from '../../configuration/configurationManager';
import packageJson from '../../../package.json';

// utils
import { Configuration } from '../../configuration/Configuration';

logger.log({
    message: 'Global Space',
    meta: {
        status: 'local files imported'
    }
});

// get this globally so it can be in place for the lifetime of the lambda
export const configManager = new ConfigurationManager();

logger.log({
    message: 'Global Space',
    meta: {
        status: 'configuration mgr instantiated'
    }
});

// lambda handler as the default this is setup as a rest api
// TODO: Use Typescript Types for the event you're trying to handle
const handler: Handler<APIGatewayProxyEvent> = async (event, context): Promise<APIGatewayProxyResult> => {
    logger.log({
        message: 'Lambda Invocation',
        meta: {
            status: 'starting'
        }
    });
    context.callbackWaitsForEmptyEventLoop = false;

    let config: Configuration;
    if (process.env.GET_PARAMETERS) {
        logger.log({
            message: 'Lambda Invocation',
            meta: {
                status: 'getting config'
            }
        });
        config = await configManager.get(process.env.ENVIRONMENT || 'dev');

        logger.log({
            message: 'Lambda Invocation',
            meta: {
                status: 'received config'
            }
        });
    }

    // returns the 200 success
    const result: APIGatewayProxyResult = {
        statusCode: 200,
        body: JSON.stringify({
            success: true,
            version: packageJson.version,
            currentDate: new Date(),
            region: process.env.AWS_REGION,
            config
        }),
        headers: {
            'Content-Type': 'application/json'
        }
    };

    logger.log({
        message: 'Lambda Invocation',
        meta: {
            status: 'building response'
        }
    });

    return result;
};

export default handler;

logger is just a fancy wrapper around console.log so we can send structured logs to CloudWatch. Configuration and ConfigurationManager are not being used in my test (since process.env.GET_PARAMETERS is not set).
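
If anyone wants to run this repro outside our codebase, a minimal stand-in for that logger would look something like this (my sketch, not our actual implementation):

// logger/logger.ts - hypothetical minimal stand-in for the internal logger used above
export const logger = {
    log: (entry: { message: string; meta?: Record<string, unknown> }): void => {
        // one structured JSON line per call, which is all CloudWatch needs
        console.log(JSON.stringify({ timestamp: new Date().toISOString(), ...entry }));
    }
};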

mrickard commented 1 year ago

@aec4444 Thank you for the repro case. The use of import statements does show this to be a function using ESM (or ECMAScript modules). For customers experiencing long cold start times in ESM functions, we suggest using the Lambda-layer-installed agent, but not setting the handler to the dynamic wrapper. Instead, in Node 18 you can import the New Relic agent and wrap the handler in your function code. See https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/enable-lambda-monitoring/enable-serverless-monitoring-aws-lambda-legacy/#node
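
As a rough sketch of that approach in an ESM function (assuming the layer-provided agent resolves as newrelic; file and export names are illustrative):

// handler.mjs - sketch of wrapping the handler manually in an ESM function
import newrelic from 'newrelic'; // agent provided by the Lambda layer

export const handler = newrelic.setLambdaHandler(async (event, context) => {
    // original business logic
    return { statusCode: 200 };
});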

nthornton2010 commented 1 year ago

@aec4444 Thank you for the repro case. The use of import statements does show this to be a function using ESM (or ECMAScript modules). For customers experiencing long cold start times in ESM functions, we suggest using the Lambda-layer-installed agent, but not setting the handler to the dynamic wrapper. Instead, in Node 18 you can import the New Relic agent and wrap the handler in your function code. See https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/enable-lambda-monitoring/enable-serverless-monitoring-aws-lambda-legacy/#node

You can use import without ESM... We've been using import for years and only just started using ESM in our services a few months ago.

aec4444 commented 1 year ago

@aec4444 Thank you for the repro case. The use of import statements does show this to be a function using ESM (or ECMAScript modules). For customers experiencing long cold start times in ESM functions, we suggest using the Lambda-layer-installed agent, but not setting the handler to the dynamic wrapper. Instead, in Node 18 you can import the New Relic agent and wrap the handler in your function code. See https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/enable-lambda-monitoring/enable-serverless-monitoring-aws-lambda-legacy/#node

You can use import without ESM... We've been using import for years and only just started using ESM in our services a few months ago.

This is correct. We have not transitioned to ESM yet, but it was on our roadmap

aec4444 commented 1 year ago

@aec4444 Thank you for the repro case. The use of import statements does show this to be a function using ESM (or ECMAScript modules). For customers experiencing long cold start times in ESM functions, we suggest using the Lambda-layer-installed agent, but not setting the handler to the dynamic wrapper. Instead, in Node 18 you can import the New Relic agent and wrap the handler in your function code. See https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/enable-lambda-monitoring/enable-serverless-monitoring-aws-lambda-legacy/#node

We aren't using ESM. This is the TypeScript code we use; we then use esbuild to bundle it. I could send that too.

We were hoping not to have to do all the work of instrumenting every Lambda. The idea was to use the layer and get that stuff "for free".

mrickard commented 1 year ago

@aec4444 So you're using import, but you're not specifying module as the type? What you're likely running into is the dynamic wrapper importing the full application on line 73. The fact that you're not using the module type might offer a path to a fix. Would you be able to provide a zip of your test functions and package files?

aec4444 commented 1 year ago

@aec4444 So you're using import, but you're not specifying module as the type? What you're likely running into is the dynamic wrapper importing the full application on line 73. That you're not using the module type...might offer a path to a fix. Would you be able to provide a zip of your test functions and package files?

I made a mistake and pasted the TypeScript code we use, not the bundled, esbuild-generated JS code.

mrickard commented 1 year ago

Yeah, import in TypeScript is fine, because TypeScript is transpiled to CommonJS. I'd be interested in seeing the transpiled code as well, though.

aec4444 commented 1 year ago

Yeah, import in typescript is fine, because typescript is transpiled to common JS. I'd be interested in seeing the transpiled code as well, though.

Here is a link to the Code

aec4444 commented 1 year ago

I was able to instrument my code using https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/enable-lambda-monitoring/enable-serverless-monitoring-aws-lambda-legacy/#node and also using https://github.com/newrelic/newrelic-lambda-layers/blob/master/README.md to get logs sent down to NR from my lambda.

In my unscientific test at 512 MB, the Node.js Lambda layer's wrapper took 1256ms at this time of night, while the "roll your own" method took 1127ms. I'll do more timing in the morning.

This does put a lot more work on our devs, which I do not like. I like the idea of "add the layer, set these env variables, and you're good to go"...

This timing was all done with the latest version of the Node 18 Lambda layer and v33 of the NR extension.

mrickard commented 1 year ago

@aec4444 Would you be able to refresh your link to the code?

aec4444 commented 1 year ago

@aec4444 Would you be able to refresh your link to the code?

Can you try again? Our security staff changed the access rights to the file

hamilton-s commented 1 year ago

We're also having the same issue and are stuck on Lambda layer version 30. Any updates on this?

mrickard commented 1 year ago

@hamilton-s We've been able to reproduce cold-start slowness only with ESM functions using the dynamic wrapper. (For those cases, we recommend wrapping the handler function manually. You can still use the layer, but not the wrapper. This is due to the requirements of applying instrumentation to ES modules in Node.)

We've received repro cases from a few customers for slowness in non-ESM functions, but haven't been able to observe excessive cold start times with the repro cases. (When we ran the repros, we were seeing ~800-900ms cold start times.)

I'd encourage you to open a ticket with support. If you have a repro that can be run, we'd be happy to try it.

nthornton2010 commented 1 year ago

@mrickard did you try @aec4444's sample? They're not using ESM. In the original sample I provided, you can very easily disable ESM and still see the issue. I don't believe we should have to do all of the triaging for you; we're all paying customers who have more important things to do than triage New Relic's broken things. Samples have been provided to you by me and someone else. Please use them.

mrickard commented 1 year ago

@nthornton2010 We have in fact tested with @aec4444's sample. Here's what we've seen using that function and the latest wrapper:

Agent v9.15.0: 822 ms cold start time
Agent v10.6.2: 840 ms cold start time
Agent v11.4.0: 902 ms cold start time
Agent v11.4.0 with a subset of internal dependencies: 898 ms cold start time

The older version of the wrapper performed similarly--in the 800-900ms range.

aec4444 commented 1 year ago

@mrickard I have a whole support case open with your support engineer, citing stats that show it's slower. (Case 00156485)

The issue is that you're starting with 1024 MB of memory, which our Lambda doesn't need; without the NR layer involved, it runs efficiently with far less. I totally understand if we need more memory to support the layer. Is it possible to post minimum requirements?

mrickard commented 11 months ago

We've just released new Node layers that wrap the user handler earlier instead of deferring until invocation. In our tests, this improves performance for Lambdas with memory set below 1024 MB. (We tested down to 128 MB.) If you have a chance, please give them a try.