spring-cloud / spring-cloud-function

Apache License 2.0

Snapstart Integration #967

Closed nbenjamin closed 11 months ago

nbenjamin commented 1 year ago

Currently we are using Spring Cloud Function for AWS Lambda. AWS recently introduced SnapStart, which tremendously improves cold start time. As a best practice, we have to re-establish any network connections in the function after a restore; CRaC provides the afterRestore and beforeCheckpoint hooks for this.

So I wanted to check whether there is any possibility of an integration that makes use of this in Spring Cloud Function.

olegz commented 1 year ago

Basically, it works. @msailes of AWS and I have tried it, and it just works. Yes, there may be a need to deal with some resources, but that is specific to a particular application. It could be as simple as stopping and starting the application context in beforeCheckpoint and afterRestore respectively, provided that such resources implement Spring's Lifecycle, which network resources provided by Spring typically do. So yes, in the future we can provide such an implementation, probably at the level of Spring core/Boot. But it is too early to do anything at this point, so for now I suggest you provide your own implementation of Resource, in which you stop/start the application context, and distribute it with your app. For apps that do not deal with network resources or file handles, there is nothing that needs to be done.
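A minimal sketch of the stop/start idea described above. To stay self-contained it uses two hypothetical stand-in interfaces, `Lifecycle` and `CheckpointResource`, instead of the real types. In an actual application the resource would implement `org.crac.Resource` (whose hooks also take a context argument), be registered with `Core.getGlobalContext().register(...)`, and wrap the Spring application context. Treat it as an illustration under those assumptions, not as the project's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointLifecycleSketch {

    /** Hypothetical stand-in for Spring's Lifecycle, trimmed to start/stop. */
    interface Lifecycle {
        void start();
        void stop();
    }

    /** Hypothetical stand-in for a CRaC resource with the two hooks named in this thread. */
    interface CheckpointResource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    /** Stops the wrapped lifecycle before the snapshot and starts it again after restore. */
    static final class StopStartResource implements CheckpointResource {
        private final Lifecycle target;

        StopStartResource(Lifecycle target) {
            this.target = target;
        }

        @Override
        public void beforeCheckpoint() {
            target.stop();   // release connections and file handles before the snapshot is taken
        }

        @Override
        public void afterRestore() {
            target.start();  // re-establish them in the restored execution environment
        }
    }

    /** Fake "application context" that only records the calls it receives. */
    static final class RecordingLifecycle implements Lifecycle {
        final List<String> events = new ArrayList<>();

        @Override
        public void start() {
            events.add("start");
        }

        @Override
        public void stop() {
            events.add("stop");
        }
    }

    public static void main(String[] args) throws Exception {
        RecordingLifecycle context = new RecordingLifecycle();
        CheckpointResource resource = new StopStartResource(context);

        resource.beforeCheckpoint(); // what the runtime would call before snapshotting
        resource.afterRestore();     // what the runtime would call after restoring

        System.out.println(context.events); // prints [stop, start]
    }
}
```

Wrapping the whole context keeps the hook trivial, at the cost of restarting every Lifecycle bean on restore; an application with only one or two network clients might prefer to handle just those.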

bala-cbt commented 1 year ago

Dear Oleg, We have been trying to integrate Spring Cloud Function with AWS SnapStart to improve cold start times. There has been a significant improvement in the start times. However, we have been facing an issue where the first request to the function fails (with a 500 status code). All subsequent function calls work fine. We have deployed our functions behind an AWS API Gateway. The functions work fine when we do not use SnapStart (but have longer cold start times). It would be very helpful if you could share some thoughts or advice to help us troubleshoot this issue. Thank you.

msailes commented 1 year ago

Hi @bala-cbt, that isn't much information. Is there anything more you can share? Errors? Are you writing any runtime hooks? Do you have a public example?

bala-cbt commented 1 year ago

Hi @msailes Thank you for your response. We are using the serverless framework to deploy the spring cloud functions on AWS Lambda. We are not using any runtime hooks. A sample function definition from the serverless.yml is as below:

functions:
  inviteUser:
    handler: org.springframework.cloud.function.adapter.aws.FunctionInvoker::handleRequest
    events:
      - httpApi:
          path: /authenticated/users/invite
          method: post
          authorizer: customAuthorizer
    environment:
      FUNCTION_NAME: inviteUser
      SPRING_CLOUD_FUNCTION_DEFINITION: inviteUser
    runtime: java11
    snapStart: true
    memorySize: 4096
    role: arn:aws:iam::xxxxxx:role/xxxxxx
    package:
      artifact: target/app-0.0.1-SNAPSHOT-aws.jar

When we access the function for the first time, we get a 500 status code (Internal Server error). This is intermittent and we currently do not see any fixed pattern. Subsequent requests to the function are successful. This is what we have from the CloudWatch logs.

For a failed request, there is only one line in the logs:

RESTORE_START Runtime Version: java:11.v15 Runtime Version ARN: arn:aws:lambda:eu-west-1::runtime:xxxxxxx

We do not see any RESTORE_REPORT or Restore Duration.

For a successful request, we see

RESTORE_START Runtime Version: java:11.v15 Runtime Version ARN: arn:aws:lambda:eu-west-1::runtime:xxxxx
RESTORE_REPORT Restore Duration: 1547.11 ms

followed by the start of the request and the application logs.

Could this issue be due to incorrect restoration of the snapshot? Is there any way we can troubleshoot this? I will try to create and share a public example in a couple of days.

msailes commented 1 year ago

Are you invoking the function before the creation of the version has completed?

bala-cbt commented 1 year ago

After the functions are deployed, they are accessed from a web application. We currently do not have any custom hooks or post-processing code. Is there a way we can determine whether the function is invoked before the creation of the version has completed? And is there a way we can defer the execution of the function until the version creation is completed? It would be very helpful if you could guide me or point me to any available documentation.

msailes commented 1 year ago

I'm not very familiar with the Serverless Framework. I would log into the console and check that the version has an active status at the time when you access the web application.

There might be a feature in the Serverless Framework to delay the next step until this has happened, but I don't know of one.
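The manual console check suggested above could also be scripted. The sketch below polls a state lookup until it reports `Active`. The `stateLookup` supplier is a hypothetical placeholder for whatever call returns the published version's state (for example, the `State` field of the Lambda function configuration); the method name, attempt count and pause are illustrative assumptions, and nothing here is a Serverless Framework feature.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class WaitForActiveVersionSketch {

    /**
     * Polls stateLookup until it returns "Active".
     * Returns false if the state becomes "Failed", the attempts run out,
     * or the waiting thread is interrupted.
     */
    static boolean waitUntilActive(Supplier<String> stateLookup, int maxAttempts, Duration pause) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String state = stateLookup.get();
            if ("Active".equals(state)) {
                return true;
            }
            if ("Failed".equals(state)) {
                return false; // version creation failed; waiting longer will not help
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pause.toMillis());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fake lookup standing in for a real API call: Pending twice, then Active.
        Iterator<String> fakeStates = List.of("Pending", "Pending", "Active").iterator();
        boolean ready = waitUntilActive(fakeStates::next, 5, Duration.ofMillis(10));
        System.out.println("ready = " + ready); // prints ready = true
    }
}
```

A deploy script could run such a check after publishing and only then point the web application at the new version.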

partaloski commented 1 year ago

My team and I are facing the same issue with one of our applications. As far as we know, it does not happen consistently: calling the same Lambda when it is cold does not guarantee that the restore will fail. Cold Lambdas seem to fail to restore at random, and at other times they restore with no issues. The CloudWatch logs don't give us much more information either: when it fails, we get a RESTORE_START but no RESTORE_REPORT after it.

As mentioned above, the subsequent call goes on with no issues, the response is as expected.

We started thinking of implementing an artificial delay in the handler of the function, but I don't know how that will affect this since we suspect that the issue occurs as soon as dependency injection starts.

msailes commented 1 year ago

If you think this is a problem with Lambda, please raise a support ticket with AWS. I would recommend that you check your code isn't failing on startup in any way. A try/catch block around your code to double-check and log any exceptions would be a good idea.
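One way to apply the try/catch suggestion is a small wrapper that logs and rethrows. This is a sketch only: `logFailures` and the sample `inviteUser` lambda are hypothetical names, and the wrapper uses plain `java.util.function.Function` with `java.util.logging` rather than anything specific to Spring Cloud Function.

```java
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingFunctionSketch {

    private static final Logger LOG = Logger.getLogger(LoggingFunctionSketch.class.getName());

    /** Wraps a function so that any runtime exception is logged before being rethrown. */
    static <I, O> Function<I, O> logFailures(String name, Function<I, O> delegate) {
        return input -> {
            try {
                return delegate.apply(input);
            } catch (RuntimeException e) {
                LOG.log(Level.SEVERE, "Function '" + name + "' failed for input: " + input, e);
                throw e; // rethrow so the caller still sees the failure
            }
        };
    }

    public static void main(String[] args) {
        // Hypothetical function body used only to exercise the wrapper.
        Function<String, String> inviteUser = logFailures("inviteUser", (String email) -> {
            if (email.isBlank()) {
                throw new IllegalArgumentException("email must not be blank");
            }
            return "invited " + email;
        });

        System.out.println(inviteUser.apply("user@example.com")); // prints invited user@example.com
    }
}
```

Note that this only catches exceptions thrown while a request is being handled. A failure during application startup happens before any function runs, so it would need the same kind of logging around the initialization code.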

partaloski commented 1 year ago

There are no logs from our application or from any dependency that is being injected. We will do further checks to make sure that the issue is not code-based, and then submit a ticket. I noticed that creating a new version/release fixes the issue for some reason.

olegz commented 11 months ago

"...creating a new version/release fixes the issue for some reason..." That is how you create a new instance of the app from the checkpoint.

Anyway, nothing is needed from the standpoint of Spring Cloud Function to integrate with AWS SnapStart. It just works, which is the best part about it. So I'll close the issue, as there is no action to be taken.