Function crash details not reported

DCVortexxx commented 5 months ago

Expected behavior

When a function crashes for an unexpected reason (fatalError, memory corruption...), the stack trace and details of the error are not stored/logged/reported to CloudWatch.

I'm fairly new to server-side Swift and AWS in general, so if I'm missing something, feel free to point it out. 🙏

Actual behavior

I would like to have more information about the crash, in order to debug and fix crashes in my lambda.

The only details I can see in CloudWatch when my function crashes are:

RequestId: xyz Error: Runtime exited with error: signal: illegal instruction
Runtime.ExitError

Steps to reproduce

Create a new lambda function
Make your function crash on purpose (using fatalError for instance)
Deploy the function
Execute it

If possible, minimal yet complete reproducer code (or URL to code)

You can simply use the ErrorHandling example from this repository.

Send a .fatal request, causing a crash.

What version of this project (`swift-aws-lambda-runtime`) are you using?

1.0.0-alpha

Swift version

The lambda is archived in a docker container using the image swift:5.9.0-amazonlinux2, on the ubuntu-latest runner (x86_64 architecture).

Amazon Linux 2 docker image version

swift:5.9.0-amazonlinux2

sebsto commented 4 months ago

Hello,

This illegal instruction error is most likely because you compiled for Arm64 and are executing on x64 (or the other way around).

If you compile on Apple Silicon machines (M1 or newer), be sure to create a Lambda function that runs on the Arm64 architecture.

If you use SAM to deploy, there is a one-line change in your SAM template:

         Architectures:
            - arm64

If you created your function in the AWS console, there is a similar parameter you can set at function creation time.

DCVortexxx commented 4 months ago

Hi Sébastien,

Thanks for replying.

I don’t think that is it: the architectures do match, and the lambda works fine most of the time.

However, I do have a race condition or logic error that makes it crash from time to time, and I can’t get any information or stack trace on the AWS console (other than the illegal instruction message).

Since I can’t reproduce locally, it is a pain to debug, and I’m trying to figure out if there is any way to get the stack trace of the crash.

Thanks for your time!

Max.

sebsto commented 4 months ago

@DCVortexxx You're saying that the error is intermittent and that it works most of the time. That rules out an architecture mismatch.

You can try setting the environment variable LOG_LEVEL=trace in the Lambda environment. The runtime will produce more tracing; maybe the cause will be visible there.

Otherwise, we will need to modify the error handling to produce more verbose output in case of a runtime crash.

DCVortexxx commented 4 months ago

Hello @sebsto, and thanks for your reply.
Sorry about the delay. I set the log level and, to be honest, it slipped my mind for a couple of weeks.

Setting LOG_LEVEL=trace in the environment does indeed increase the logging from the Lambda runtime SDK. Unfortunately, it still does not include a stack trace showing why a function exited with an error.

It is mentioned in the lifecycle management section of the README that:

> By default, the library also registers a Signal handler that traps INT and TERM, which are typical Signals used in modern deployment platforms to communicate shutdown request.

What I would expect (or like) is that, when such a signal is captured, the SDK provides the developer with sufficient information about what happened to fix their own issue.

However, I'm not fully sure how signal trapping works and maybe what I'm asking is impossible.
If so, maybe we could have an environment variable to disable signal trapping, letting the program crash and access the stack trace as we would in any other program that crashes?
Once again, I'm not very experienced in that area, so feel free to correct me if I'm misunderstanding something or if what I'm asking is impossible.

On a side note, I've added (a lot of) logs in my own function as well to help me debug, and I managed to pinpoint the location of the crash.
Still unsure about what happens and how to fix it, but at least there's progress 🙃

sebsto commented 4 months ago

I'm not sure it's possible to print a stack trace when the binary is compiled in release mode. Binaries typically crash with EXC_BAD_ACCESS error and nothing more. Can you reproduce the crash when executing locally in DEBUG mode?

Another debug strategy I often use is to capture the raw event (as string) passed to the runtime. Setting LOG_LEVEL=trace should allow you to capture the raw JSON. Then I verify if the JSON can be decoded by the corresponding Lambda Event struct.
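A minimal sketch of that decoding check, for illustration only. `OrderEvent` and `decodingFailure` are hypothetical names, and the JSON string stands in for a payload copied from the trace logs; substitute the event struct your handler actually decodes.

```swift
import Foundation

// Hypothetical event type; replace with the struct your handler decodes.
struct OrderEvent: Decodable {
    let id: String
    let quantity: Int
}

/// Tries to decode a raw JSON payload as `type`.
/// Returns nil on success, or a description of the failure.
func decodingFailure<T: Decodable>(of type: T.Type, in rawJSON: String) -> String? {
    do {
        _ = try JSONDecoder().decode(type, from: Data(rawJSON.utf8))
        return nil
    } catch let error as DecodingError {
        // DecodingError carries the coding path of the offending field.
        return "decoding failed: \(error)"
    } catch {
        return "unexpected error: \(error)"
    }
}

// Placeholder payload: "quantity" is a string, but the struct expects an Int.
let captured = #"{"id": "abc-123", "quantity": "two"}"#
if let failure = decodingFailure(of: OrderEvent.self, in: captured) {
    print(failure)
}
```

Running this against each captured payload shows whether a given event fails at the decoding stage, before the handler body runs.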

Anyway, we're about to rewrite the Lambda runtime to accommodate Swift 6 strict concurrency and Service Lifecycle. I suggest not changing anything related to signal handling in this version, but rather taking this feedback into consideration for v2. @fabianfett wdyt?

DCVortexxx commented 3 months ago

> I'm not sure it's possible to print a stacktrace when the binary is compiled in release mode. Binaries typically crash with EXC_BAD_ACCESS error and nothing more.

That makes sense indeed.
However, I don't think the final, uploaded binary actually crashes, since as the documentation states, the termination signal is trapped.
So I was thinking that maybe, in that case, the stack trace of where the signal happened would be available somewhere.
Once again, that's only an assumption, I'm definitely not an expert in that field.

> Can you reproduce the crash when executing locally in DEBUG mode?

No, I did not manage to reproduce it in debug, but given my logs in production, it happens ~0.05% of the time.
And since my project does not have a lot of users currently, the data is not that easy to get.

> Another debug strategy I often use is to capture the raw event (as string) passed to the runtime. Setting LOG_LEVEL=trace should allow you to capture the raw JSON. Then I verify if the JSON can be decoded by the corresponding Lambda Event struct.

Yeah, all good on that side, there's nothing distinctive about the event that could explain it.
With the same event content, 99.95% of the time, the lambda executes and terminates as expected, but 0.05% of the time, the lambda logs Runtime exited with error: signal: illegal instruction.

I managed to narrow it down to a call to URLSession.dataTask(with:completionHandler:).
I have some logs on the line just before that call, and some logs just after it, and the second ones are not shown.
The completion is not called either, of course.

This call is wrapped in a withCheckedContinuation in order to make use of Swift concurrency, because the async URLSession API is not available on Linux.
I'm currently trying to simply get rid of this async-await wrapper and see if it improves things.
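For reference, a minimal sketch of such a wrapper, not the code from my lambda. `awaitResult` and `fetchData` are hypothetical names. The point it illustrates is that every path through the completion handler resumes the continuation exactly once: a checked continuation traps if it is resumed more than once, and the awaiting task never finishes if it is never resumed. Whether that has anything to do with the crash here is not established.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives in this module on Linux
#endif

/// Bridges a completion-handler API to async/await.
/// `start` must call its completion exactly once.
func awaitResult<T>(
    _ start: (@escaping (Result<T, Error>) -> Void) -> Void
) async throws -> T {
    try await withCheckedThrowingContinuation { continuation in
        start { result in
            continuation.resume(with: result)
        }
    }
}

/// Hypothetical helper wrapping URLSession.dataTask(with:completionHandler:).
func fetchData(
    for request: URLRequest,
    using session: URLSession = .shared
) async throws -> (Data, URLResponse) {
    try await awaitResult { completion in
        session.dataTask(with: request) { data, response, error in
            // Exactly one branch runs, so completion is called exactly once.
            if let error = error {
                completion(.failure(error))
            } else if let data = data, let response = response {
                completion(.success((data, response)))
            } else {
                completion(.failure(URLError(.badServerResponse)))
            }
        }.resume()
    }
}
```

Keeping the bridge in one small generic function makes it possible to exercise the continuation logic with a plain callback, without touching the network.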

> Anyway, we're about to rewrite the Lambda runtime to accommodate Swift 6 strict concurrency and Service Lifecycle. I suggest not changing anything related to signal handling in this version, but rather taking this feedback into consideration for v2. @fabianfett wdyt?

That definitely makes sense. Thanks again for your time!
