Live stage downtime - Githubissues

chrisandrewcl commented 3 weeks ago

Sometimes I notice my lambdas are called, but no invocation is logged in the local sst dev log. It is not the logging that fails; rather, the connection between the lambda and the local server silently fails. Restarting the sst dev process fixes the issue, but it is not always clear when it is happening.

Any ideas why this happens?

Also, given usage reports and observed behavior, it seems that lambdas that fail to call home keep running, doing nothing, until they time out, which in some cases is very wasteful. Even if the server is restarted, running lambdas wait out their full timeout before trying to connect again.

Maybe this can be improved somehow?

Some ideas:

Short bridge timeout, aborting earlier with clear feedback
When the local server restarts, if the bridge is still running it should connect right away
Make the bridge return ok to avoid stuck retry loops
Sum of the above, but configurable
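As a rough illustration of how the ideas above could fit together, here is a minimal sketch of a configurable timeout policy. Everything in it is hypothetical: `BridgeConfig`, `decideBridgeAction`, and the option names are made up for this example and are not part of sst. The bridge would call the function on each pass of its wait loop, so a restarted local server is picked up on the next pass instead of after the full lambda timeout.

```typescript
// Hypothetical configuration; none of these option names exist in sst.
interface BridgeConfig {
  // How long the bridge waits for the local dev server before giving up.
  connectTimeoutMs: number;
  // "fail" surfaces an error; "ack" returns success so the event source stops retrying.
  onTimeout: "fail" | "ack";
}

type BridgeAction =
  | { kind: "forward" }
  | { kind: "wait"; remainingMs: number }
  | { kind: "fail"; reason: string }
  | { kind: "ack"; reason: string };

// Decide what the bridge should do, given whether the local server is
// reachable and how long this invocation has already been waiting.
function decideBridgeAction(
  connected: boolean,
  elapsedMs: number,
  config: BridgeConfig,
): BridgeAction {
  // Forward as soon as the local server is reachable, including after a restart.
  if (connected) return { kind: "forward" };

  const remainingMs = config.connectTimeoutMs - elapsedMs;
  if (remainingMs > 0) return { kind: "wait", remainingMs };

  // Give up early with a clear message instead of idling until the lambda timeout.
  const reason = `no response from local dev server after ${config.connectTimeoutMs}ms`;
  return config.onTimeout === "ack"
    ? { kind: "ack", reason }
    : { kind: "fail", reason };
}

// Example: a 5 second budget, acknowledging the event once it is used up.
const exampleConfig: BridgeConfig = { connectTimeoutMs: 5000, onTimeout: "ack" };
console.log(decideBridgeAction(false, 1000, exampleConfig));
console.log(decideBridgeAction(false, 5000, exampleConfig));
```

Whether "ack" is a safe default depends on the event source: it stops retry loops, but it also drops the event silently, which is why the sketch keeps it as an explicit option.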

* This has been happening with several versions, but I have yet to test 0.0.403.

** Not sure if it is the correct term, but by "bridge" I mean the code that the lambda executes in live mode to connect with the local server.

jayair commented 2 weeks ago

Yeah, I guess let's first figure out why the connection is getting dropped. Can you share some logs for when that happens? Both on the CloudWatch side and locally?

Or if you have any clues as to when this happens. Are you leaving the CLI running overnight?

chrisandrewcl commented 2 weeks ago

> Can you share some logs for when that happens? Both on the CloudWatch side and locally?

Ok, I'll look for it next time it happens.

> Or if you have any clues as to when this happens. Are you leaving the CLI running overnight?

Not overnight, just normal usage for a few hours.

* But some of the suggestions above came from an unfortunate incident during my initial experiments: I left in a hurry and didn't notice that sst remove had failed, so a single SQS message with a consumer that had no proper redrive policy kept eating through my dev account's free tier and left me with an unexpected bill. It would be nice to have more budget-friendly behavior when the dev server for a live stage is not running. If you prefer to keep this issue focused on the disconnections, I can open another one for this part instead. Please let me know.
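For reference, the missing redrive setup mentioned above comes down to one SQS queue attribute. The sketch below only builds that attribute; `buildRedriveAttributes`, `RedriveOptions`, and the example ARN are hypothetical names for illustration, while `RedrivePolicy`, `deadLetterTargetArn`, and `maxReceiveCount` are the attribute and field names SQS itself uses. The resulting map is the shape the SQS `SetQueueAttributes` action expects for its `Attributes` parameter.

```typescript
// Hypothetical helper types for this example.
interface RedriveOptions {
  deadLetterQueueArn: string;
  // Number of receives before SQS moves the message to the dead-letter queue.
  maxReceiveCount: number;
}

// Build the queue attributes that cap how often a failing message is retried.
function buildRedriveAttributes(options: RedriveOptions): Record<string, string> {
  if (!Number.isInteger(options.maxReceiveCount) || options.maxReceiveCount < 1) {
    throw new RangeError("maxReceiveCount must be a positive integer");
  }
  return {
    // SQS expects the policy as a JSON string, not as a nested object.
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: options.deadLetterQueueArn,
      maxReceiveCount: options.maxReceiveCount,
    }),
  };
}

// Placeholder ARN; substitute the ARN of a real dead-letter queue.
const exampleAttributes = buildRedriveAttributes({
  deadLetterQueueArn: "arn:aws:sqs:us-east-1:123456789012:dev-dlq",
  maxReceiveCount: 3,
});
console.log(exampleAttributes.RedrivePolicy);
```

With a policy like this, a message whose consumer keeps timing out is set aside after three receives instead of being redelivered indefinitely.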

brapifra commented 2 weeks ago

It disconnects quite often for me too, not sure why.

✓  No changes
   api: https://******.lambda-url.us-east-1.on.aws/

time=2024-06-06T15:31:54.429+02:00 level=INFO msg="INFO unlocking app=*** stage=***"
time=2024-06-06T15:31:54.850+02:00 level=INFO msg="file event" path=***/sst-env.d.ts op=CHMOD
time=2024-06-06T15:31:54.850+02:00 level=INFO msg=publishing type=*watcher.FileChangedEvent
time=2024-06-06T15:31:54.850+02:00 level=INFO msg="checking if code needs to be rebuilt" file=***/sst-env.d.ts
time=2024-06-06T15:31:54.971+02:00 level=INFO msg="waiting for file changes"
time=2024-06-06T15:32:39.140+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:32:39.140+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:32:40.111+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:33:40.115+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:33:40.117+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:33:41.021+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:34:41.022+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:34:41.022+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:34:42.023+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:35:42.029+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:35:42.030+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:35:42.942+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:36:42.981+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:36:42.981+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:36:44.820+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:36:46.499+02:00 level=INFO msg="mqtt reconnecting"

EDIT: It might have been because I forgot to turn off my VPN 🙃. Checking now whether it works as expected without it.
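One way to look for a pattern in a log like the one above is to measure how long each mqtt connection lasted. The sketch below does that for the "mqtt connected" and "mqtt connection lost" lines copied from the log; `parseEvents` and `connectionLifetimesMs` are names made up for this example. Each connection turns out to last almost exactly 60 seconds, which might point to an idle or keep-alive timeout somewhere on the path (a VPN would be one candidate), but that is a guess the log alone does not confirm.

```typescript
// Lines copied from the log above; the "mqtt reconnecting" lines are omitted.
const sampleLog = `
time=2024-06-06T15:32:39.140+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:32:40.111+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:33:40.115+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:33:41.021+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:34:41.022+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:34:42.023+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:35:42.029+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:35:42.942+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:36:42.981+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
`;

interface MqttEvent {
  at: number; // timestamp in milliseconds since the epoch
  msg: string;
}

// Extract the timestamp and message of every mqtt line.
function parseEvents(text: string): MqttEvent[] {
  const pattern = /^time=(\S+) .*?msg="(mqtt [^"]+)"/;
  const events: MqttEvent[] = [];
  for (const line of text.split("\n")) {
    const match = pattern.exec(line.trim());
    const time = match?.[1];
    const msg = match?.[2];
    if (time && msg) events.push({ at: Date.parse(time), msg });
  }
  return events;
}

// Pair each "mqtt connected" with the next "mqtt connection lost".
function connectionLifetimesMs(events: MqttEvent[]): number[] {
  const lifetimes: number[] = [];
  let connectedAt: number | null = null;
  for (const event of events) {
    if (event.msg === "mqtt connected") {
      connectedAt = event.at;
    } else if (event.msg === "mqtt connection lost" && connectedAt !== null) {
      lifetimes.push(event.at - connectedAt);
      connectedAt = null;
    }
  }
  return lifetimes;
}

// logs [60004, 60001, 60006, 60039] (milliseconds)
console.log(connectionLifetimesMs(parseEvents(sampleLog)));
```

The first "connection lost" line has no preceding "connected" line in this excerpt, so it is skipped rather than guessed at.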

thdxr commented 2 weeks ago

hey was this resolved?

chrisandrewcl commented 1 week ago

@thdxr Not sure, it looks like it was intermittent, but I have yet to experience it again since opening this issue. Was it addressed somehow in the later versions?

jayair commented 1 day ago

Might be, I'll close this for now. Feel free to reopen.

sst / ion

Live stage downtime #514