transitmatters / t-performance-dash

TransitMatters performance visualizer for the MBTA
https://dashboard.transitmatters.org/
MIT License

Cold Start Optimization #548

Closed PatrickCleary closed 5 months ago

PatrickCleary commented 1 year ago

Updating this top comment to add Devin's suggestion: serverside caching. Probably the best solution.

The Today page takes way too long to load.

Personally, I think it's unacceptably slow at this point. Would like to fix this for v4.0, especially since the home page will likely be a combination of these same queries for each line.

We could make a DB that just has every value needed for the home/today page and gets updated every 5-10 minutes.

devinmatte commented 1 year ago

There's got to be some kind of server-side caching we can do without having to spin up a whole DB, like an in-memory cache.

PatrickCleary commented 1 year ago

Just occurred to me, I might be experiencing this because I've been on the West coast for the last 3 weeks and all of our backend is in us-east-1.

Didn't stand out to me so much previously, likely only since I came out here.

Update: I don't think this is the cause. At most it would be a couple hundred ms of latency, and the widgets take 5+ seconds to load.

JNuss71 commented 1 year ago

I was exploring just doing memoization with a TTL Cache on the existing alerts() function but the parameters for the function are not hashable which complicates things. It also leads to an inconsistent experience once the TTL expires and the user has to wait again for the API call. This would happen each TTL period for each unique set of parameters. Just dropping in a cache decorator for an existing function might be more useful in other situations or for a different function I am not thinking of.
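For reference, a minimal sketch of one way around the unhashable-parameters problem: serialize the arguments into a string and use that as the cache key. The `ttl_cache` decorator and `fetch_alerts` function are made-up names for illustration, not anything in the repo, and this does nothing about the "user waits again once the TTL expires" problem described above.

```python
import json
import time
from functools import wraps


def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Memoize a function for ttl_seconds, keyed on a JSON dump of its arguments."""

    def decorator(func):
        store = {}  # key -> (expiry_time, value)

        @wraps(func)
        def wrapper(*args, **kwargs):
            # dicts and lists are unhashable, so turn the arguments into a
            # stable string; sort_keys makes the key independent of dict order.
            key = json.dumps([args, kwargs], sort_keys=True, default=str)
            now = clock()
            hit = store.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            value = func(*args, **kwargs)
            store[key] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator


calls = []  # records every real (uncached) call, for demonstration only


@ttl_cache(ttl_seconds=300)
def fetch_alerts(params):
    # Hypothetical stand-in for the slow upstream API call.
    calls.append(params)
    return {"route": params["route"], "alerts": []}
```

Calling `fetch_alerts` twice with the same parameters (in any key order) within the TTL only hits the upstream once. Note the cache lives in process memory, so on Lambda it disappears whenever the container is recycled.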

I still haven't nailed down the right way to do this but I'm imagining just using a python object like a cache (or dict) to store all the events for today and yesterday and then just have a lambda on a cron schedule to keep it updated. There can also potentially be a refresh button that calls the lambda to load the live value outside of the cron schedule. With this the only slowness upon opening the site comes from having to repopulate the object from the API if the lambda goes cold and clears the python state. I think this also requires some refactoring work to utilize the object based cache rather than API calls with the specific parameters passed in but I'll have to explore how the pastalerts actually works first to see how the refactor would work.
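A rough sketch of the object-based cache idea, assuming a hypothetical `loader` callable that fetches today's and yesterday's events. `EventCache` is an illustrative name, not existing code. One caveat worth checking before building on this: module-level state belongs to a single Lambda execution environment, so a separate scheduled function would not fill the memory of the API handler's containers; the refresh would have to run inside the same function or write to shared storage.

```python
import time


class EventCache:
    """Holds pre-fetched results in process memory.

    The contents survive between invocations only while the Lambda
    container stays warm.
    """

    def __init__(self, loader, clock=time.time):
        self._loader = loader  # hypothetical callable that fetches fresh data
        self._clock = clock
        self._data = None
        self.refreshed_at = None

    def refresh(self):
        # Called on a schedule, or by a manual refresh button.
        self._data = self._loader()
        self.refreshed_at = self._clock()
        return self._data

    def get(self):
        # After a cold start the cache is empty, so the first reader
        # pays for one full load; later readers get the stored value.
        if self._data is None:
            self.refresh()
        return self._data
```

With this shape, request handlers would call `get()` instead of hitting the API with specific parameters, which is the refactor mentioned above.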

Edit: not sure if I understand lambda environments completely so I'll test it out and I also was looking at the wrong api call :|

PatrickCleary commented 1 year ago

I’ve recently noticed that the slowness only occurs on the first load. Also, there is no slowness when switching between lines, even though that would be a separate API call.

So idk what is causing it, maybe the lambda going cold as you mentioned @JNuss71
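One cheap way to check the cold-lambda theory, sketched under the assumption of a plain Lambda-style `handler(event, context)` (the handler name and return shape are illustrative, not the dashboard's actual code): module-level code runs once per container, so a flag set at import time tells you whether a given invocation was the first one in that container.

```python
import time

# Module-level code runs once per container, i.e. on a cold start.
_COLD = True
_LOADED_AT = time.time()


def handler(event, context):
    """Report whether this invocation was the first in its container."""
    global _COLD
    was_cold = _COLD
    _COLD = False
    return {
        "cold_start": was_cold,
        "container_age_s": round(time.time() - _LOADED_AT, 3),
    }
```

If the slow first loads line up with `cold_start: True` and the fast ones with `False`, that points at the container going cold rather than at the queries themselves.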

devinmatte commented 1 year ago

We can track cold starts in Datadog along with traces: https://app.datadoghq.com/functions?cloud=aws&panel_end=1684110333814&panel_paused=false&panel_start=1684023933814&panel_tab=invocations&selection=aws-lambda-functions%2Bdatadashboard-v4-beta-apihandler-santspvqh3tc%2Bus-east-1%2B473352343756&start=1683937491077&end=1684110291077&paused=false @JNuss71 I can get you an invite to take a look at the tracing data

devinmatte commented 1 year ago

Is this still an issue now that the today page is different?

If so, we may want to consider a CloudFront cache layer: https://theburningmonk.com/2019/10/all-you-need-to-know-about-caching-for-serverless-applications/
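For a CDN layer to help, the API responses need to say how long they may be cached. A minimal sketch of a helper that builds a standard `Cache-Control` header; the function name and the example durations are assumptions, and how the values are applied also depends on the distribution's own TTL settings.

```python
def cache_control_headers(browser_seconds, cdn_seconds):
    """Build a Cache-Control header with separate browser and CDN lifetimes.

    max-age applies to browsers; s-maxage applies to shared caches such as
    a CDN and takes precedence over max-age there.
    """
    if browser_seconds < 0 or cdn_seconds < 0:
        raise ValueError("cache lifetimes must be non-negative")
    return {
        "Cache-Control": f"public, max-age={browser_seconds}, s-maxage={cdn_seconds}"
    }
```

For example, `cache_control_headers(60, 300)` lets browsers reuse a response for one minute and the CDN for five, so most first loads would be served from the edge instead of waking a Lambda.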

JNuss71 commented 1 year ago

I think the cold starts are still an issue, although the new System homepage somewhat serves as a loading screen. I've still seen the charts take a couple of seconds to show after scrolling down on first load. The Today page is gone, but the cold start issue doesn't seem to be specific to that page.

It's unfortunate that Chalice doesn't implement provisioned concurrency or API Gateway caching. I don't know if there's a way to implement those with our existing code through something like a CloudFormation template. The CloudFront caching may be the best option.

devinmatte commented 1 year ago

I'm going to move this out of the v4 launch requirements. I suspect there's not much we can do here, and once we have regular users the cold starts will be less common. We'll leave this open to track it and keep looking for a solution to make things better. We're also on Python 3.10 now, and that may help slightly.