regen-network / rnd-dev-team

RND Dev Team issue management (DEPRECATED)

Prevent image caching from crashing registry-server #912

Closed · wgwz closed this issue 2 years ago

wgwz commented 2 years ago

Is your feature request related to a problem? Please describe.

The registry-server crashes when a large image file gets uploaded. It seems that when our redis instance reaches capacity, the image caching layer in the registry-server starts consuming memory on the web server itself, and the web server then crashes (Heroku throws the R15 error). This is evidenced in the logs from the day we last saw the application crash:

  1. the app is handling a request for a Mai Ndombe image (which succeeds)
  2. the app's memory is already running at 177% at this point (it had been at this level prior to these logs, while still handling requests successfully)
  3. then, after this sequence of requests (including some POST requests), the app's memory jumps to 250%
  4. then Heroku throws Error R15 (Memory quota vastly exceeded), which puts the app into a crashed state

Describe the solution you'd like

  1. When the redis instance reaches capacity or is simply unavailable to the web server, the web server should not cache images, consume unbounded amounts of memory, and subsequently crash. If we cannot cache images, we should prefer serving uncached content, or even returning 404s (or similar), over having the application crash (see the sketch after this list).
  2. Additionally, it would be nice to take care of #867 but that is still a separate issue.
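
To make item 1 above more concrete, here is a minimal sketch of the fail-open behaviour I have in mind, assuming an Express-style route and an ioredis client. `fetchImage`, the route path, and the TTL are placeholders for illustration, not the actual registry-server code:

```typescript
import express from 'express';
import Redis from 'ioredis';

const app = express();
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Placeholder for however registry-server actually loads the original
// image bytes (e.g. from S3 or an upstream URL); purely illustrative.
async function fetchImage(key: string): Promise<Buffer> {
  const resp = await fetch(`https://example.com/images/${key}`);
  return Buffer.from(await resp.arrayBuffer());
}

app.get('/image/:key', async (req, res) => {
  const { key } = req.params;

  try {
    const cached = await redis.getBuffer(key);
    if (cached) return res.type('image/png').send(cached);
  } catch (err) {
    // Redis is unavailable or at capacity: skip caching entirely rather
    // than falling back to unbounded in-process memory.
    console.warn('image cache read failed, serving uncached content', err);
  }

  const image = await fetchImage(key);
  // Best-effort cache write; a failed or full Redis should never crash the dyno.
  redis.set(key, image, 'EX', 3600).catch(() => undefined);
  return res.type('image/png').send(image);
});

app.listen(Number(process.env.PORT ?? 3000));
```

The key point is that every cache failure degrades to "serve uncached" rather than buffering images in the web process.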

Describe alternatives you've considered (optional)

There are probably some options to consider in how to achieve item 1 from the list above.

Additional context (optional)

Relates to the work to resize our redis instance: https://github.com/regen-network/regen-registry/issues/867

Chat: https://regen-network.slack.com/archives/C01LX9E8QN8/p1650380330865139

cc @blushi @mhagel



blushi commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @wgwz @mhagel

wgwz commented 2 years ago

I believe we need to allocate more memory to the Heroku instances. When I run the registry-server on my local computer, it uses ~870MB of memory at idle. We can observe a similar amount of memory usage in the staging Heroku app, which makes sense because, roughly speaking, that app is probably "idling" for the most part as well.
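
For reference, here is a small sketch of one way to observe the idle number from inside the process, assuming a Node runtime (the ~870MB figure above came from watching the running server, not from this exact snippet):

```typescript
// Log resident set size (the figure Heroku's memory quota is based on)
// and heap usage every 30 seconds.
const toMb = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);

setInterval(() => {
  const { rss, heapUsed } = process.memoryUsage();
  console.log(`memory: rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB`);
}, 30_000);
```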

[Screenshot: staging dyno memory metrics]

You can see that the staging application is running at more than 150% of its maximum memory. It cycles up above 200%, at which point Heroku kills the dyno and restarts it.

It is possible that a memory leak explains this cycling; after all, there is a clear pattern that looks like a linear increase. However, before we investigate that, I think it makes sense to upgrade the dynos so that ideally they idle at < 50% of max memory. It's a red flag that we are already above 150% memory usage at idle.

Production has 2x the memory available to the staging environment, but it is still idling at 100%:

[Screenshot: production dyno memory metrics]

In terms of next steps, one option would be to double the available memory in both environments. Ideally we could upgrade both environments to Performance M dynos ($250/dyno per month, but easy). Alternatively, we could dig deeper into reducing the memory usage at idle (difficult). We could also consider hosting the application on different infrastructure (a non-trivial migration).

I know we suspect that fixing the redis memory issue may improve this situation. I'm planning to work on that issue now, and then we can see if it makes a difference.

wgwz commented 2 years ago

Updating with new observations from the staging environment after upgrading redis (via #867)

[Screenshot: staging memory metrics after the redis upgrade]

The memory usage for the Heroku instance is still running at the same level as before the upgrade (~188%). #867 included a fix for a memory leak in a library we were using, and so far it seems like the app is not restarting itself as often. However, if it goes above 200% again it will crash and restart (it still gets quite close to the 200% limit).

After thinking about resolutions for this, one thing we could consider is upgrading staging to a Standard-2X dyno while upgrading the production dyno to the Performance-M type. This would make crashes much less frequent in the staging environment, and the same would obviously be true for production.

https://www.heroku.com/dynos https://devcenter.heroku.com/articles/dyno-types

That said, I think the ideal would be to have both environments provisioned the same way. And of course we can still pursue other options from my prior comment if we think those are preferable.

wgwz commented 2 years ago

I could not confirm or deny that our image caching layer is actually the culprit for eating up application memory in the scenario where redis is nearly at capacity. However, since we are increasing the memory available to redis, that scenario becomes less likely. Furthermore, there was actually a memory leak in the image caching layer that was fixed in #867. I think that with those changes, as well as adjusting the memory available to the Heroku dynos, we should see far fewer crashes.
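
If we ever do want an in-process fallback cache instead of skipping caching when redis is unavailable, it should at least be bounded. A sketch using a recent version of the `lru-cache` package (an assumption for illustration; this is not what the registry-server currently uses):

```typescript
import { LRUCache } from 'lru-cache';

// A bounded in-memory image cache: caps total cached bytes so that a full
// or unavailable Redis can never push the dyno past its memory quota.
const fallbackCache = new LRUCache<string, Buffer>({
  maxSize: 50 * 1024 * 1024,             // 50MB ceiling (illustrative value)
  sizeCalculation: (buf) => buf.length,  // count actual image bytes
  ttl: 10 * 60 * 1000,                   // drop entries after 10 minutes
});

fallbackCache.set('some-image-key', Buffer.alloc(1024));
const hit = fallbackCache.get('some-image-key'); // Buffer | undefined
console.log(hit?.length ?? 'miss');
```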

wgwz commented 2 years ago

To summarize:

Perhaps we can wait and see how the system performs, and create a new task to implement one of the above options later, if necessary.

@blushi @mhagel @haveanicedavid what do you think?

wgwz commented 2 years ago

Related: https://github.com/regen-network/regen-registry/issues/922

blushi commented 2 years ago

In the meantime, maybe we can just close this then? @wgwz

wgwz commented 2 years ago

@blushi sounds good!