Closed: @wgwz closed this issue 2 years ago
Hey team! Please add your planning poker estimate with ZenHub @wgwz @mhagel
I believe that we need to allocate more memory resources to the Heroku instances. When I run the registry-server on my local computer, it uses ~870 MB of memory at idle. We can observe a similar amount of memory usage in the staging Heroku app. This makes sense because, roughly speaking, that app is probably "idling" for the most part as well.
You can see that the staging application is running at more than 150% of its maximum memory. It is cycling to above 200%, at which point Heroku kills the dyno and restarts it.
It is possible that there is a memory leak that explains this cycling. After all, there is a clear pattern that looks like a linear increase. However, I think that before we investigate that, it makes sense to upgrade the dynos so that ideally they are idling at < 50% of max memory. It's a red flag that at idle we are already above 150% memory usage.
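If/when we do dig into the possible leak, a cheap first step would be to log resident memory on an interval and watch the trend. A minimal sketch, assuming the registry-server is a Node process; `MEMORY_QUOTA_MB` is a hypothetical env var we would set ourselves to match the dyno size (it is not something Heroku provides):

```ts
// memory-sampler.ts (illustrative only)
// Periodically log resident memory so we can confirm or rule out the
// linear-growth pattern we see in the Heroku metrics.

const quotaMb = Number(process.env.MEMORY_QUOTA_MB ?? 512); // e.g. 512 for standard-1x, 1024 for standard-2x
const intervalMs = 60_000; // one sample per minute

setInterval(() => {
  const rssMb = process.memoryUsage().rss / 1024 / 1024;
  const pct = (rssMb / quotaMb) * 100;
  console.log(`[memory-sampler] rss=${rssMb.toFixed(1)}MB (${pct.toFixed(0)}% of ${quotaMb}MB quota)`);
  if (pct > 150) {
    console.warn('[memory-sampler] approaching the 200% threshold where Heroku kills the dyno');
  }
}, intervalMs).unref(); // unref so the timer does not keep the process alive
```

If the logged RSS climbs steadily between restarts rather than plateauing, that would support the leak hypothesis; if it plateaus near the idle figure, the problem is simply an undersized dyno.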
Production has 2x the available memory that the staging environment has, but it is still idling at 100%:
In terms of good next steps, one option would be to double the available memory in both environments. Ideally we could upgrade both environments to Performance-M dynos ($250 per dyno per month, but easy). Alternatively, we could dive deeper into how to reduce the memory usage at idle (difficult). We could also consider hosting the application on different infrastructure (a non-trivial migration).
I know that we have some suspicion that fixing the redis memory issues may improve this situation. I'm thinking I will work on that issue now, and then we can see if it makes a difference.
Updating with new observations from the staging environment after upgrading redis (via #867)
Memory usage for the Heroku instance is still running at the same level as prior to the upgrade (~188%). #867 included a fix for a memory leak in a library we were using, and so far it seems like the app is not restarting itself as much. However, if it goes above 200% again it will crash and restart (it still gets quite close to the 200% limit).
After thinking about resolutions for this, one thing we could consider is upgrading staging to a Standard-2X dyno, while the production dyno could be upgraded to the Performance-M type. This would make the crashes much less frequent in the staging environment, and the same would obviously be true for production.
- https://www.heroku.com/dynos
- https://devcenter.heroku.com/articles/dyno-types
That said, I think the ideal would be to have both environments provisioned the same way. And of course we can still pursue other options from my prior comment if we think those are preferable.
I could not confirm or deny that our image caching layer is actually the culprit for eating up application memory in the scenario where redis is nearly at capacity. However, since we are increasing the memory available to redis, this scenario becomes less likely. Furthermore, there was a memory leak in the image caching layer that was fixed in #867. I think that with those changes, as well as adjusting the available memory for the Heroku dynos, we should see far fewer crashes.
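Just to illustrate the kind of guard I would want in the caching layer if we ever harden it: whatever in-process fallback it uses should have a hard byte ceiling, so a full redis can never translate into unbounded web-server memory. A rough sketch only (not our actual caching code; the names, the 64 MB budget, and the Buffer-based interface are all made up for illustration):

```ts
// bounded-image-cache.ts (illustrative only)
// An in-memory fallback cache with a hard byte budget, so that falling back
// from redis cannot grow the web process without bound.

const MAX_BYTES = 64 * 1024 * 1024; // hypothetical 64 MB ceiling for cached images

const cache = new Map<string, Buffer>(); // Map preserves insertion order (oldest first)
let totalBytes = 0;

export function putImage(key: string, data: Buffer): void {
  if (data.length > MAX_BYTES) return; // too large to cache at all; serve uncached

  // Evict oldest entries until the new image fits within the budget.
  while (totalBytes + data.length > MAX_BYTES && cache.size > 0) {
    const oldestKey = cache.keys().next().value as string;
    totalBytes -= cache.get(oldestKey)!.length;
    cache.delete(oldestKey);
  }

  cache.set(key, data);
  totalBytes += data.length;
}

export function getImage(key: string): Buffer | undefined {
  return cache.get(key);
}
```

A real implementation would more likely reach for an existing LRU library; the point is only that the fallback has a fixed ceiling, so the dyno's memory footprint stays predictable even when redis is at capacity.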
To summarize: perhaps we can wait to see how the system performs, and create a new task to implement one of the above options later, when necessary.
@blushi @mhagel @haveanicedavid what do you think?
In the meantime, maybe we can just close this then? @wgwz
@blushi sounds good!
Is your feature request related to a problem? Please describe.
The registry-server crashes when a large image file gets uploaded. It seems that when our redis instance reaches capacity, the image caching layer in the registry-server begins to use memory on the web server. Due to that, the web server crashes (Heroku throws the R15 error). This is evidenced in the logs from the day that we last saw the application crash:
Describe the solution you'd like
Describe alternatives you've considered (optional)
There are probably some options to consider for how to achieve item 1 from the list above.
Additional context (optional)
Relates to the work to resize our redis instance: https://github.com/regen-network/regen-registry/issues/867
Chat: https://regen-network.slack.com/archives/C01LX9E8QN8/p1650380330865139
c.c. @blushi @mhagel
For Admin Use