rsanheim closed 4 years ago
Ran the cache warm job on Sandbox on the current instance type (no upgrade yet):
deploy@ec2-13-235-33-14:~/apps/simple-server/current$ time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"
real 2m49.548s
user 0m17.600s
sys 0m0.616s
So just under 3 minutes. CPU on the DB spiked of course...the strange thing is it hasn't come back down yet:
Could be a lag in reporting, or it could be that we are pushing the DB out of memory, which degrades performance?
Started the maintenance on sandbox at 11:57 Central time. Upgrading to a db.r4.xlarge, which expands our size to 4 vCPUs and 30.5 GiB. There is quite the difference between the two.
The upgrade took about 10 minutes - hard to tell from the AWS logs because they only show the start of the config change, not when it finished. Pretty fast though.
I ran the cache job again after, and the run time wasn't super improved but the CPU usage was quite a bit better:
deploy@ec2-13-235-33-14:~/apps/simple-server/current$ time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"
real 2m36.927s
user 0m17.060s
sys 0m0.580s
That first spike is the smaller instance, the second spike is the new db.r4.xlarge.
Upgraded production replica DB to a r4.2xlarge, took under 10 minutes.
Look at all that CPU headroom we got around 18:00 when the instance size changed over. ✨ ✨ ✨
@simpledotorg/server-developers I've updated the body of this issue with latest status and next steps. Nilenso team can take over when they are online with the last few remaining steps that require production access, but I think we are real close to at least a medium-term solution for our database / caching woes. 😁
The Reports pages are consistently taking ~4-5 seconds to load post cache-warmup. I suspect this will come down further once the missing cache is addressed in https://github.com/simpledotorg/simple-server/pull/1260. I'm confident we can reach a good 1-2 second load time on this.
The DB Utilization hovered around 40-50% for 4.3 hours during the warmup with nothing else running.
With the appointment reminders in between, it shot up to ~70% for ~10 minutes and then came back.
Just merged https://github.com/simpledotorg/simple-server/pull/1269, which uses a faster query and adds indexes on the main mat view we are using.
Cuts the cache time quite a bit on sandbox:
time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"
real 1m11.690s
user 0m16.340s
sys 0m0.600s
Will be monitoring on production when it runs shortly here.
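To put the three sandbox runs side by side, here is a small Ruby sketch that converts the `real` lines from the `time` output above into seconds and prints each run relative to the first. The helper name `real_seconds` and the run labels are made up for illustration; only the three durations come from this thread.

```ruby
# Convert the "real" value printed by `time` (e.g. "2m49.548s") to seconds.
def real_seconds(text)
  match = text.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unrecognized duration: #{text}" unless match

  match[1].to_i * 60 + match[2].to_f
end

# Wall-clock times copied from the three sandbox runs in this thread.
# The labels are illustrative, not official names.
runs = {
  "before upgrade" => "2m49.548s",
  "db.r4.xlarge" => "2m36.927s",
  "db.r4.xlarge + PR 1269" => "1m11.690s",
}

baseline = real_seconds(runs.values.first)

runs.each do |label, real|
  seconds = real_seconds(real)
  # Speedup is baseline time divided by this run's time.
  puts format("%-24s %7.1fs  %.1fx vs. first run", label, seconds, baseline / seconds)
end
```

On these numbers the instance upgrade alone barely moves wall-clock time, while the query and index change accounts for most of the improvement.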
BD production looks really good:
[BD_production] Finished Facility caching in 4 seconds, total cache time was 8 seconds.
[BD_production] Reports::RegionCacheWarmer All done!
IN prod looks much better - down to 33 minutes from 4.5 hours - over an 8x speedup. 🚀🚀🚀
[IN_production] Finished Facility caching in 1771 seconds, total cache time was 2011 seconds.
[IN_production] Reports::RegionCacheWarmer All done!
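The speedup figure can be checked straight from the log line. This is a minimal sketch: the regular expression is written against the log format shown above, and `cache_times` is a hypothetical helper, not something from simple-server.

```ruby
# Matches the warmer's summary line as it appears in this thread.
LOG_PATTERN = /\[(?<env>\w+)\] Finished Facility caching in (?<facility>\d+) seconds, total cache time was (?<total>\d+) seconds\./

# Returns the environment and timings from a summary line, or nil if it does not match.
def cache_times(line)
  match = LOG_PATTERN.match(line)
  return nil unless match

  { env: match[:env], facility: match[:facility].to_i, total: match[:total].to_i }
end

line = "[IN_production] Finished Facility caching in 1771 seconds, total cache time was 2011 seconds."
times = cache_times(line)

previous_seconds = 4.5 * 3600 # the "4.5 hours" quoted above
minutes = (times[:total] / 60.0).round(1)
speedup = (previous_seconds / times[:total]).round(1)

puts "#{times[:env]}: #{minutes} minutes, #{speedup}x faster than before"
# prints "IN_production: 33.5 minutes, 8.1x faster than before"
```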
Thinking we can call this good - this is fast enough for now.
If someone could grab the database CPU graph from Cloudwatch and throw it up here to compare the past three days, that would be helpful. I'd like to see how much better the latest query is in terms of not stressing the database.
Closing this issue, since it's resolved.
We had a DB related outage last night on August 27th. This is a tracking issue to discuss remediation steps, fixes, and to track progress.
References
Checklist
db.r5 ~r4.xlarge type database on sandbox -- these are the "latest generation memory optimized" instance types, and are supported for PostgreSQL. Note that we couldn't use the r5 or higher because our Postgres version isn't recent enough
r4.2xlarge instance
deployment PRs
deployment the new slack webhook ENV var to production https://github.com/simpledotorg/deployment/pull/246 (this is required for the next step!)
simple-server so that we get https://github.com/simpledotorg/simple-server/pull/1258 deployed - kit/hari
/reports/regions - kit/hari - left some notes here.
last refreshed at timestamp on the My Facilities pages and inform the CVHOs. https://github.com/simpledotorg/simple-server/pull/1259 - kit
Rollback plan
disable_region_cache_warmer feature in Flipper in production, which should bypass the cache warming - see https://github.com/simpledotorg/simple-server/pull/1257/files#diff-e5a1a9711a94012bc35fc3a30e3c2d86R28
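For anyone unfamiliar with how a kill switch like this works, here is a rough sketch of the pattern. It is not the code from the linked PR: `RegionCacheWarmerSketch` and its arguments are hypothetical, and the flag lookup is injected as a lambda so the example runs without the Flipper gem. In the app the lookup would presumably be `Flipper.enabled?(:disable_region_cache_warmer)`.

```ruby
# Hypothetical stand-in for the real warmer; only the flag check matters here.
class RegionCacheWarmerSketch
  FLAG = :disable_region_cache_warmer

  attr_reader :warmed

  # flag_enabled: any callable that answers "is this flag on?".
  # Assumption: in production this would wrap Flipper.enabled?(FLAG).
  def initialize(regions:, flag_enabled:)
    @regions = regions
    @flag_enabled = flag_enabled
    @warmed = []
  end

  def call
    # Bail out before touching the database when the kill switch is on.
    return :skipped if @flag_enabled.call(FLAG)

    # Placeholder for the real per-region cache warming work.
    @regions.each { |region| @warmed << region }
    :warmed
  end
end

flags = { disable_region_cache_warmer: true }
lookup = ->(name) { flags.fetch(name, false) }

warmer = RegionCacheWarmerSketch.new(regions: %w[region-a region-b], flag_enabled: lookup)
puts warmer.call
```

Turning the flag on in production makes the job a no-op on its next run without a deploy, which is what makes it usable as a rollback.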