DB perf tracking issue

rsanheim commented 4 years ago

We had a DB-related outage last night, August 27th. This is a tracking issue to discuss remediation steps and fixes, and to track progress.

References

Checklist

[x] back off on materialized view refresh
[x] measure time to refresh caches on sandbox before any changes
[x] look into what it takes to upgrade Postgres RDS instance size
[x] upgrade to a ~db.r5~ r4.xlarge type database on sandbox -- these are the "latest generation memory optimized" instance types, and are supported for PostgreSQL. Note that we couldn't use the r5 or higher because our Postgres version isn't recent enough
[x] verify
[x] repeat for prod but use the r4.2xlarge instance
[x] turn back on region cache warming (with flipper toggle) https://github.com/simpledotorg/simple-server/pull/1257 NOTE: I got blocked here because I needed to deploy a new ENV var to production, and that requires SSH access to ship deployment PRs
[x] Merge and deploy the latest deployment repo with the new Slack webhook ENV var to production https://github.com/simpledotorg/deployment/pull/246 (this is required for the next step!)
[x] Deploy latest production simple-server so that we get https://github.com/simpledotorg/simple-server/pull/1258 deployed - kit/hari
[x] Run the cache warmer manually on production to verify the DB is fine and that we get notifications in Slack #alerts - kit/hari:
```
RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"
```
[ ] Verify production report page load times -- it should generally be 500 ms or lower for all the pages if they are all in the cache. You can find them starting at /reports/regions - kit/hari - left some notes here.
[x] Update Terraform configuration to be in sync with our new database instance types https://github.com/simpledotorg/deployment/commit/ef29a2231bb2a0b7882c06246374b441dea979b7 - prabhanshu
[x] Stick a last refreshed at timestamp on the My Facilities pages and inform the CVHOs. https://github.com/simpledotorg/simple-server/pull/1259 - kit

Rollback plan

Turn on disable_region_cache_warmer feature in Flipper in production, which should bypass the cache warming - see https://github.com/simpledotorg/simple-server/pull/1257/files#diff-e5a1a9711a94012bc35fc3a30e3c2d86R28
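
The rollback amounts to a guard at the top of the warmer that returns early when the flag is on. Here is a minimal sketch of that pattern. `FeatureFlags` is a hypothetical in-memory stand-in for Flipper, and `RegionCacheWarmerSketch` is an illustrative class, not the real `Reports::RegionCacheWarmer`; the actual check is in the PR linked above.

```ruby
# Hypothetical in-memory flag store; the real app uses Flipper.
class FeatureFlags
  def initialize
    @enabled = {}
  end

  def enable(name)
    @enabled[name.to_sym] = true
  end

  def disable(name)
    @enabled.delete(name.to_sym)
  end

  def enabled?(name)
    @enabled.key?(name.to_sym)
  end
end

# Illustrative warmer: skips all work when the kill switch is on.
class RegionCacheWarmerSketch
  def initialize(flags, regions)
    @flags = flags
    @regions = regions
  end

  # Returns the regions that were warmed (empty when bypassed).
  def call
    return [] if @flags.enabled?(:disable_region_cache_warmer)

    @regions.map do |region|
      # The real job would run the expensive report queries here.
      region
    end
  end
end

flags = FeatureFlags.new
warmer = RegionCacheWarmerSketch.new(flags, %w[region_a region_b])

warmed_before = warmer.call                # flag off: both regions are warmed
flags.enable(:disable_region_cache_warmer)
warmed_after = warmer.call                 # flag on: warming is bypassed
```

Because the check runs at call time, flipping the flag takes effect on the next scheduled run without a deploy.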

rsanheim commented 4 years ago

Ran the cache warm job on Sandbox on the current instance type (no upgrade yet):

deploy@ec2-13-235-33-14:~/apps/simple-server/current$ time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"

real    2m49.548s
user    0m17.600s
sys     0m0.616s

So just under 3 minutes. CPU on the DB spiked, of course. The strange thing is that it hasn't come back down yet:

It could be a lag in reporting, or it could be that we are pushing the DB out of memory and performance degrades as a result.

rsanheim commented 4 years ago

Started the maintenance on sandbox at 11:57 Central time. Upgrading to a db.r4.xlarge, which expands our size to 4 vCPUs and 30.5 GiB of memory. There is quite the difference between the two.

rsanheim commented 4 years ago

The upgrade took about 10 minutes - hard to tell from the AWS logs because they only show the start of the config change, and not when things are finished. Pretty fast though.

I ran the cache job again after, and the run time wasn't super improved but the CPU usage was quite a bit better:

deploy@ec2-13-235-33-14:~/apps/simple-server/current$ time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"

real    2m36.927s
user    0m17.060s
sys     0m0.580s

That first spike is the smaller instance, the second spike is the new db.r4.xlarge.

rsanheim commented 4 years ago

Upgraded production replica DB to an r4.2xlarge, took under 10 minutes.

rsanheim commented 4 years ago

Look at all that CPU headroom we got around 18:00 when the instance size changed over. ✨ ✨ ✨

rsanheim commented 4 years ago

@simpledotorg/server-developers I've updated the body of this issue with latest status and next steps. Nilenso team can take over when they are online with the last few remaining steps that require production access, but I think we are real close to at least a medium-term solution for our database / caching woes. 😁

kitallis commented 4 years ago

The Reports pages are consistently taking ~4-5 seconds to load post cache-warmup. I suspect this will come down further once the missing cache is added in https://github.com/simpledotorg/simple-server/pull/1260 – I'm confident we can reach a good 1-2 second load time on this.

[Screenshot 2020-08-28 at 7:07:51 PM]

The DB Utilization hovered around 40-50% for 4.3 hours during the warmup with nothing else running.

[Screenshot 2020-08-28 at 5:23:50 PM]

With the appointment reminders in between, it shot up to ~70% for ~10 minutes and then came back.

[Screenshot 2020-08-28 at 2:42:12 PM]

rsanheim commented 4 years ago

Just merged https://github.com/simpledotorg/simple-server/pull/1269, which uses a faster query and adds indexes on the main materialized view we are using.

Cuts the cache time quite a bit on sandbox:

time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"

real    1m11.690s
user    0m16.340s
sys     0m0.600s
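
To put the three sandbox runs in this thread side by side, here is a small sketch that converts the `real` values from `time` into seconds and compares each run against the first. The helper name `real_seconds` and the run labels are illustrative; the timings are the ones pasted above.

```ruby
# Convert a `time` value such as "2m49.548s" into seconds.
def real_seconds(value)
  match = value.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unrecognized time format: #{value}" unless match

  match[1].to_i * 60 + match[2].to_f
end

# `real` timings of the sandbox cache warmer runs reported in this issue.
runs = {
  "before instance upgrade" => "2m49.548s",
  "after upgrade to db.r4.xlarge" => "2m36.927s",
  "after faster query and indexes" => "1m11.690s"
}

baseline = real_seconds(runs.values.first)

runs.each do |label, value|
  seconds = real_seconds(value)
  # Ratio above 1.0 means the run was faster than the first one.
  puts format("%-32s %7.1fs  %.2fx vs baseline", label, seconds, baseline / seconds)
end
```

The instance upgrade alone made the job only slightly faster (about 1.08x), while the query and index change brings it to roughly 2.4x the original speed.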

Will be monitoring on production when it runs shortly here.

rsanheim commented 4 years ago

BD production looks really good:

[BD_production] Finished Facility caching in 4 seconds, total cache time was 8 seconds.
[BD_production] Reports::RegionCacheWarmer All done!

rsanheim commented 4 years ago

IN prod looks much better - down to 33 minutes from 4.5 hours - over an 8x speedup. 🚀🚀🚀

[IN_production] Finished Facility caching in 1771 seconds, total cache time was 2011 seconds.
[IN_production] Reports::RegionCacheWarmer All done!
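
As a quick check on the speedup figure, this sketch pulls the total cache time out of the log line above and compares it with the earlier 4.5 hour run. The helper name `total_cache_seconds` is illustrative, and the regex assumes the log format shown here.

```ruby
LOG_LINE = "[IN_production] Finished Facility caching in 1771 seconds, " \
           "total cache time was 2011 seconds."

# Extract the total cache time (in seconds) from a warmer log line.
def total_cache_seconds(line)
  match = line.match(/total cache time was (\d+) seconds/)
  raise ArgumentError, "no total cache time in line" unless match

  match[1].to_i
end

previous_seconds = 4.5 * 3600                  # earlier run: 4.5 hours
current_seconds = total_cache_seconds(LOG_LINE)

minutes = current_seconds / 60.0               # about 33.5 minutes
speedup = previous_seconds / current_seconds   # just over 8x

puts format("%.1f minutes, %.2fx faster", minutes, speedup)
```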

rsanheim commented 4 years ago

I think we can call this good for now; it is fast enough.

If someone could grab the database CPU graph from CloudWatch and throw it up here to compare the past three days, that would be helpful. I'd like to see how much better the latest query is in terms of not stressing the database.

kitallis commented 4 years ago

Closing this issue, since it's resolved.

simpledotorg / simple-server

DB perf tracking issue #1255
