simpledotorg / simple-server

The web app behind Simple.org
MIT License

DB perf tracking issue #1255

Closed rsanheim closed 4 years ago

rsanheim commented 4 years ago

We had a DB-related outage last night, on August 27th. This is a tracking issue to discuss remediation steps and fixes, and to track progress.

References

Checklist

Rollback plan

rsanheim commented 4 years ago

Ran the cache warm job on Sandbox on the current instance type (no upgrade yet):

deploy@ec2-13-235-33-14:~/apps/simple-server/current$ time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"

real    2m49.548s
user    0m17.600s
sys 0m0.616s

So just under 3 minutes. CPU on the DB spiked of course...the strange thing is it hasn't come back down yet:

image

Could be a lag in reporting, or it could be that we are pushing the DB out of memory, which then degrades performance?
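One thing the `time` output above already suggests: the runner process itself used very little CPU relative to the wall-clock time. A minimal sketch of that arithmetic follows, using only the three figures printed above. The helper name `cpu_share` is made up for illustration. The interpretation that the remaining time is spent waiting on the database is an assumption; it could also include network or other I/O.

```ruby
# Figures copied from the `time` output above (sandbox, before the upgrade).
real_s = 2 * 60 + 49.548 # wall-clock time in seconds
user_s = 17.600          # CPU seconds spent in user mode by the runner process
sys_s  = 0.616           # CPU seconds spent in kernel mode by the runner process

# Fraction of wall-clock time the process actually spent on the CPU.
# (Hypothetical helper, not part of simple-server.)
def cpu_share(real, user, sys)
  (user + sys) / real
end

share = cpu_share(real_s, user_s, sys_s)
puts format("On-CPU share of wall time: %.1f%%", share * 100)
puts format("Off-CPU (waiting) share:   %.1f%%", (1 - share) * 100)
```

With these inputs the on-CPU share comes out at roughly 11%, which is consistent with the job being bound by something outside the Ruby process rather than by the app server.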

rsanheim commented 4 years ago

Started the maintenance on sandbox at 11:57 Central time. Upgrading to a db.r4.xlarge, which expands our size to 4 vCPUs and 30.5 GiB of memory. There is quite the difference between the two.

rsanheim commented 4 years ago

The upgrade took about 10 minutes. It is hard to tell exactly from the AWS logs because they only show the start of the config change, not when things finished. Pretty fast though.

I ran the cache job again after, and the run time wasn't super improved but the CPU usage was quite a bit better:

deploy@ec2-13-235-33-14:~/apps/simple-server/current$ time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"

real    2m36.927s
user    0m17.060s
sys 0m0.580s

image

That first spike is the smaller instance, the second spike is the new db.r4.xlarge.

rsanheim commented 4 years ago

Upgraded the production replica DB to an r4.2xlarge; it took under 10 minutes.

rsanheim commented 4 years ago

image

Look at all that CPU headroom we got around 18:00 when the instance size changed over. ✨ ✨ ✨

rsanheim commented 4 years ago

@simpledotorg/server-developers I've updated the body of this issue with latest status and next steps. Nilenso team can take over when they are online with the last few remaining steps that require production access, but I think we are real close to at least a medium-term solution for our database / caching woes. 😁

kitallis commented 4 years ago

The Reports pages are consistently taking ~4-5 seconds to load after the cache warmup. I suspect this will come down further once the missing cache from https://github.com/simpledotorg/simple-server/pull/1260 is in place. I'm confident we can reach a good 1-2 second load time.

Screenshot 2020-08-28 at 7 07 51 PM

The DB Utilization hovered around 40-50% for 4.3 hours during the warmup with nothing else running.

Screenshot 2020-08-28 at 5 23 50 PM

With the appointment reminders in between, it shot up to ~70% for ~10 minutes and then came back.

Screenshot 2020-08-28 at 2 42 12 PM

rsanheim commented 4 years ago

Just merged https://github.com/simpledotorg/simple-server/pull/1269, which uses a faster query and adds indexes on the main materialized view we are using.

Cuts the cache time quite a bit on sandbox:

time RAILS_ENV=production bundle exec rails runner "Reports::RegionCacheWarmer.call"

real    1m11.690s
user    0m16.340s
sys 0m0.600s

Will be monitoring on production when it runs shortly here.
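To put the three sandbox runs in this thread side by side, here is a rough sketch that parses the `real` values reported by `time` and prints the change between consecutive runs. The labels and the helper names (`parse_real`, `percent_faster`) are made up for illustration. These are single runs, not a controlled benchmark, so the percentages are only indicative.

```ruby
# Parse a duration as printed by the shell's `time`, e.g. "2m49.548s".
def parse_real(text)
  match = text.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unrecognized duration: #{text}" unless match

  match[1].to_i * 60 + match[2].to_f
end

# Percentage reduction in run time going from `before` to `after`.
def percent_faster(before, after)
  (before - after) / before * 100
end

# `real` values copied from the sandbox runs earlier in this thread.
RUNS = {
  "before instance upgrade" => "2m49.548s",
  "after db.r4.xlarge upgrade" => "2m36.927s",
  "after PR 1269" => "1m11.690s"
}.freeze

seconds = RUNS.transform_values { |text| parse_real(text) }

seconds.each_cons(2) do |(_, before), (label, after)|
  puts format("%-28s %7.1fs  (%.1f%% faster than previous run)",
              label, after, percent_faster(before, after))
end
```

On these numbers the instance upgrade alone trims wall-clock time by a few percent, while the query and index change roughly halves it.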

rsanheim commented 4 years ago

BD production looks really good:

[BD_production] Finished Facility caching in 4 seconds, total cache time was 8 seconds.
[BD_production] Reports::RegionCacheWarmer All done!

rsanheim commented 4 years ago

IN prod looks much better - down to 33 minutes from 4.5 hours - over an 8x speedup. 🚀🚀🚀

[IN_production] Finished Facility caching in 1771 seconds, total cache time was 2011 seconds.
[IN_production] Reports::RegionCacheWarmer All done!
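As a quick check of the speedup figure, the sketch below pulls the total cache time out of the IN production log line and compares it with the 4.5 hour baseline quoted in this comment. The helper name `total_cache_seconds` is made up for illustration, and the regular expression assumes the log format stays exactly as shown here.

```ruby
# Log line copied verbatim from the IN production output above.
LOG_LINE = "[IN_production] Finished Facility caching in 1771 seconds, " \
           "total cache time was 2011 seconds."

# Extract the total cache time in seconds from a cache warmer log line.
# (Hypothetical helper; assumes the wording shown in this thread.)
def total_cache_seconds(line)
  match = line.match(/total cache time was (\d+) seconds/)
  raise ArgumentError, "no total cache time in: #{line}" unless match

  match[1].to_i
end

previous_seconds = 4.5 * 3600 # the "4.5 hours" baseline quoted above
current_seconds  = total_cache_seconds(LOG_LINE)

puts format("%.1f minutes, %.1fx speedup",
            current_seconds / 60.0, previous_seconds / current_seconds)
```

This works out to about 33.5 minutes and a speedup of just over 8x, matching the figures quoted above.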

rsanheim commented 4 years ago

I think we can call this good for now; this is fast enough.

If someone could grab the database CPU graph from CloudWatch and throw it up here to compare the past three days, that would be helpful. I'd like to see how much better the latest query is in terms of not stressing the database.
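If a screenshot of the console graph is not handy, one alternative is to pull the underlying datapoints with the AWS CLI. The sketch below only builds the command; it does not run it. It assumes the AWS CLI is installed and configured, and the instance identifier `my-db-instance` is a placeholder, not the real name of any database in this project. The helper name `cpu_metric_argv` is also made up for illustration.

```ruby
require "shellwords"
require "time"

# Placeholder: replace with the real RDS instance identifier.
DB_INSTANCE_ID = "my-db-instance"

# Build the argv for `aws cloudwatch get-metric-statistics`, requesting the
# RDS CPUUtilization metric for the `days` leading up to `end_time`.
def cpu_metric_argv(instance_id, end_time:, days: 3, period_seconds: 300)
  start_time = end_time - days * 24 * 3600
  [
    "aws", "cloudwatch", "get-metric-statistics",
    "--namespace", "AWS/RDS",
    "--metric-name", "CPUUtilization",
    "--dimensions", "Name=DBInstanceIdentifier,Value=#{instance_id}",
    "--start-time", start_time.utc.iso8601,
    "--end-time", end_time.utc.iso8601,
    "--period", period_seconds.to_s,
    "--statistics", "Average", "Maximum"
  ]
end

# Print a shell-escaped command line that can be pasted into a terminal.
puts cpu_metric_argv(DB_INSTANCE_ID, end_time: Time.now).shelljoin
```

A 5 minute period over three days gives 864 datapoints, which should be fine-grained enough to see the cache warmer runs before and after the query change.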

kitallis commented 4 years ago

Closing this issue, since it's resolved.