handle OOM errors gracefully

jessemtso commented 8 years ago

when a lot of users hit the site after it was posted in NYC SSBM, mongo OOM'd, which led to the OOM killer killing the API.

we need to:

1) handle OOM errors gracefully, restarting things when necessary
2) rework the way we're using mongo (caching maybe?) so that we don't blow memory

jessemtso commented 8 years ago

let's see if @jiangtyd has any ideas. he likes algorithms and optimizing

jschnei commented 8 years ago

Okay, I made some changes that should keep the server stable while I'm travelling this week (although there are still issues to be dealt with).

Upstart now monitors the twistd services and automatically restarts them when they go down (with the exception that if they go down more than 10 times in a period of 90 seconds, it'll stop trying to restart; in this case something is seriously wrong). I couldn't get upstart to interface with jenkins easily, but there's a cron job that checks if jenkins is running every hour and restarts it if it isn't. (85d231e95c981bba4c8179c6a42ad0075e176846 a3c0c6ada3fa98614a023d05f7ceab29f882e268)
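
For anyone reading along, the "restart unless it went down more than 10 times in 90 seconds" policy can be sketched roughly as below. This is only an illustration of the policy in Python, not Upstart's implementation or the actual config from those commits; `RespawnLimiter` and its parameters are hypothetical names.

```python
import time
from collections import deque


class RespawnLimiter:
    """Allow restarts until `limit` of them have happened within `interval` seconds.

    Hypothetical helper that only illustrates the policy described above.
    """

    def __init__(self, limit=10, interval=90.0, clock=time.monotonic):
        self.limit = limit
        self.interval = interval
        self.clock = clock
        self.restarts = deque()  # timestamps of recent restarts

    def should_restart(self):
        now = self.clock()
        # Forget restarts that have fallen out of the sliding window.
        while self.restarts and now - self.restarts[0] > self.interval:
            self.restarts.popleft()
        if len(self.restarts) >= self.limit:
            return False  # too many crashes recently: stop trying
        self.restarts.append(now)
        return True
```

A supervisor loop would call `should_restart()` each time the service exits and give up (and ideally alert someone) once it returns `False`.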

Also, #85 changes the frontend so that we query the local player list when searching for players in the drop-down box instead of repeatedly querying the server; this should take a significant load off the server.
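
The idea behind that change, roughly: fetch the player list once, then filter it locally on every keystroke. The sketch below is Python purely for illustration (the real change is in the frontend code in #85, which isn't shown here); `search_players` and the `id`/`name` fields are assumptions, not the actual data model.

```python
def search_players(cached_players, query, limit=10):
    """Return up to `limit` cached players whose name contains `query`.

    `cached_players` is a list of dicts fetched from the server once;
    the "name" key is an assumed field name.
    """
    needle = query.strip().lower()
    if not needle:
        return []  # nothing typed yet: show no suggestions
    matches = [p for p in cached_players if needle in p["name"].lower()]
    return matches[:limit]


# Example: no server round trip is needed per keystroke.
players = [{"id": 1, "name": "Armada"}, {"id": 2, "name": "Mango"}]
suggestions = search_players(players, "man")
```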

Things to do:

- Set up some sort of automated load testing/server monitoring, maybe using an external service like loader.io. It would be good to know exactly what sort of queries are causing the site to explode.
- I suspect that queries to the tournament collection are particularly memory intensive, since the tournament objects in mongo carry around the raw field with them. I suggest moving the raw field to its own collection (or to disk?) and just storing a pointer in the tournament object.
- Optimize the frontend a bunch. The dropdown menu was by far the worst offender, but there are a bunch of places where we can cut down on the number of requests or use data that we've already cached locally.
- Move to a server with more memory? Definitely not a pressing need, but we should keep it in mind for the future. The server is running 4 web processes (prod/stage api/webapp) along with jenkins and mongo, so it is a sizeable load.
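
To make the raw-field suggestion above concrete, here is a minimal sketch using plain dicts in place of mongo documents. Only the `raw` field name comes from this thread; `split_raw`, `raw_id`, and `data` are hypothetical names, and the actual storage calls are left out.

```python
def split_raw(tournament, raw_id):
    """Return (slim_tournament, raw_doc) without mutating the input.

    The slim document keeps everything except the bulky "raw" field and
    gains a "raw_id" pointer (assumed name) to the separately stored data.
    """
    slim = {key: value for key, value in tournament.items() if key != "raw"}
    slim["raw_id"] = raw_id
    raw_doc = {"_id": raw_id, "data": tournament.get("raw")}
    return slim, raw_doc


# Example: the slim document is what most queries would load.
tournament = {"_id": "t1", "name": "Weekly", "raw": "<large bracket file>"}
slim, raw_doc = split_raw(tournament, "r1")
```

Queries that only list or rank tournaments would then load the slim documents, and the raw data would be fetched by `raw_id` only when it is actually needed.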

BrandonCookeDev commented 8 years ago

did anything ever happen here?

jschnei commented 8 years ago

Yes: load testing was implemented at some point via loader.io. Right now the api can handle about 1 request/sec but will crash at around 2 requests per second. Here are some fancy graphs:

[Graph: load testing with 1 request/sec] [Graph: load testing with 2 requests/sec]
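
If anyone wants to reproduce a fixed-rate test like this locally, a rough sketch is below. It is not loader.io's API, and it sends requests one at a time, so it understates real concurrent load; `run_at_rate` and the `send` callable are hypothetical.

```python
import time


def run_at_rate(send, requests_per_sec, duration_sec,
                clock=time.monotonic, sleep=time.sleep):
    """Call `send()` at a steady rate; return (ok_count, error_count)."""
    interval = 1.0 / requests_per_sec
    total = int(requests_per_sec * duration_sec)
    start = clock()
    ok = errors = 0
    for i in range(total):
        # Sleep until the i-th request is due on the fixed schedule.
        delay = start + i * interval - clock()
        if delay > 0:
            sleep(delay)
        try:
            send()
            ok += 1
        except Exception:
            errors += 1  # count failures (e.g. dropped responses) instead of stopping
    return ok, errors
```

Here `send` would be any function that makes one request against the API and raises on failure; comparing the error count at 1 and 2 requests/sec should show roughly where things start to fall over.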

Right now the biggest outstanding issue is removing the raw field from tournaments (#102); now that the ORM stuff is basically done, I'll probably push something for that and then rerun the load tests. Generally, since stuff restarts itself when it crashes, the site should be "usable" under heavy load (but might drop some responses when OOM occurs).

Fixing the raw field should improve things a bunch, but moving to a server with slightly more RAM is still something we should look into/consider if we get enough traffic. The best deal I can find is Linode servers where we can get servers with 2GB of RAM (currently we only have 1GB) and other decent specs for $10/month (AWS/DigitalOcean/Google Compute all seemed more expensive). But for now we should be fine.

BrandonCookeDev commented 6 years ago

Is this still an issue or can it be closed? @jschnei

ripgarpr / garpr

handle OOM errors gracefully #84