ripgarpr / garpr

the revival of GarPR. may gar rest in RIP

handle OOM errors gracefully #84

Open jessemtso opened 8 years ago

jessemtso commented 8 years ago

when a lot of users hit the site after it was posted in NYC SSBM, mongo OOM'd, which led to the OOM killer killing the API.

we need to:

1) handle OOM errors gracefully, restarting things when necessary
2) rework the way we're using mongo (caching, maybe?) so that we don't blow through memory
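
A minimal sketch of the caching idea in point 2, assuming Python 3 and assuming the expensive part is a repeated, identical mongo query. `TTLCache`, `get_or_compute`, and the `players:nyc` key are hypothetical names for illustration and are not part of GarPR. The entry cap is what keeps the cache itself from becoming a new source of memory pressure.

```python
import time


class TTLCache:
    """Tiny in-process cache with a time-to-live and a hard cap on entries."""

    def __init__(self, ttl_seconds, max_entries=128, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock          # injectable so the cache can be tested
        self._entries = {}          # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]         # still fresh: skip the database entirely
        value = compute()           # e.g. the real mongo query
        if key not in self._entries and len(self._entries) >= self.max_entries:
            # Evict the entry that was inserted first (dicts keep insertion order).
            del self._entries[next(iter(self._entries))]
        self._entries[key] = (now, value)
        return value
```

Usage would look like `cache.get_or_compute("players:nyc", run_query)`, where `run_query` stands in for whatever function currently hits mongo on every request.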

jessemtso commented 8 years ago

let's see if @jiangtyd has any ideas. he likes algorithms and optimizing

jschnei commented 8 years ago

Okay, I made some changes that should keep the server stable while I'm travelling this week (although there are still issues to be dealt with).

Upstart now monitors the twistd services and automatically restarts them when they go down. The one exception: if a service goes down more than 10 times within 90 seconds, Upstart stops trying to restart it, since at that point something is seriously wrong. I couldn't get Upstart to interface with Jenkins easily, but there's a cron job that checks every hour whether Jenkins is running and restarts it if it isn't. (85d231e95c981bba4c8179c6a42ad0075e176846, a3c0c6ada3fa98614a023d05f7ceab29f882e268)
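
For reference, Upstart expresses this kind of policy with its `respawn` and `respawn limit` stanzas; the actual job files are in the commits above and aren't reproduced here. The rule itself (give up after more than 10 restarts in 90 seconds) can be sketched in Python as a sliding-window counter. `RespawnLimiter` and `allow_restart` are hypothetical names for illustration, not anything from Upstart or GarPR.

```python
from collections import deque


class RespawnLimiter:
    """Permit up to `limit` restarts within any `interval`-second window."""

    def __init__(self, limit=10, interval=90.0):
        self.limit = limit
        self.interval = interval
        self._times = deque()       # timestamps of recent permitted restarts

    def allow_restart(self, now):
        # Forget restarts that have aged out of the window.
        while self._times and now - self._times[0] >= self.interval:
            self._times.popleft()
        if len(self._times) >= self.limit:
            # Too many recent restarts. A real supervisor would stop the
            # job for good here rather than ask again later.
            return False
        self._times.append(now)
        return True
```

A supervisor loop would call `allow_restart(time.monotonic())` each time the service dies and stop respawning on the first `False`.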

Also, #85 changes the frontend so that we query the local player list when searching for players in the drop-down box instead of repeatedly querying the server; this should take a significant load off the server.
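
The idea behind that change, sketched in Python for illustration only. The real change is in the frontend code from #85, which isn't shown in this thread. This assumes the player list has already been fetched once and that each player is a dict with a `name` key; `filter_players` and the case-insensitive substring match are assumptions.

```python
def filter_players(players, query, limit=10):
    """Filter an already-loaded player list locally; no server round trip."""
    needle = query.strip().lower()
    if not needle:
        return []                   # nothing typed yet: show no suggestions
    matches = [p for p in players if needle in p["name"].lower()]
    return matches[:limit]          # cap what the drop-down has to render
```

Each keystroke then costs a pass over a list that is already in memory instead of a request to the API.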

Things to do:

BrandonCookeDev commented 8 years ago

did anything ever happen here?

jschnei commented 8 years ago

Yes: load testing was implemented at some point via loader.io. Right now the API can handle about 1 request/sec but crashes at around 2 requests/sec. Here are some fancy graphs:

[Graph: load testing with 1 request/sec]
[Graph: load testing with 2 requests/sec]

Right now the biggest outstanding issue is removing the raw field from tournaments (#102). Now that the ORM stuff is basically done, I'll probably push something for that and then rerun the load tests. Generally, since everything restarts itself when it crashes, the site should be "usable" under heavy load (but might drop some responses when an OOM occurs).

Fixing the raw field should improve things a bunch, but moving to a server with slightly more RAM is still something we should consider if we get enough traffic. The best deal I can find is Linode, where we can get a server with 2GB of RAM (currently we only have 1GB) and other decent specs for $10/month (AWS, DigitalOcean, and Google Compute all seemed more expensive). But for now we should be fine.

BrandonCookeDev commented 6 years ago

Is this still an issue or can it be closed? @jschnei