Open jmcmurry opened 9 years ago
Load tests are important, but I think this needs to be unpacked into a few different items: unit test sets, a test framework for pushing the services until they overload, fuzz testing, etc. Also, is it certain that it was an overload as opposed to some other type of issue, and if so, which parts caused the load?
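To make the "push the services until they overload" item concrete, here is a minimal sketch in Python (standard library only). It is not an existing tool of ours: `fetch`, `ramp_load`, the concurrency levels, and the 5% error threshold are all assumptions chosen for illustration. It steps up the number of concurrent workers and stops at the first level where the error rate crosses the threshold.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Issue one GET; return True only if the service answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Timeouts, refused connections, and HTTP 4xx/5xx all count as failures.
        return False


def ramp_load(call, levels=(1, 5, 10, 25, 50), requests_per_level=100,
              max_error_rate=0.05):
    """Run `call` at increasing concurrency; stop once errors exceed the threshold.

    `call` is any zero-argument function returning True on success.
    All default values here are illustrative assumptions.
    """
    def timed(_):
        start = time.perf_counter()
        ok = call()
        return ok, time.perf_counter() - start

    results = []
    for workers in levels:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            outcomes = list(pool.map(timed, range(requests_per_level)))
        errors = sum(1 for ok, _ in outcomes if not ok)
        latencies = sorted(elapsed for _, elapsed in outcomes)
        row = {
            "workers": workers,
            "error_rate": errors / requests_per_level,
            "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        }
        results.append(row)
        if row["error_rate"] > max_error_rate:
            break  # this concurrency level is where the service started failing
    return results


# Example (hypothetical URL; point this at a staging server, not production):
# for row in ramp_load(lambda: fetch("http://localhost:8080/")):
#     print(row)
```

The last row returned gives a rough idea of the concurrency at which things start to break, which would also help answer whether the outage was really an overload.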
Also, since the desired outcome is to keep things from going down, coordinating with production on load balancing and failover options would go a long way toward preventing disruptions.
I looked at Google Analytics and it showed only 70 counted users today, compared to over a hundred on most recent days. So either Google Analytics hasn't finished counting, or the outage wasn't caused by load from our presentation.
Also, given the outages, there should be an external turnkey solution for restarting the servers. In addition, something we've had in AmiGO due to similar circumstances is a dead man's switch, so that traffic can be redirected to a secondary site (e.g., sending people to beta would probably be better than a 500 error).
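For anyone unfamiliar with the dead man's switch idea, a rough sketch in Python follows. This is not the AmiGO implementation; the class name, the URLs, and the 60-second timeout are hypothetical. The primary app sends a heartbeat periodically, and if the heartbeat goes stale, traffic is pointed at the secondary site.

```python
import time


class DeadMansSwitch:
    """Choose the secondary site when the primary stops sending heartbeats.

    Names and defaults are illustrative assumptions, not an existing API.
    """

    def __init__(self, primary, secondary, timeout_s=60.0, clock=time.monotonic):
        self.primary = primary
        self.secondary = secondary
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()    # treat construction as the first heartbeat

    def beat(self):
        """Called by (or on behalf of) the primary while it is healthy."""
        self._last_beat = self._clock()

    def target(self):
        """Return the site that traffic should currently be sent to."""
        stale = self._clock() - self._last_beat > self.timeout_s
        return self.secondary if stale else self.primary
```

Whatever sits in front of the app (or a cron job that rewrites a redirect) would call `target()` to decide where to send people. Once the primary resumes sending heartbeats, traffic goes back to it automatically.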
Part of the issue with the current architecture is that there is a single (now Node) app running that contacts a bunch of services. If you put a more robust system in front of it (Apache, nginx, etc.), then even if the app goes down, you still have control and fallback options.
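In practice this would be a few lines of Apache or nginx configuration, but the logic is simple enough to sketch in Python. Everything here is an assumption for illustration: the function names, the upstream address `http://127.0.0.1:3000`, and the fallback message. The point is only that the front layer answers even when the app behind it does not.

```python
from urllib.request import urlopen

FALLBACK_BODY = b"The site is temporarily unavailable; please try again shortly."


def fetch_upstream(path, base="http://127.0.0.1:3000", timeout=5):
    """Forward a GET to the app (address is a placeholder) and return (status, body)."""
    with urlopen(base + path, timeout=timeout) as resp:
        return resp.status, resp.read()


def front_door(path, upstream=fetch_upstream):
    """Serve from the app when possible, otherwise return a controlled fallback."""
    try:
        return upstream(path)
    except OSError:
        # Covers refused connections, timeouts, and urllib's HTTPError/URLError.
        # Note: this sketch treats every HTTP error status as a failure; a real
        # proxy would pass ordinary 4xx responses through unchanged.
        return 503, FALLBACK_BODY
```

With this shape, a crashed app yields a clean 503 page (or a redirect to beta) rather than a dropped connection.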
We urgently need to develop some load tests. We can all agree that the servers cannot go down, especially when we have spikes in usage during and after presentations. Who is the right person to take this on?