**Closed** — pushred closed this issue 10 years ago
Workers are instantiated by this method: https://github.com/SparkartGroupInc/solidus/blob/master/lib/preprocessor.js#L52-L59
Preprocessor workers are the only child processes in Solidus. They are automatically reaped and restarted when they encounter an error or hang. It's possible there's a problem with https://github.com/rvagg/node-worker-farm that causes processes not to be properly reaped. I'm also not sure whether the graphs Modulus gives us track resource usage in child processes, so they could be locking the instance up without our seeing it. Presently, workers can only take 100 calls before they're automatically reaped and restarted.
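The call-budget reaping described above can be sketched roughly as follows. This is a minimal illustration, not Solidus code: `WorkerSlot`, the synchronous `exec`, and the factory-based respawn are hypothetical, modeled on node-worker-farm's `maxCallsPerWorker` option.

```javascript
// Sketch of a per-worker call budget: after a fixed number of calls the
// worker is discarded and a fresh one is spawned, so a slow leak or hang
// in any single worker can't persist indefinitely.
const MAX_CALLS_PER_WORKER = 100;

class WorkerSlot {
  constructor(spawn) {
    this.spawn = spawn;       // factory that creates a fresh worker function
    this.worker = spawn();
    this.calls = 0;
  }

  exec(job) {
    const result = this.worker(job);
    this.calls += 1;
    // Reap and restart once the call budget is exhausted.
    if (this.calls >= MAX_CALLS_PER_WORKER) {
      this.worker = this.spawn();
      this.calls = 0;
    }
    return result;
  }
}
```

If the reaping in node-worker-farm were failing, the symptom would be old worker processes accumulating rather than being replaced at this boundary.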
I'm still convinced that the problem is not within Solidus itself, because this happens across all of our Solidus sites at once. All of our sites share other common threads: using the Storyteller.io proxy, the same DNS service, running on Modulus, fronted by Edgecast, etc. Isn't it more likely the problem lies in one of those places?
As @Fauntleroy is saying, I would be surprised if the problem is with Solidus itself, unless we've hit a limit of some kind in Modulus across all sites at the same time, or in a library we're using.
I suggest we add robustness to resource fetching first, so the servers can survive these kinds of problems; then we can dig deeper and fix the actual causes. The robustness changes will help all sites anyway, so nothing is lost.
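One way that robustness could look is a simple retry with backoff around each resource request. This is only a sketch under assumptions: `fetchResource`, the retry count, and the delays are illustrative, not the actual Solidus fetcher.

```javascript
// Retry a failing resource fetch a few times with increasing delay, so a
// transient timeout doesn't immediately fail the page render.
async function fetchWithRetry(fetchResource, url, { retries = 2, delayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchResource(url);
    } catch (err) {
      lastError = err;
      // Back off before the next attempt (linear backoff for simplicity).
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
      }
    }
  }
  // All attempts failed; surface the last error to the caller.
  throw lastError;
}
```

With intermittent network errors like the ones we're seeing, even one retry would absorb most of the transient `ETIMEDOUT` failures.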
I think what we need with Modulus is more ammunition to show them that something is wrong on their end. They've been pointing the finger at Storyteller since the errors are timeouts, but Richard did acknowledge the possibility of a network-related error. The frequency still seems odd, however: why would requests to the proxy continuously time out in this way?
Was the sparkart.com site timing out too? If so, then it's definitely not the proxy's fault, since it uses the old Storyteller: https://github.com/SparkartGroupInc/sparkart.com/blob/master/views/index.hbs#L5-L6
Yup, this is what those resources looked like:
Totally missed that they were still using our "proxy" site! I'll switch those over soon.
Here is the Pingdom data for the Modulus sites, for March 25th, between 6pm and midnight:
| Hour (Mar 25) | bravado.com | donnyosmond.com | immunityproject.org | keithurban.net | sparkart.com |
|---|---|---|---|---|---|
| 18:00–19:00 | 1,948 ms | 693 ms | 362 ms | 2,136 ms | 450 ms |
| 19:00–20:00 | 1,841 ms | 639 ms | 358 ms | 2,045 ms | 455 ms |
| 20:00–21:00 | 1,987 ms | 672 ms | 319 ms | 1,849 ms | 443 ms |
| 21:00–22:00 | 3,772 ms | 810 ms | 3,001 ms | 4,331 ms | 3,546 ms |
| 22:00–23:00 | 3,401 ms | 718 ms | 2,347 ms | 2,950 ms | 3,021 ms |
| 23:00–24:00 | 1,915 ms | 580 ms | 337 ms | 2,379 ms | 422 ms |
And the Storyteller APIs (both at 100% uptime):

| Hour (Mar 25) | proxy.storyteller.io | api.storytellerhq.com |
|---|---|---|
| 18:00–19:00 | 338 ms | 244 ms |
| 19:00–20:00 | 326 ms | 272 ms |
| 20:00–21:00 | 339 ms | 211 ms |
| 21:00–22:00 | 361 ms | 242 ms |
| 22:00–23:00 | 331 ms | 225 ms |
| 23:00–24:00 | 320 ms | 214 ms |
It's pretty clear that something happened to the Modulus sites (except donnyosmond.com) while nothing happened to the Storyteller APIs. Am I missing any other Modulus site?
There’s been a recurring issue where resource requests suddenly start to fail across all of our sites hosted with Modulus. This usually triggers a period of downtime due to very high response times. The most recent incident occurred on March 26, lasting for an hour around 9:30–10:30. Attempts to restart the Modulus servers, re-deploy, etc. had no effect. Recovery usually happens on its own, quickly and suddenly, possibly when Modulus kills the processes that may be at fault. During this incident they were not seeing anything unusual on their end.
Pingdom looked like:
Modulus looked like:
I can provide the full logs on request. I believe the `ETIMEDOUT` error was the core issue, with the high page response times and preprocessor errors being symptomatic.

Storyteller.io
Resources on sites are generally proxied via Storyteller.io. Modulus suspected issues on our end, but checking our own systems, nothing seemed to be wrong. Heroku reported no issues for the day or the ~2 days around it. Pingdom records 100% uptime for the period. Logs from the period show no requests for site-related resources starting around 9:36pm, which correlates with the longest period of downtime recorded by Pingdom (36m):
The last request from a site was here:
Requests from the `hipster-tools` app for resources used there continued, however:

Random requests from Pingdom were also successful:
The next successful request from a site comes here:
Pingdom was making successful requests around this time, before another 18-minute downtime event from 05:13 to 05:31. The site remained operational for another 15 hours from there, with a 1-minute flicker, and has had 11+ days of uptime since.
If this was an intermittent network issue, that could explain why our app’s requests continued to work while outside requests did not. The Pingdom checks of our status endpoint aren’t quite the same thing, but they definitely don’t support faulting Storyteller.io for these issues.
Child Processes
There’s something of a pattern of this happening in the late night/early morning hours. The prior incident occurred March 12, starting at 12:19am, with non-stop flapping overnight until 8am, totaling 5 hours 43 minutes. Before that, February 28 had 43 minutes of downtime around 11:40–12:26am. Another pattern is long periods of 1–5 minute flapping that culminate in longer periods of downtime, which then suddenly recover. Timestamps can be seen in Pingdom.
In the case of March 12, Modulus was experiencing an issue where a percentage of servos failed to start due to an `EADDRINUSE` error, so re-deploying wasn’t even possible without their intervention. They had this to say:
It’s possible this incident was unrelated, but it does make me wonder about the child processes we spawn. Charlie previously showed us examples of zombie processes spawned from Solidus servers. I haven’t heard of further incidents, though; this may have predated the switch to node-worker-farm.
Are any child processes spawned for resource requests?
Resiliency
We may not be able to identify the exact cause of this issue if it is truly intermittent network latency. But if that’s the case, I think we need more resilient resource fetching that doesn’t bring an entire site down. In Orator, a site continued to load even if resource requests failed; this led to issues like disappearing content, but it otherwise maintained some level of uptime.
There is a 3.5 second timeout for resources, but is the fetching synchronous? I’m wondering why the rest of the site doesn’t respond if all the resources time out around the same time.
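One non-blocking approach would be to race each resource fetch against its timeout and fall back to a placeholder, so a hung resource degrades one piece of content instead of the whole response. This is a sketch under assumptions: `fetchResource` is hypothetical, and the 3.5s figure comes from the timeout mentioned above, not from reading the Solidus source.

```javascript
// Race a resource fetch against a timeout. On timeout (or error) the
// resource resolves to null, so the template can still render with that
// one piece of content simply missing.
const RESOURCE_TIMEOUT_MS = 3500;

function fetchWithFallback(fetchResource, url, timeoutMs = RESOURCE_TIMEOUT_MS) {
  const timeout = new Promise((resolve) =>
    setTimeout(() => resolve(null), timeoutMs)
  );
  // Whichever settles first wins; neither branch ever rejects, so a hung
  // or failing resource can't take the page render down with it.
  return Promise.race([
    fetchResource(url).catch(() => null),
    timeout,
  ]);
}
```

Rendering with `Promise.all` over a set of `fetchWithFallback` calls would then always settle within the timeout budget, which is roughly the Orator behavior described above.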