**Closed** — pushred closed this issue 10 years ago
Workers are instantiated by this method: https://github.com/SparkartGroupInc/solidus/blob/master/lib/preprocessor.js#L52-L59
Preprocessor workers are the only child processes in Solidus. They are automatically reaped and restarted when they encounter an error or hang. It's possible there's a problem with https://github.com/rvagg/node-worker-farm that causes processes not to be properly reaped. I'm also not sure whether the graphs Modulus gives us track resource usage in child processes, so they could be locking the instance up without our seeing it. Presently, workers can only take 100 calls before they're automatically reaped and restarted.
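The call-budget reaping described above can be sketched roughly as follows. This is a minimal illustration, not Solidus code: `WorkerSlot`, the synchronous `exec`, and the factory-based respawn are hypothetical, modeled on node-worker-farm's `maxCallsPerWorker` option.

```javascript
// Sketch of a per-worker call budget: after a fixed number of calls the
// worker is discarded and a fresh one is spawned, so a slow leak or hang
// in any single worker can't persist indefinitely.
const MAX_CALLS_PER_WORKER = 100;

class WorkerSlot {
  constructor(spawn) {
    this.spawn = spawn;       // factory that creates a fresh worker function
    this.worker = spawn();
    this.calls = 0;
  }

  exec(job) {
    const result = this.worker(job);
    this.calls += 1;
    // Reap and restart once the call budget is exhausted.
    if (this.calls >= MAX_CALLS_PER_WORKER) {
      this.worker = this.spawn();
      this.calls = 0;
    }
    return result;
  }
}
```

If the reaping in node-worker-farm were failing, the symptom would be old worker processes accumulating rather than being replaced at this boundary.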
I'm still convinced that the problem is not within Solidus itself, because this happens across all of our Solidus sites at once. All of our sites share other common threads: using the Storyteller.io proxy, the same DNS service, running on Modulus, fronted by Edgecast, etc. Isn't it more likely the problem lies in one of those places?
As @Fauntleroy is saying, I would be surprised if the problem is with Solidus itself, unless we've hit a limit of some kind in Modulus across all sites at the same time, or in a library we're using.
I suggest we add robustness to resource fetching first, so the servers can survive these kinds of problems; then we can dig deeper and fix the actual causes. The robustness changes will help all sites anyway, so nothing is lost.
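One way that robustness could look is a simple retry with backoff around each resource request. This is only a sketch under assumptions: `fetchResource`, the retry count, and the delays are illustrative, not the actual Solidus fetcher.

```javascript
// Retry a failing resource fetch a few times with increasing delay, so a
// transient timeout doesn't immediately fail the page render.
async function fetchWithRetry(fetchResource, url, { retries = 2, delayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchResource(url);
    } catch (err) {
      lastError = err;
      // Back off before the next attempt (linear backoff for simplicity).
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
      }
    }
  }
  // All attempts failed; surface the last error to the caller.
  throw lastError;
}
```

With intermittent network errors like the ones we're seeing, even one retry would absorb most of the transient `ETIMEDOUT` failures.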
I think what we need with Modulus is more ammunition to show them that something is wrong on their end. They've been pointing the finger at Storyteller since the errors are timeouts, but Richard did acknowledge the possibility of a network-related error. The frequency still seems odd, however: why would requests to the proxy continuously time out in this way?
Was the sparkart.com site timing out too? If so, then it's definitely not the proxy's fault, since it uses the old Storyteller: https://github.com/SparkartGroupInc/sparkart.com/blob/master/views/index.hbs#L5-L6
Yup, this is what those resources looked like:
Totally missed that they were still using our "proxy" site! I'll switch those over soon.
Here is the Pingdom data for the Modulus sites, for March 25th, between 6pm and midnight:
| Hour (Mar 25) | bravado.com | donnyosmond.com | immunityproject.org | keithurban.net | sparkart.com |
|---|---|---|---|---|---|
| 18:00–19:00 | 1,948 ms | 693 ms | 362 ms | 2,136 ms | 450 ms |
| 19:00–20:00 | 1,841 ms | 639 ms | 358 ms | 2,045 ms | 455 ms |
| 20:00–21:00 | 1,987 ms | 672 ms | 319 ms | 1,849 ms | 443 ms |
| 21:00–22:00 | 3,772 ms | 810 ms | 3,001 ms | 4,331 ms | 3,546 ms |
| 22:00–23:00 | 3,401 ms | 718 ms | 2,347 ms | 2,950 ms | 3,021 ms |
| 23:00–24:00 | 1,915 ms | 580 ms | 337 ms | 2,379 ms | 422 ms |
And the Storyteller APIs (both at 100% uptime):

| Hour (Mar 25) | proxy.storyteller.io | api.storytellerhq.com |
|---|---|---|
| 18:00–19:00 | 338 ms | 244 ms |
| 19:00–20:00 | 326 ms | 272 ms |
| 20:00–21:00 | 339 ms | 211 ms |
| 21:00–22:00 | 361 ms | 242 ms |
| 22:00–23:00 | 331 ms | 225 ms |
| 23:00–24:00 | 320 ms | 214 ms |
It's pretty clear that something happened to the Modulus sites (except donnyosmond.com) while nothing happened to the Storyteller APIs. Am I missing any other Modulus site?
There’s been a recurring issue where resource requests suddenly start to fail across all of our sites hosted with Modulus. This usually triggers a period of downtime due to very high response times. The most recent incident occurred on March 26, lasting for an hour around 9:30–10:30. Attempts to restart the Modulus servers, re-deploy, etc. had no effect. Recovery usually happens on its own, quickly and suddenly, possibly when Modulus kills the processes that may be at fault. During this incident they were not seeing anything unusual on their end.
Pingdom looked like:
Modulus looked like:
I can provide the full logs on request. I believe the `ETIMEDOUT` error was the core issue, with the high page response times and preprocessor errors being symptomatic.

Storyteller.io
Resources on sites are generally proxied via Storyteller.io. Modulus suspected issues on our end, but checking our own systems, nothing seemed to be wrong. Heroku reported no issues for the day or the ~2 days around it. Pingdom records 100% uptime for the period. Logs from the period show no requests for site-related resources starting around 9:36pm, which correlates with the longest period of downtime recorded by Pingdom (36m):
The last request from a site was here:
Requests from the `hipster-tools` app for resources used there continued, however:

Random requests from Pingdom were also successful:
The next successful request from a site comes here:
Pingdom was making successful requests around this time, before another 18-minute downtime event from 05:13 to 05:31. The site remained operational for another 15 hours from there, with a 1-minute flicker, and has had 11+ days of uptime since.
If this was an intermittent network issue, that could explain why our app’s requests continued to work while outside requests did not. The Pingdom checks of our status endpoint aren’t quite the same thing, but they definitely don’t support faulting Storyteller.io for these issues.
Child Processes
There’s something of a pattern of this happening in the late night/early morning hours. The prior incident occurred March 12, starting at 12:19am, with non-stop flapping overnight until 8am, totaling 5 hours 43 minutes. Before that, February 28 had 43 minutes of downtime around 11:40–12:26am. Another pattern is long periods of 1–5 minute flapping that culminate in longer periods of downtime, which then suddenly recover. Timestamps can be seen in Pingdom.
In the case of March 12, Modulus was experiencing an issue where a percentage of servos failed to start due to an `EADDRINUSE` error, so re-deploying wasn’t even possible without their intervention. They had this to say:
It’s possible this incident was unrelated, but it does make me wonder about the child processes we spawn. Charlie previously showed us examples of zombie processes spawned from Solidus servers. I haven’t heard of further incidents, though; this may have predated the switch to node-worker-farm.
Are any child processes spawned for resource requests?
Resiliency
We may not be able to identify the exact cause of this issue if it is truly intermittent network latency. But if that’s the case, I think we need more resilient resource fetching that doesn’t bring an entire site down. In Orator, a site continued to load even if resource requests failed; this led to issues like disappearing content, but it otherwise maintained some level of uptime.
There is a 3.5 second timeout for resources, but is the fetching synchronous? I’m wondering why the rest of the site doesn’t respond if all the resources time out around the same time.
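One non-blocking approach would be to race each resource fetch against its timeout and fall back to a placeholder, so a hung resource degrades one piece of content instead of the whole response. This is a sketch under assumptions: `fetchResource` is hypothetical, and the 3.5s figure comes from the timeout mentioned above, not from reading the Solidus source.

```javascript
// Race a resource fetch against a timeout. On timeout (or error) the
// resource resolves to null, so the template can still render with that
// one piece of content simply missing.
const RESOURCE_TIMEOUT_MS = 3500;

function fetchWithFallback(fetchResource, url, timeoutMs = RESOURCE_TIMEOUT_MS) {
  const timeout = new Promise((resolve) =>
    setTimeout(() => resolve(null), timeoutMs)
  );
  // Whichever settles first wins; neither branch ever rejects, so a hung
  // or failing resource can't take the page render down with it.
  return Promise.race([
    fetchResource(url).catch(() => null),
    timeout,
  ]);
}
```

Rendering with `Promise.all` over a set of `fetchWithFallback` calls would then always settle within the timeout budget, which is roughly the Orator behavior described above.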