h13 errors after 5 seconds

ashanbrown commented 10 years ago

I've had to stop using nginx because I was getting a significant number of h13 errors after exactly 5 seconds of service on a request (according to the nginx log). Occasionally I would see 13 after 10 or 15 seconds but mostly it was after exactly 5. Cutting out nginx fixed the problem. I would really welcome any thoughts about why this might be happening. The only 5 second timeout I know about is the database pool connection timeout, but I don't see why adding nginx would affect that (and it is supposed to log a message in my app, although maybe it's just failing).

ashanbrown commented 10 years ago

Meant to add that I'm using 2 puma worker processes per dyno.

gohanlon commented 10 years ago

Disclaimer: I've used Nginx in front of Ruby app servers for years (Unicorn mostly), but just started using it on Heroku a couple days ago. I myself am very new to using this buildpack and to using Nginx on Heroku.

I don't think your problem has to do with Puma or anything in your Ruby code. I think it has to do with slow or badly behaved clients (e.g a mobile phone on Edge network or a badly written bot), and perhaps an unnecessary settting in this buildpacks Nginx config.

In the context of Nginx, the client is the Heroku Router. The Router is opening a connection with the Nginx server running on your Heroku Dyno, but then 5 seconds pass without Nginx receiving any part of the request body from the Router. (If Nginx had received even part of the body and then received nothing further, your Nginx processing time until the timeout would be 5 seconds plus at least a few milliseconds.)

The lack of a request body is an upstream issue from your Dyno — either the Router is being slow in getting the end-client request relayed to your app, or the actual end client hasn't sent the request body to the Router. Unfortunately, I've seen problems with the Heroku Router before and I personally don't understand it completely enough, so I'd consider it a suspect until you know otherwise.

A timeout after exactly 5 seconds points at this buildpack's Nginx config line:

client_body_timeout 5;

I don't see any other timeout values of 5 seconds in this buildpack's Nginx config or in the default config values for Nginx. Here're the Nginx docs for client_body_timeout:

Defines a timeout for reading client request body. The timeout is set only for a period between two successive read operations, not for the transmission of the whole request body. If a client does not transmit anything within this time, the 408 (Request Time-out) error is returned to the client.

Nginx's default value for client_body_timeout is 60 seconds — this buildpack dramatically reduces this timeout. Personally, I've always used the default for this setting and never had a problem, probably because Nginx can keep a ton of open client connections while consuming a tiny amount of resources. (Nginx needs about 10MB of RAM per 10,000 connections, plus RAM for the buffers being flushed to those clients.)

It's also interesting that the analogous Router client timeout is 30 seconds after the Router connects to your Dyno.

While reducing Nginx's client_body_timeout does appear to be a common optimization, I'd argue it's a premature optimization unless you're actually in need of reducing the number of connections Nginx is keeping open. This optimization isn't likely to be worth much consideration unless you have tens of thousands of simultaneous clients.

Awesomely, this buildpack understands the need for individual apps to be responsible for their own Nginx config, and it looks really straightforward: https://github.com/ryandotsmith/nginx-buildpack#customizable-nginx-config

If I were you, I'd go ahead and follow the directions for using your own Nginx config, and then remove the client_body_timeout setting from your config (letting it be the default of 60 seconds). If that fixes your problem you're done (perhaps unless you already have 10s of thousands of simultaneous clients).

If that doesn't fix your problem I'd be interested in knowing that, too.

Lastly, consider whether these H13s are a problem at all. It's not that unusual for clients to start a request and then abandon it for some reason before it's been completely sent. Networks fail. Clients close connections mid-transfer because the user hit stop. As long as it's not happing too frequently or in instances where you know it shouldn't, it may be safe to ignore.

You may also find these two Heroku docs useful:

ashanbrown commented 10 years ago

Thanks for the awesome response. Indeed, I do have a lot of slow mobile clients, so your explanation sounds good. The whole reason I wanted to use nginx was because of these slow body uploads, because my understanding was that otherwise they would tie up one of my puma threads. I'll give a higher timeout a shot.

ashanbrown commented 10 years ago

@gohanlon One other question for you since you posted all those docs about routing: any idea how many requests could I end up with in nginx on on one dyno? There is a a max of worker_processes * worker_connections but it looks like, at least with a small number of dynos, the limit would probably be the 50 * imposed by each heroku router. This number can still be pretty large. The heroku docs say that a dyno has 30 seconds to handle a request once the connection is made (neither refused nor timed-out). My concern is that by having nginx in the middle, suddenly the dyno can accept far more connections that it could have before, and that the heroku router will just shovel work to the dyno without any backpressure, and before long you'll have more than 30 seconds worth of work in queue for a dyno. Is there something I'm missing or do I just have to run enough dynos so that heroku's random routing algorithm doesn't end up overloading one of them?

ryandotsmith / nginx-buildpack

h13 errors after 5 seconds #42