Usage report: timeouts and graceful shutdown

davidbalbert commented 1 year ago

I needed to write a simple task specific HTTP reverse-proxy. I spiked out the initial version in Go, but we're a Ruby shop, so I then attempted to translate it into Ruby with async-http. I ran into enough issues that we had to stick with the Go version. Here are the things I wasn't able to figure out.

Graceful shutdown of Async::HTTP::Server – The only way to stop the server seems to be calling stop on its task. I needed something closer to Server.Shutdown in Go's net/http package. The server should stop accepting new connections immediately, and then should wait a configurable amount of time for existing connections to finish. Something like this:

task.with_timeout(10) do
  server.shutdown
end

I didn't have to deal with long-running connections like WebSockets, so this API would be good enough for me. Go provides func (srv *Server) RegisterOnShutdown(f func()) for that use-case (f gets called by Shutdown) but I'm not sure that's totally necessary. I don't understand the use-case well enough.

Read timeouts – I needed to be able to make sure nobody could DOS the proxy opening up a ton of connections and not writing any data to them, eventually using up all available file descriptors. I tried setting Async::HTTP::Endpoint#timeout which does in fact time out the request in that situation, but that would raise an exception that I wasn't able to catch, and (I believe) the exception was raised all the way up through Server#run which meant that once the first request timed out, the task accepting new connections was dead. I wasn't able to find a place where I could rescue that exception usefully.

I attempted to subclass Server to get this working, but based on the way that the server ping-pongs control to the Endpoint and back again during accept, there didn't seem to be any method I could override to make this work.

Having more control over where tasks get canceled – This is probably more for async than async/http. If you call Task#stop on another Task, I believe it'll stop the next time that task yields back to the scheduler (for IO, etc.). It would be nice to have more control of when the task stops – e.g. to be able to clean up some in-flight work. In Go, the context API gives you this control. Goroutines can get canceled wherever you read from the Done channel. I'm not sure that the context API is the best one to copy (I found it pretty confusing at first), but it does give you this capability.

In the graceful shutdown example above, I'm not actually guaranteed that "stop accepting new connections" work will happen before the timeout hits, though it's admittedly very unlikely that it wouldn't be done within 10 seconds.

Hopefully this is helpful. I'm in awe of the amount of work that's gone into making the Ruby interpreter and ecosystem work with Async. I never thought we could get something quite so nice given that the language wasn't originally designed with this execution model in mind. Thank you!

ioquatix commented 1 year ago

Thanks for the great feedback.

the task accepting new connections was dead

Are you sure about that? That shouldn't be the case.

davidbalbert commented 1 year ago

You’re welcome!

Re: the task accepting new connections dying, I thought I was, but I could be wrong. I’ll come up with a simple repro this weekend to check myself.

davidbalbert commented 1 year ago

My curiosity got the better of me and I wrote the test case now. You're right. The task accepting new connections doesn't die.

There's just not anywhere for me to handle the Async::TimeoutError, which is a much lesser problem, though still somewhat unwieldy.

For reference, here's the script I used:

require 'async/http'

endpoint = Async::HTTP::Endpoint.parse("http://[::]", timeout: 5)

Async do |task|
  task.async do
    server = Async::HTTP::Server.for(endpoint) do |request|
      Protocol::HTTP::Response[200, {}, "Hello, world!\n"]
    end

    server.run
  end
end

Then ran the script as well as nc localhost 80. After 5 seconds, nc died, and I got this in the Ruby logs:

  7.0s     warn: Async::Task [oid=0x280] [ec=0x294] [pid=99532] [2023-03-03 19:39:16 -0500]
               | Task may have ended with unhandled exception.
               |   Async::TimeoutError: execution expired
               |   → /Users/david/.gem/ruby/3.2.1/gems/async-2.4.0/lib/async/wrapper.rb:35 in `wait_readable'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-io-1.34.3/lib/async/io/generic.rb:220 in `async_send'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-io-1.34.3/lib/async/io/generic.rb:69 in `block in wrap_blocking_method'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-io-1.34.3/lib/async/io/stream.rb:263 in `fill_read_buffer'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-io-1.34.3/lib/async/io/stream.rb:133 in `read_until'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-http-0.60.1/lib/async/http/protocol/http1/connection.rb:51 in `read_line?'
               |     /Users/david/.gem/ruby/3.2.1/gems/protocol-http1-0.15.0/lib/protocol/http1/connection.rb:160 in `read_request'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-http-0.60.1/lib/async/http/protocol/http1/request.rb:31 in `read'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-http-0.60.1/lib/async/http/protocol/http1/server.rb:40 in `next_request'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-http-0.60.1/lib/async/http/protocol/http1/server.rb:61 in `each'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-http-0.60.1/lib/async/http/server.rb:56 in `accept'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-io-1.34.3/lib/async/io/server.rb:32 in `block in accept_each'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-io-1.34.3/lib/async/io/socket.rb:73 in `block in accept'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-2.4.0/lib/async/task.rb:107 in `block in run'
               |     /Users/david/.gem/ruby/3.2.1/gems/async-2.4.0/lib/async/task.rb:248 in `block in schedule'

And after that, I was able to curl localhost and get "Hello, world!" in response.

ioquatix commented 1 year ago

My curiosity got the better of me and I wrote the test case now.

It's okay, it also happens to me, I believe it's sometimes called "nerd sniping" =)

Regarding timeouts, it's a very tricky thing to handle generally, as the I/O instance with the timeout can be passed around to different tasks.

What you can do is use rescue and ensure around your own code to make sure any resources are tidied up.

Additionally, if you use the streaming model, you will almost certainly completely control the reading and writing (but it won't always be via the direct I/O). In that case though, if you time out, it will also be propagated to the next call to read/write in your streaming response body.

socketry / async-http

Usage report: timeouts and graceful shutdown #114