socketry / falcon

A high-performance web server for Ruby, supporting HTTP/1, HTTP/2 and TLS.
https://socketry.github.io/falcon/
MIT License

Falcon for large edge includes (developing/applying IO strategy) #40

Open julik opened 5 years ago

julik commented 5 years ago

Thank you for exploring async in Ruby! We are currently looking at raising the throughput of our download servers (if you are curious, there is a presentation about it at https://speakerdeck.com/julik/streaming-large-files-with-ruby - see specifically slide 16). Our current pipeline is:

[libCURL get to *FILE] -> [Ruby File object] -> [non-blocking sendfile + IO#wait when EAGAIN]

This is repeated multiple times to "splice" response bodies together, and it works really well - except that one response served this way consumes an entire thread. Async and fibers seem to be a good way to approach this problem, and nio4r also seems to be a great option because I can leverage the ByteBuffer implementation and the IO reactor. But this is where the obvious question arises. In the scheme we have at the moment (see above) there are two crucial elements: we download upstream data into a file buffer (in reality it lives on a RAM filesystem) which is not in the Ruby heap, and we then tell the kernel to write that file into the socket that services our client on the Puma side - so again no bytes enter the Ruby heap. In practice this means we have next to no allocation overhead during streaming, regardless of the size of the workload / the number of upstream requests we need to perform. nio4r supports this usage pattern if you use raw sockets and a ByteBuffer; from what I understood in the docs, something like this:

buf = NIO::ByteBuffer.new(24 * 1024)
while ... do # check for EOF etc.
  buf.read_from(file)   # if I get 0 as the return value I could call task.yield here (unlikely)
  buf.flip
  buf.write_to(socket)  # if I get 0 as the return value I could call task.yield here as well (very likely)
  buf.clear             # reset the buffer so it can be filled from the file again
end

This would allow reuse of the ByteBuffer, and while it is not as efficient as sendfile() - it effectively becomes read(infd, buf, size); write(outfd, buf, size) - it would still allow us to accomplish what we need (not introducing these heaps of data into the Ruby string heap).

I have taken a peek at the beers.rb example and what I could envision is something like this if I reproduce it:

def stream_through(output_body, task, url, headers_for_upstream)
  # Obtuse URL management is not nice :-(
  uri = URI(url)
  base_uri = uri.dup
  base_uri.path = ""
  base_uri.query = nil

  endpoint = Async::HTTP::URLEndpoint.parse(base_uri.to_s)
  client = Async::HTTP::Client.new(endpoint)
  request = Async::HTTP::Request.new(client.scheme, uri.hostname, "GET", uri.request_uri, headers_for_upstream)
  response = client.call(request)
  if (200..206).cover?(response.status) && body = response.body
    while chunk = body.read
      output_body.write(chunk)
      task.yield # Inserted for completeness sake
    end
  end
  response.close
end

run do |env|
  known_content_length = ... # I do know it in advance
  remote_urls_to_splice = ... # as well as these

  current_task = Async::Task.current
  async_output_body = Async::HTTP::Body::Writable.new
  current_task.async do |task|
    remote_urls_to_splice.each do |url_string|
      stream_through(async_output_body, task, url_string, {})
    end
  ensure
    async_output_body.close
  end

  [200, {'Content-Length' => known_content_length}, async_output_body]
end

The problem we have is that this will work well when the data transfer is relatively low-volume (chats, websockets etc.). But for us it will immediately blow up the Ruby heap with strings, since Async::HTTP::Body::Writable, from what I can see, is basically a "messagebox" (channel) for Strings. Memory use will probably be similar to what you could achieve with Rack's #each on a streaming body yielding Strings (we tried - it is immense, and the application stops fitting in RAM very quickly). What I want to do instead is pass the lowest-level possible objects to the reactor and tell it "Dear reactor, please use this fixed buffer to copy N bytes from this fd to that fd, and if there is an EAGAIN, yield and try again later". But if both my "upstream" response body and my writable server response body are messageboxes for Strings, this option doesn't actually exist, right?
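
To illustrate the pattern I am after in plain Ruby (a sketch only - src_io, dst_io, wait_readable and wait_writable are stand-ins for whatever the reactor would provide to park the current fiber, not existing falcon or async-io API):

buffer = String.new(capacity: 64 * 1024)

loop do
  begin
    chunk = src_io.read_nonblock(64 * 1024, buffer) # refill the single reusable buffer
  rescue IO::WaitReadable
    wait_readable(src_io) # park this fiber until src_io has data again
    retry
  rescue EOFError
    break
  end

  until chunk.empty?
    begin
      written = dst_io.write_nonblock(chunk)
      chunk = chunk.byteslice(written..-1) # keep only what the kernel did not accept
    rescue IO::WaitWritable
      wait_writable(dst_io) # park this fiber until dst_io drains
      retry
    end
  end
end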

Strictly speaking - yes, I am looking for an alternative to a non-blocking splice(). I can have plain (non-SSL) upstreams if that makes the job easier, and I can also omit the "buffer to file first" step if the rest of the setup works well. Everything in the setup is strictly HTTP/1.1 at this point, and the previous implementation even used HTTP/1.0 for simplicity's sake.

So the question is, I guess - is this kind of workload a fit for falcon? Is it a good fit for nio4r? I do have the feeling that orchestrating these large-volume IO operations from Ruby should be perfectly feasible, but when I examine the examples and the involved collaborator modules all I see is Strings, Strings, Strings... (primarily in async-http). Is there maybe some kind of wrapper around the nio4r ByteBuffer that I could use as the async response body instead?..

Maybe somehow get access to the actual output socket Falcon sets up (a-la Rack hijack) and perform non-blocking IO on that socket manually via nio4r?

I believe this is intimately related to https://github.com/socketry/falcon/issues/7 among others.

Edit: or if there is something I could use to no-string-copy from the async-http client body to the writable body of my HTTP response that could work too 🤔

ioquatix commented 5 years ago

I think you should try it first and then see if memory usage / string allocation is actually an issue. We've already done some optimisation in this area (minimising string allocations). Once you've figured out the specific code paths that are causing junk to be allocated, it can be raised as an issue.

Regarding splicing from input to output, it's not protocol agnostic especially w.r.t. HTTP/2.

That being said, maybe there is room for a HTTP/1 specific code path which suits your requirements.

The NIO4R byte buffer hasn't been too useful in practice, but maybe we could make it work better if we know specifically which parts aren't up to scratch.

julik commented 5 years ago

👍 Thanks, we will do some stress testing.

ioquatix commented 5 years ago

Did you make any progress on this?

julik commented 5 years ago

We did. We ran falcon with 6 worker processes behind nginx on one of our production instances (where Puma used to run). We had to switch from the Unix socket Puma uses to a TCP socket for that.

I also had to implement a backstop for the async backpressure issue, which would otherwise destroy us - something like this:

  def wait_for_queue_throughput(output_body, max_queue_items_pending, task)
    # Ok, this is a Volkswagen, but bear with me. When we are running
    # inside the test suite, our reactor will finish _first_, and only _then_ will our
    # Writable body be read in full. This means that we are going to be
    # throttling the writes while on the other end nobody is really reading much.
    # That, in turn, means that the test will fail, as the response will not
    # be written in full. There, I said it. This is Volkswagen.
    return if 'test' == ENV['RACK_ENV']

    # and then see whether we can do anything
    max_waited_s = 15
    backpressure_sleep_s = 0.1
    waited_for_s = 0.0
    while output_body.pending_count > max_queue_items_pending
      LOGGER.debug { "Slow client - putting task to sleep" }
      waited_for_s += backpressure_sleep_s
      if waited_for_s > max_waited_s
        LOGGER.info { "Slow client, closing" }
        raise "Slow client, disconnecting them"
      end
      # There should be a way to wake this task up when this WritableBody has been read from on the other end
      task.sleep(backpressure_sleep_s)
    end

    # Let other tasks take things off the queue inside the Body::Writable
    task.yield
  end

To do this I had to expose the queue item count on the Body::Writable. The upstream we pull data from is S3 over HTTPS, but we are pulling many different objects at the same time. Since the default chunk size seems to hover around 4KB in our case, I opted for a limit of 8 items on the queue.
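
A wrapper along these lines would be one way to expose such a counter (a simplified sketch, not our exact production code - it assumes only that #write and #read are the producer and consumer entry points, and a real wrapper would need to delegate the rest of the body interface as well):

class CountingBody
  attr_reader :pending_count

  def initialize(writable)
    @writable = writable
    @pending_count = 0
  end

  def write(chunk)
    @pending_count += 1 # producer side: one more chunk queued
    @writable.write(chunk)
  end

  def read
    chunk = @writable.read
    @pending_count -= 1 if chunk # consumer side: one chunk drained
    chunk
  end

  def close(*args)
    @writable.close(*args)
  end
end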

We limited the server to the same number of clients allowed to connect as our current implementation (600) and here is what happened:

[Graph: grafana-dlserver-async]

I think you were right that we needed to test this first - the memory situation seems to be fine and we are not leaking much, at least not over the few hours we ran the test. So hats off to your buffer size choices and to how you managed to reuse a string there ❤️

What we did observe:

It looks like either falcon or nginx is leaking TCP connections. When we looked at netstat we found a ton of connections from nginx downstream to CloudFront origins in TIME_WAIT state. So even though we do send Connection: close on our responses it seems nginx still makes its connections to downstream keepalive, which is not desired. We need to tweak nginx' settings a bit since we do not feel ready exposing a naked Ruby webserver for this workload just yet.

We also dispatch webhooks from this thing (not too many but a few) and during this testing the webhooks were not async-enabled - it was sync Patron, so we were blocking the reactor during their dispatches.

We ping Redis during this procedure and this is not async-enabled, but Redis is run locally and very fast so I doubt it contributes much.

I am intending to force nginx to do a Connection: close, and the webhook dispatch has been replaced by async-http now, so we are going for another round of tests in January. I think we will also reduce the number of processes. But it does seem I need a lower-level IO strategy for this. I am almost contemplating injecting an Async reactor into Puma on a separate thread so that we can "ship off" hijacked sockets to it. Would welcome any advice ;-)

ioquatix commented 5 years ago

That is really useful feedback.

We have async-redis which is pretty decent, but undergoing active development right now.

Falcon does all parsing within Ruby land so it's going to be slower than a server which implements it in C. But for many applications, the overhead is not so big.

Leaking connections seems odd. If you can make a small repro with only Falcon I'd be interested to see it because we also check for leaking sockets. The Falcon test suite is pretty small though.

There are a handful of options.

One thing which might benefit you is the planned work on a C backend for falcon, to optimise the request/response cycle on the server side. This will be an optional paid upgrade. Additionally, if you are interested, I have an open source library which is well proven for handling large numbers of requests and large amounts of data. We can shape this into a custom web server for your exact requirements, and I guarantee you will achieve within a few % of the theoretical throughput of the hardware/VM.

ioquatix commented 5 years ago

Do you mind explaining the path you are taking through Falcon for serving content? Are you using HTTP/1.1? What are you using for the upstream request?

ioquatix commented 5 years ago

I will implement back pressure within the queue too - I'll try to make it in the next release. Your implementation might not be optimal.

julik commented 5 years ago

We are using HTTP/1.1 from falcon to nginx, and HTTP/1.1 from nginx to CloudFront which is our fronting CDN. HTTP/2 is not in the picture for us at the moment. To clarify: nginx is "downstream" for falcon, CloudFront is "downstream" for nginx. Our "upstreams" (servers our Ruby app is making requests to) are S3 for the data we proxy through and a couple of small requests to our internal systems for metadata, also over HTTP/1.0. These do not egress our VPC and are tiny compared to the amount of data "put through" from S3 to downstream.

One thing which might benefit you, is the planned work for a C backend for falcon to optimise the request/response cycle on the server side.

These are interesting propositions. I did look at the business support model for falcon, but I don't think we are ready to commit to it at this stage. First, we have a pretty variable workload, and though we can predict how many proxy servers we are going to run by way of capacity planning, having what is effectively a support contract for that number of servers might not be very prudent at this stage. It might also happen that we replan to use a different runtime and can then drastically reduce the number of servers, since we would be able to saturate their NICs to the maximum. Second, we obviously need to see the requisite performance materialise.

So at the moment I think contributing to the ecosystem with explorations, tests and eventual patches might be a better option, but I might be mistaken ofc.

This will be an optional paid upgrade. Additionally, if you are interested, I have an open source library which is well proven for handling large numbers of request and large amounts of data. We can shape this into a custom web server for your exact requirements and I guarantee you will achieve within a few % of the theoretical throughput of the hardware/vm.

I am interested. There is a bit of a concern for me that building an entirely custom proprietary webserver might be a bad idea from the point of view of my colleagues, since they will also have to support it and debug it should things go south. Let's chat ;-)

Your implementation might not be optimal.

Yes, please ❤️ The best I could find is opportunistically sleep the task for some time, I am certain it could be woken up sooner if the task is somehow coupled to the nio4r monitor.

P.S. I do believe that we could achieve this throughput if it were possible to get access to the nio4r socket objects from within falcon already tho.

ioquatix commented 5 years ago

https://github.com/socketry/async-http/issues/6 is now fixed. It has documentation which might help you.

julik commented 5 years ago

Awesome!

ioquatix commented 5 years ago

First we have a pretty variable workload and though we can predict how many proxy servers we are going to run by way of capacity planning, having what is effectively a support contract for that number of servers might be not very considerate at this stage

If you can think of a better way to do this I am open to ideas.

ioquatix commented 5 years ago

P.S. I do believe that we could achieve this throughput if it were possible to get access to the nio4r socket objects from within falcon already tho.

Realistically, the only way to do something like this would be a partial hijack. That's not supported in Falcon at the moment. But maybe it's possible. Only full hijack is supported, and it's Rack compatible so it returns the raw underlying IO, extracted from the reactor:

https://github.com/socketry/falcon/blob/d19f4d095cb380462a4d7e1abea2d25804c10ebd/lib/falcon/adapters/rack.rb#L133-L142

Maybe there is a better way to do this, or even just expose the IO directly in env.

ioquatix commented 5 years ago

Can you explain your ideal slow client disconnecting policy? e.g. less than x bytes/s for y minutes? or something else?

julik commented 5 years ago

The ideal would be: if the throughput for a client stays below N bytes per second over M seconds, I kick the client out. However, I can in a way "abstract this up", because my chunk size pretty much always ends up being the default non-blocking read chunk size async-http provides, so I can extrapolate from that and disconnect clients if there is no movement in the queue for that much time - which is the abstraction I have found so far. I do have an object that keeps tabs on how much data got sent over the last N seconds and I could use that object as well, but let's contemplate the queue length indicator for a minute.
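
Conceptually that byte-counting object is just a sliding window over recent writes, something like this (a simplified sketch of the idea, not the actual class we run):

class ThroughputGate
  def initialize(min_bytes_per_second:, window_s: 5)
    @min_bytes_per_second = min_bytes_per_second
    @window_s = window_s
    @samples = [] # pairs of [monotonic timestamp, bytes sent]
  end

  def record(bytes_sent)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @samples << [now, bytes_sent]
    # Drop samples that fell out of the window
    @samples.shift while @samples.any? && @samples.first.first < now - @window_s
  end

  # Naive: this reports "slow" until a full window of samples exists
  def too_slow?
    total = @samples.sum { |_, bytes| bytes }
    total < @min_bytes_per_second * @window_s
  end
end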

With the implementation I had there was some measurement, because the task would resume in a polling fashion after some time. With the new LimitedQueue implementation it seems possible that the task can basically be stopped indefinitely if nothing starts reading from the queue, due to the use of a condition. Imagine this:

I did try a simplistic experiment like this:

it 'reactifies' do
  reactor = Async::Reactor.new
  20.times do |i|
    reactor.run do |task|
      set_timer = task.reactor.after(0.1) { $stderr.puts "task #{i} blows up" }
      set_timer.cancel if i != 3
      $stderr.puts "Hello from task #{i}"
      task.yield
    end
  end
end

That does work - task 3 does print its message. But if I raise an exception from the after block, the exception brings down the reactor (which makes sense if I understand Timers correctly, in that the timer is attached not at the task level but at the reactor level). There is also nothing quite like task.raise, which is probably a good thing, since Thread#raise was long considered malpractice. But what else should be used in this case? I could manually sleep the task and do a timer comparison when it gets woken up - even if that would wake the task more often than desired, it would let me preempt the task to do the timer bookkeeping.

Basically I need "some" way to forcibly terminate a task if there is no movement on the queue for that much time. Or some way to poll an object for a decision on whether the client should be disconnected or not - IMO if we poll for it once per second or even less often, the impact on the reactor won't be immense. I might be overthinking it tho...
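
The closest thing I can see to "forcibly terminate" is Async::Task#stop, so one blunt version of what I am after could be a sibling "reaper" task that polls for movement - a sketch only, where stream_body is a placeholder for the producer and pending_count is the counter from the backstop above:

Async::Reactor.run do |task|
  streamer = task.async do |subtask|
    stream_body(subtask) # placeholder: the task that writes chunks into the Writable
  end

  # Reaper: polls once per second and stops the producer if the queue has been
  # stuck above the limit for longer than 15 seconds.
  task.async do |subtask|
    stuck_since = nil
    until streamer.finished?
      subtask.sleep(1)
      if output_body.pending_count > 8
        stuck_since ||= Time.now
        if Time.now - stuck_since > 15
          streamer.stop
          break
        end
      else
        stuck_since = nil
      end
    end
  end
end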

ioquatix commented 5 years ago

This is something I’ve thought about for a while.

If you call socket.read should that operation block indefinitely?

I think the answer is no. Especially not by default.

There should be some logical timeout, or at least a way to specify it explicitly per socket per operation.

Does the OS provide a timeout? If I make a read operation with no data will it still be waiting 100 years later?

A minimum throughput is a similar issue. We have to be careful to design a system that is actually robust against slow clients, ideally not allocating resources in a way which makes it trivial to DoS a service/system.

Mitigating at the queue level doesn't prevent malicious users, because there are other, non-queue-related areas of the protocol which can cause resource exhaustion.

So what I'd like to see is a wrapper around the socket or the stream buffer which handles this for the entire system. Ideally we can specify a policy, e.g. minimum bit rates and timeouts, and have it apply across the entire system.

julik commented 5 years ago

Yep, being able to set a timeout for each read and each write would be ideal. What I have effectively attempted with my polling solution is the write part of that. Moreover, if there is a way to set a timer that will wake and raise into the fiber before calling write, and then to cancel that timer afterwards, we will have a workable solution. BTW, there is some fascinating thinking about cancelable tasks in http://joeduffyblog.com/2015/11/19/asynchronous-everything/ (in case you haven't seen it).

julik commented 5 years ago

Having investigated a bit, would this work? Specifically, will it "override" a Condition?

  TimeoutOnWrite = Struct.new(:async_task, :writable, :timeout_s) do
    def write(data)
      async_task.timeout(timeout_s) { writable.write(data) }
    end
  end

  body = Async::HTTP::Body::Writable.new(content_length, queue: Async::LimitedQueue.new(8))
  Async::Reactor.run do |task|
    body_with_timeout = TimeoutOnWrite.new(task, body, 3) # timeout on write in 3 seconds?..
    # ...many times over, repeatedly etc
    body_with_timeout.write(data_from_upstream)
  end

ioquatix commented 5 years ago

Unfortunately it's not sufficient.

It needs to be in the buffering/socket layer.

julik commented 5 years ago

Yep, I tried that implementation, and though the timeout does fire, it brings down the reactor (and the entire falcon process as well!)

I will revert to my less-than-ideal polling implementation for now.

ioquatix commented 5 years ago

You need to handle the timeout.

Something like

begin
    task.timeout do
        socket.write(...)
    end
rescue
    body.close
end

ioquatix commented 5 years ago

If that's the route you want to go down, even just temporarily, you should probably make a body wrapper with this behaviour. But as I said it's not a fully general solution.
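
Roughly, such a wrapper could be shaped like this (a sketch with invented names, following the task.timeout pattern above; the rescue is deliberately broad because the exact timeout error class depends on the async version installed):

class TimeoutWritable
  def initialize(writable, task, timeout_s)
    @writable = writable
    @task = task
    @timeout_s = timeout_s
  end

  # Give up on the client if a single write cannot be flushed within the timeout.
  def write(chunk)
    @task.timeout(@timeout_s) { @writable.write(chunk) }
  rescue
    @writable.close
    raise # let the producing task die as well
  end

  def close(*args)
    @writable.close(*args)
  end
end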

julik commented 5 years ago

Happy 2019 @ioquatix and other socketry contributors!

We have deployed our falcon-based service as a canary and are observing the results. Meanwhile I am trying to figure out where the limits are regarding the number of clients, and how easy it is for falcon not to saturate the CPU but to "stuff the pipe". To that end I've implemented 3 simple "stuffer" webservers that generate one chunk of random data and then repeatedly send it over the wire to achieve a given content-length.
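
Each stuffer boils down to something like this (a condensed sketch of the falcon/rack variant, not the exact code from the repo - requires, option parsing and headers differ in the real thing):

require 'async'
require 'async/http'
require 'rack/utils'

CHUNK = Random.new.bytes(64 * 1024)

app = proc do |env|
  bytes = Integer(Rack::Utils.parse_query(env['QUERY_STRING']).fetch('bytes'))
  body = Async::HTTP::Body::Writable.new(bytes)

  Async::Reactor.run do
    begin
      remaining = bytes
      while remaining > 0
        slice = remaining >= CHUNK.bytesize ? CHUNK : CHUNK.byteslice(0, remaining)
        body.write(slice) # the same pre-generated chunk, over and over
        remaining -= slice.bytesize
      end
    ensure
      body.close
    end
  end

  [200, { 'Content-Length' => bytes.to_s, 'Connection' => 'close' }, body]
end

run app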

To eliminate the network issues from the equation I tested over loopback for now. The results are interesting.

Go with stuffer.go all default options

julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Connection: close
< Content-Length: 5861125462
< Date: Fri, 04 Jan 2019 12:28:46 GMT
< Content-Type: application/octet-stream
< 
{ [3953 bytes data]
100 5589M  100 5589M    0     0   720M      0  0:00:07  0:00:07 --:--:--  713M
* Closing connection 0

real    0m7.780s
user    0m2.575s
sys 0m4.542s

Falcon with async-io

julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200
< connection: close
< server: falcon/0.19.6
< date: Fri, 04 Jan 2019 12:41:25 GMT
< content-length: 5861125462
< 
{ [16261 bytes data]
100 5589M  100 5589M    0     0   257M      0  0:00:21  0:00:21 --:--:--  260M
* Closing connection 0

real    0m21.739s
user    0m3.840s
sys 0m7.375s
julik@nanobuk stuffer (master) $ 

Puma with partial hijack and blocking write()

julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 9395 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Connection: close
< Content-Length: 5861125462
< 
{ [16384 bytes data]
100 5589M  100 5589M    0     0   831M      0  0:00:06  0:00:06 --:--:--  842M
* Closing connection 0

real    0m6.742s
user    0m2.361s
sys 0m4.110s

The code is in the repo here: https://github.com/julik/stuffer

Unless I have really missed something, there is roughly 3x overhead to these async bodies. Which sort of brings back my original question - is there a way, with the existing async-io model, for me to use the sockets directly and yield them back to the reactor when they would block? Or to have a minimum-size wrapper for this which would work with something like IO.copy_stream or NIO::ByteBuffer, which both expect a real fd to be returned from #to_io?

ioquatix commented 5 years ago

Without digging into it too much (dude, I'm on holiday at the beach), I did a quick test of your code vs using the LimitedQueue and got a 3x perf increase on my old MBP laptop.

Here is your current implementation:

> time curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9292 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9292
> User-Agent: curl/7.63.0
> Accept: */*
> 
< HTTP/1.1 200
< server: stuffer/falcon
< connection: close
< content-length: 5861125462
< 
{ [16384 bytes data]
100 5589M  100 5589M    0     0   127M      0  0:00:43  0:00:43 --:--:--  121M
* Closing connection 0
curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null  2.36s user 3.76s system 13% cpu 43.888 total

Here is using LimitedQueue:

> time curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9292 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9292
> User-Agent: curl/7.63.0
> Accept: */*
> 
< HTTP/1.1 200
< server: stuffer/falcon
< connection: close
< content-length: 5861125462
< 
{ [40960 bytes data]
100 5589M  100 5589M    0     0   310M      0  0:00:18  0:00:18 --:--:--  283M
* Closing connection 0
curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null  2.41s user 3.76s system 34% cpu 18.027 total
julik commented 5 years ago

Without digging into it too much (dude I'm on holiday at the beach)

Man, I envy you - we are freezing here in the northern hemisphere 🥶 Enjoy your holidays ;-) I will do some experiments with the limited queue; I just need to find a stopgap measure for it so that I won't get socket starvation on the reading end (connect to us, read 1 block, then not read anything for a loooong time, all the while keeping the writing task asleep).

ioquatix commented 5 years ago

I think a timeout at the socket level makes sense for almost all protocols.

I don't even know if the timeout can be reset, but something like this would be nice:

Task.timeout(60) do |timeout|
  body.each do |chunk|
    @stream.write(chunk)
    timeout.reset
  end
end

I wonder if @stream.write should take a flush option too, might minimise the overhead of writing many chunks.

Ultimately the implementation might be better at a lower level. I'll think about it.
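
In the meantime, the reset behaviour can be approximated with a watchdog task that compares a "last activity" timestamp against an idle limit - a sketch only, where body and @stream are placeholders and the 5/60 second numbers are arbitrary:

Async::Reactor.run do |task|
  last_write = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  finished = false

  writer = task.async do
    body.each do |chunk|
      @stream.write(chunk)
      last_write = Process.clock_gettime(Process::CLOCK_MONOTONIC) # "reset" the deadline
    end
  ensure
    finished = true
  end

  # Watchdog: wake up periodically and stop the writer if nothing has been
  # written for longer than the allowed idle time.
  task.async do |subtask|
    until finished
      subtask.sleep(5)
      idle = Process.clock_gettime(Process::CLOCK_MONOTONIC) - last_write
      writer.stop if !finished && idle > 60
    end
  end
end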

ioquatix commented 5 years ago

@julik I'm playing around with some timeout/throughput concerns.

I'm thinking that socket timeouts for async operations belong directly in the socket wrapper and apply to all operations.

I also think there are higher level concerns about what constitutes a bad client... but not sure how generally these can be applied or if they should be protocol specific.

I know that slow clients, as part of a DoS, might continue to consume 1 byte per second. But does that really matter? If you put low-throughput logic in place to disconnect sockets, DoS clients can just consume above that water mark. So, can such an approach really protect against malicious clients, or are we just trying to disconnect users who have broken the request upstream somehow (i.e. stopped reading the response)?

ioquatix commented 5 years ago

The other question is, should we have a timeout by default? It seems a bit silly to me that sockets can block code indefinitely.

julik commented 5 years ago

TL;DR:

Do you think it makes sense to have per-socket timeout for any/all async operations? e.g. connect, send, recv, read, write, etc.

Yes.

Do you think it makes sense to track throughput and disconnect if less than some minimum?

Yes, or provide a way to do so (hook into NIO)

Do you have any other ideas about how this should work?

At the minimum - two flags on the falcon executable that would set minimum throughput barriers for reading and writing. They could be default-"off" but you do need them.

All good questions. It is ofc to be debated whether it is possible to protect against both slowloris and slow-read attacks completely. You could say that it is impossible, just as it is very hard to protect from a high-volume attack. But just like bike locks, I think making the attacks less convenient to carry out is a good step to take. Another reason why I think this is especially relevant for Falcon is that, from what I understand, Falcon is aiming to be the webserver, without a fronting downstream proxy like nginx - which in today's setups generally does a good job of dealing with these attack types. But falcon is also supposed to do SSL termination from what I understand (because HTTP/2 and all), and in general it seems to be aiming to become the server for a machine providing a Ruby web application.

So IMO setting at least _basic_ limits to protect people from slow HTTP attacks is in scope for falcon, yes. How it should be configurable I don't know, but I would say a certain number of bytes must flow through the pipe per second over that many seconds (a window average). If this transfer rate is not maintained, the client should be forcibly disconnected. This applies both to reading the HTTP request (slowloris attack) and to writing the response (slow-read attack). So if you ask me: yes, you do need a timeout by default, at least when you are not explicitly in websocket mode where a connection might be sleeping for minutes on end. I am not aware of attacks with "slow connect" but probably there are some 🤷‍♀️

I believe puma does not have slowloris protection, but it reads the request using its own IO reactor, so it probably relies on the "we can handle many, many clients" property of IO reactors for this. For writing, Puma is susceptible to slow read, as one response consumes a thread. This is probably less severe for falcon due to the intrinsic fact that falcon is one big IO reactor, but the max fd limit on the server does become a concern.

That is the "transport" end of the problem, for which there probably should be configurable timeouts on the webserver level (maybe even config options for falcon itself).

In terms of IO - yes, I do believe you want configurable timeouts for all reads and writes, simply because if you do not have them in, say, your HTTP client, it means you can only make requests to trusted HTTP servers, as you have to assume the endpoint will not "hang you up" indefinitely. It is less of a problem of HTTP endpoints being "adversarial" (although it can be - if you do web scraping, for example, it is a real concern!), it is more of a problem of endpoints being badly coded. For example, there is an SSE endpoint in LaunchDarkly which is currently used via async-http. It is designed to send a "ping" message every now and then to keep the connection alive - and all is good as long as that works. But what if it gives you an EAGAIN once and then does not come up in the NIO reactor monitor list for 2 hours? The calling code currently has to manage this and arrange reconnects, if I'm not mistaken. Maybe it is even a feature that belongs in NIO.

For our uses without async-io we opted to configure libCURL with certain timeouts we know are sane for our use cases, and we use both the connect timeout and the throughput gate (the endpoint must furnish that many bytes within that much time, otherwise we bail out).
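
Concretely, the libCURL knobs in question are the connect timeout plus the "low speed" pair (CURLOPT_CONNECTTIMEOUT, CURLOPT_LOW_SPEED_LIMIT, CURLOPT_LOW_SPEED_TIME). Purely for illustration, here is roughly what that looks like with the curb gem - the URL, path and numbers are made up, and I'm assuming curb's setter names for these options:

require 'curb'

easy = Curl::Easy.new("https://example-bucket.s3.amazonaws.com/some-object")
easy.connect_timeout = 5          # give up if the TCP/TLS handshake stalls
easy.low_speed_limit = 64 * 1024  # minimum acceptable bytes per second...
easy.low_speed_time  = 30         # ...sustained for this many seconds, otherwise abort

File.open("/dev/shm/upstream_buffer.bin", "wb") do |file|
  # Buffer to the RAM filesystem (the real setup points libCURL at a FILE* directly)
  easy.on_body { |chunk| file.write(chunk); chunk.bytesize }
  easy.perform
end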

Regarding the service I am testing falcon on - it is more of an issue protecting from homegrown hand-rolled download managers that open a connection for a bytes=..-.. range of content but do not read it, or do not start reading it in time, or opportunistically open many connections using Range headers in the hopes that they will obtain better download speeds that way (which they won't, but they do consume a connection per range).

I also think there are higher level concerns about what constitutes a bad client... but not sure how generally these can be applied or if they should be protocol specific.

I don't know. I do feel that if there is, intrinsically, a pair of objects servicing a particular client (task + writable) there should be a way to exercise at least "some" push control over the reading client socket - to use these objects to "kick" the client out. If these objects wait on a condition variable for the client to have done something in the first place (if it is always reactive) then this becomes pretty hard to do.

With the current setup, what bothers me the most is that I don't know whether falcon will time out a socket in a slow-read situation, and if it will - where I configure that timeout. Cause I sure do need that feature (or I need to dig through the bizarre nginx documentation to figure out all the options that will, at the same time, protect me from these attacks and not buffer too much of the response along the way).

julik commented 5 years ago

For later - here is how many of these "enforce throughput" exceptions we have noticed since we started running the test code (last Thursday):

[Graph: slowloris_count]

Meanwhile, I am going to try to integrate this with the limited queue by putting a few conditionals around the write on the body.

ioquatix commented 5 years ago

I've released async v1.13.0 which changed the internal timeout API to raise the Async::TimeoutError exception and I'm preparing an update to async-io which includes a per-socket timeout_duration:

https://github.com/socketry/async-io/blob/e9e7c268324002dc9e4db0f18a93bc4a0a26b38b/spec/async/io/socket_spec.rb#L87-L100

I'm not sure where the best place to set timeout_duration is, but perhaps the endpoint or accept_each could take it as an option.

This should catch connections that simply stop responding.

It won't catch connections that are maliciously slow though.

For that, we need more advanced throughput calculations.

@julik thanks so much for all your detailed feedback, it's really really awesome.

ioquatix commented 5 years ago

BTW, I haven't released async-io v1.18.0 yet because I don't know if it's the right design or not. So, any feedback would be appreciated.

julik commented 5 years ago

I think this is a great first step - we need to try it live to find out.

I believe it is more than doable to implement a throughput calculator that can be controlled from the writing side. Something like https://gist.github.com/julik/0abd93788293b1ab54cfa1b706570317, as long as there is some guarantee that sockets which did not get select()ed for a long time get forcibly closed and evicted. For example, here https://gist.github.com/julik/f9e1ac1fda26ddc6fb58b024d548c866 I included socket eviction on the condition that a socket was not selected for more than N seconds.

I do not thoroughly understand yet whether calling close on a Writable will lead falcon to close the socket immediately, without waiting for the client to accept the entire body. Basically, what worries me is that I'm missing an understanding of how you get rid of stale and/or maliciously slow connections. Maybe there is a need for some "command bus" attached to a Task or to the Writable - something like force_abort!?

ioquatix commented 5 years ago

If Body#read raises an exception, it is likely to kill the entire connection.

There is no mechanism for doing this with any of the default implementations, but you could certainly do it yourself.

ioquatix commented 5 years ago

Just thinking out loud... when clients go "silent", does that not eventually trigger the TCP keep-alive timeouts within the kernel? Why is it necessary to do this within the application, assuming one could set the TCP keep-alive timeouts to something smallish (say, 2 minutes)? This would avoid the requirement of doing it in application land, which is probably a slight overhead.

I can clearly see the benefit of some application side timeouts, but if this can be done in the kernel would it not be a good idea too?
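
For reference, per socket this would be something along these lines - a sketch with Linux-specific constants (availability and option names vary by platform), and the io here would have to be the raw socket underneath any async wrapper:

require 'socket'

def enable_keepalive(io, idle: 120, interval: 15, probes: 3)
  io.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  io.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, idle)      # seconds of idleness before the first probe
  io.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, interval) # seconds between probes
  io.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT, probes)     # unanswered probes before the kernel drops the connection
end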

julik commented 5 years ago

If Body#read raises an exception, it is likely to kill the entire connection.

Ok, that gets us somewhere. But since I only write into the body from the "emitter" task, how do I ensure the next read will raise, short of subclassing the Body for these needs? And, more importantly, will read get called even if the socket was not select()ed by NIO just prior?

ioquatix commented 5 years ago

Any body wrapper can raise in #read. If an IO is unresponsive for long enough, eventually the timeout will kill it (as discussed/being added).
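
For example, a minimal "evictable" wrapper could look like this (a sketch with my own naming; a complete wrapper would need to delegate the rest of the body interface as well):

class AbortableBody
  class Aborted < StandardError; end

  def initialize(body)
    @body = body
    @aborted = false
  end

  # Called from the producer / policy side to kick the client out.
  def abort!
    @aborted = true
  end

  # The server pulls chunks via #read; raising here tears down the response.
  def read
    raise Aborted, "client evicted" if @aborted
    @body.read
  end

  def close(*args)
    @body.close(*args)
  end
end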

This only applies to HTTP/1. HTTP/2 multiplexed connections might be more tricky. HTTP/3 with UDP will require protocol-level timeouts to free resources.

ioquatix commented 5 years ago

I've just released falcon v0.20.0 which includes a 30s timeout by default for read/write/connect operations.

I've also released an update to async-http v0.37.10 which handles timeout failures for http/2 correctly.

More specs/tests welcome.

julik commented 5 years ago

🎉 There will be. Incredible where we are getting here 🎉

ioquatix commented 5 years ago

Just reading back to see how we are tracking with all the different topics:

It looks like either falcon or nginx is leaking TCP connections. When we looked at netstat we found a ton of connections from nginx downstream to CloudFront origins in TIME_WAIT state. So even though we do send Connection: close on our responses it seems nginx still makes its connections to downstream keepalive, which is not desired. We need to tweak nginx' settings a bit since we do not feel ready exposing a naked Ruby webserver for this workload just yet.

Did you ever find out why this was happening?

We also dispatch webhooks from this thing (not too many but a few) and during this testing the webhooks were not async-enabled - it was sync Patron, so we were blocking the reactor during their dispatches

We ping Redis during this procedure and this is not async-enabled, but Redis is run locally and very fast so I doubt it contributes much

Blocking the reactor could be causing issues, I'm thinking of adding logging to tasks which are hogging the event loop. Did you make any progress on using async replacements?

julik commented 5 years ago

Nope, as it was the weekend and I do not experiment on prod during the weekend at night. But I will once I have a falcon version that has the forced timeout disconnect.

Blocking the reactor could be causing issues,

Once we are committed to falcon I can rewrite our Redis use to be async-redis. At the moment that would be taking it a bit too far - only the streaming backends are pluggable. Redis pings are once per second per task at most.

Did you ever find out why this was happening?

We did find out a ton of connections in TIME_WAIT when we were investigating but our last testing round was cut short. We will try to figure out where the leaks come from.

julik commented 5 years ago

Alright, we finally got to doing another round of tests. This time we elected to remove nginx from the picture entirely and opened bare falcon to the outside. We served prod traffic from one machine to a few hundred clients again and observed the results. On the whole it was really nice: as predicted, we did peg as many cores as the number of falcons we launched, which is fine, and we reached pretty high bandwidth numbers, even though they went up and down. For managing writing I used the following subclass of the writable body: https://gist.github.com/julik/5d93a77e06d9418399cddfcd4368d9c7 which did indeed raise the SlowLoris exception a few times. We also adjusted the kernel keepalive setting to kick out dormant sockets. We did, however, get into trouble on one front. Here are the graphs for the machine we were using. This is the TCP connections:

[Graph: ds_tcpconns]

The two "rocks" are the two time periods when we were running falcon. As you can see, we have pretty continuous growth of sockets in CLOSE_WAIT. When we look at memory usage we see that these sockets also consume system memory - curiously enough, falcon is great with userspace memory use, can't commend your buffering strategy enough 💯

[Graph: ds_mem]

We also got multiple warnings from the kernel about TCP memory buffers being too low for the number of open sockets, and we had to check sockstat and raise the number of pages from the default. It did not help much. Neither did adjusting keepalive. When that memory was exhausted completely, the network throughput on the box would grind to a halt, painting the following picture:

[Graph: ds_constip]

Our primary workload is basically proxying stuff from S3. So we decided to investigate which sockets were kept open, and the list that came out from netstat was this:

https://gist.github.com/julik/8b96f1dc59c1eed2cc48529d0d27ee1f

This is the result piped to a grep of CLOSE_WAIT.

So it seems that falcon, together with async-http, is leaking sockets to S3 after servicing a request to S3. We did not see elevated error rates anywhere in the application, but there is certainly socket leakage, otherwise we would not have observed this kind of telemetry. Also, if we can somehow solve this, I think we could confidently deploy the falcon solution onto the fleet. For reference, this is the method that handles pulling from S3: https://gist.github.com/julik/1e55255c8f582d3d96f7a93570dcc3d1

(and yes - I did get very hell-bent on "really really really closing the connection" so I did it three times, I wish I didn't have to - and yet it didn't seem to help).

I suspect there is a relatively straightforward something that does not explicitly close the socket hoping it is going to be reused for some subsequent request, and it does not get reused for that subsequent request. Or there is just no coupling between the response body of the HTTP client response and the actual socket in the reactor 🤔

I could share the entire bit of code responsible for this streaming process - it is not that big - but not completely in public since there are a few proprietary bits. Would love to have pointers on how to diagnose this further. I do see an issue about socket leaks here https://github.com/socketry/async-io/issues/14 but it seems to be during connect()...

ioquatix commented 5 years ago

I will investigate.

julik commented 5 years ago

👍 Note that the red line in the TCP connections graph is the sockets in TIME_WAIT. After adjusting the TCP keepalive as you recommended, we restarted falcon and ran the second round, where, as you can see, sockets are no longer piling up in TIME_WAIT as much - but we still have linearly growing CLOSE_WAIT sockets.

ioquatix commented 5 years ago

One thing that may also be useful, is to send USR1 to one of the falcon processes and check the output when it has a handful of close_wait sockets. There is a chance, maybe, that it shows a list of connections and what tasks created them, what they are doing, etc.

julik commented 5 years ago

I will try to find a way to have one canary node with falcon instead of puma to play with, otherwise I need to wait at least until next Fri. till I can do the next round of testing

ioquatix commented 5 years ago

I think there is a chance I can reproduce this with specs, so don't worry about it, just next time you do it, try to grab the task tree from the falcon worker process (kill USR1 it).

ioquatix commented 5 years ago

It would be helpful to see the code which serves the request, do you think you can share it privately? Just flick me an email. Or a private gist. Whatever works.

julik commented 5 years ago

Done.

ioquatix commented 5 years ago

Okay, I took a look.

I guess there is more than one problem, but the first one I could see is that SpecialWritable#write raises an exception, and while this technically should be okay based on what you've done, it might be better to do it using self.close(error) which propagates the error to both ends. In any case, I see you are calling #close on the writable in the ensure block, so that part seems okay...

At a guess what is going wrong is something upstream. As your logs showed, the problem was with connections to AWS.

The main problem area would be def stream_through.

Firstly, making a new client for every request is not very efficient since persistent connections can't be used so it will unnecessarily increase latency.

I suspect something odd going on in that method, probably it's blocking on the upstream for some reason.

What seems odd to me is that the remote end is closing the connection (CLOSE_WAIT) but it sounds like it would be blocking in while chunk = body.read. That means the connection is being closed in the lower level, but it's not propagating correctly to the body. This is a very heavily specced area of the code, but that doesn't mean there aren't some odd edge cases which haven't been considered.

One way to detect this would be to add a timeout, e.g..

while chunk = task.with_timeout(120) {body.read}
  # ...
end

If you get timeout exceptions, it will help us understand where the problem is coming from.

ioquatix commented 5 years ago

Here is an implementation of a slowloris body, but I haven't released it yet because I'm not sure if it's a good idea.

https://github.com/socketry/async-http/blob/master/lib/async/http/body/slowloris.rb

What do you think?