socketry / async-http

MIT License
298 stars 45 forks source link

Internet: Support specifying the client's connection limit #19

Closed judofyr closed 3 years ago

judofyr commented 5 years ago

There doesn't seem to be a way to specify the client's connection limit setting when you're using Async::HTTP::Internet. This is a problem for me as I'm getting throttled (violently) by the server. It would be great if async-http could handle the throttling instead of me adding another pool-abstraction.

judofyr commented 5 years ago

I also tried to add support for it myself, but I couldn't find a good way to write a test for it…

ioquatix commented 5 years ago

The simplest way to do this is to use a semaphore to limit the number of active connections.

require 'async'
require 'async/http/internet'
require 'async/semaphore'

Async do
    internet = Async::HTTP::Internet.new
    semaphore = Async::Semaphore.new(2)

    keywords = ["apples", "oranges", "bananas", "peaches", "apricots"]

    keywords.each do |keyword|
        semaphore.async do |task|
            url = "https://google.com/search?q=#{keyword}"

            task.logger.enable(task)

            task.logger.debug(task) {"Fetching #{url}..."}
            request = internet.get(url)

            count = request.read.scan(keyword).count
            task.logger.info(task) {"Found #{count} #{keyword}"}
        end
    end
ensure
    internet&.close
end
ioquatix commented 5 years ago

There is no guarantee about how the underlying Async::HTTP::Internet instance will manage connections... if you connect to different hosts, or whether you can use HTTP/2 multiplexing. The pool will try to make as many connections as required to minimise latency and maximise concurrency. It's up to you to decide whether this is appropriate or not.

If you are willing to expend more effort, you can specify the maximum number of connections (connection_limit:) here https://github.com/socketry/async-http/blob/master/lib/async/http/client.rb#L36

We could pass this as an option via Async::HTTP::Internet but it would apply to all connections. For HTTP/2 and HTTP/1 the semantics are a bit different, since HTTP/1 we should allow 4-8 connections, but HTTP/2 might be okay with 1 connection multiplexing 100 requests (typical default).

ioquatix commented 5 years ago

This forces to use only one HTTP connection.

I show debug output.

require 'async'
require 'async/http/url_endpoint'
require 'async/http/client'

Async do |task|
    task.logger.debug!

    endpoint = Async::HTTP::URLEndpoint.parse("https://google.com")
    client = Async::HTTP::Client.new(endpoint, connection_limit: 1)

    keywords = ["apples", "oranges", "bananas", "peaches", "apricots"]

    keywords.each do |keyword|
        task.async do
            url = "/search?q=#{keyword}"

            task.logger.enable(task)

            task.logger.debug(task) {"Fetching #{url}..."}
            request = client.get(url)

            count = request.read.scan(keyword).count
            task.logger.info(task) {"Found #{count} #{keyword}"}
        end
    end
ensure
    client&.close
end

HTTP/2 allows multiplexing so it's actually okay.

 0.83s: #<Async::HTTP::Pool resources=1/100 limit=1>
      | Wait for resource #<Async::HTTP::Protocol::HTTP2::Client:0x00007fa79b8b8db0>
 0.83s: #<Async::HTTP::Pool resources=2/100 limit=1>
      | Wait for resource #<Async::HTTP::Protocol::HTTP2::Client:0x00007fa79b8b8db0>
 0.83s: #<Async::HTTP::Pool resources=3/100 limit=1>
      | Wait for resource #<Async::HTTP::Protocol::HTTP2::Client:0x00007fa79b8b8db0>
 0.83s: #<Async::HTTP::Pool resources=4/100 limit=1>
      | Wait for resource #<Async::HTTP::Protocol::HTTP2::Client:0x00007fa79b8b8db0>
ioquatix commented 5 years ago

One thing we could probably do is have a default connection limit based on the endpoint/protocol. That means that by default the connection pool would only make one connection for HTTP/2 and I guess 4-8 for HTTP/1.

judofyr commented 5 years ago

One thing we could probably do is have a default connection limit based on the endpoint/protocol. That means that by default the connection pool would only make one connection for HTTP/2 and I guess 4-8 for HTTP/1.

This is the feature I was thinking about. That way I can start plenty of requests and I'll get maximum concurrency without spamming a single server. If I want to limit the overall request-flow I could then wrap it in a semaphore.

ioquatix commented 3 years ago

Here is one idea: to implement this, make a sub-class of Async::HTTP::Client.

class RateLimitedInternet < Async::HTTP::Internet
    def client_for(endpoint)
        connection_limit = endpoint.protocol.multiplex? ? 1 : 8
        Client.new(endpoint, **@options, connection_limit: connection_limit)
    end
end

Something like this will work. Maybe we can do it by default for Async::HTTP::Internet. Not sure yet.

ioquatix commented 3 years ago

I researched the limits.

According to https://tools.ietf.org/html/rfc7230#section-6.4 there is no specific upper bound for the number of connections, although most browsers limit it to 6-8.

ioquatix commented 3 years ago

We could add something like the following:

            def multiplexed?
                self.protocol.multiplexed?
            end

            def default_connection_limit
                if self.multiplexed?
                    1
                else
                    8
                end
            end

However, it's tricky to know in advance whether this applies to HTTPS, since it could negotiate either HTTP/1 or HTTP/2... this is non-trivial to solve. It's entirely possible the first connection could be HTTP/1 and the second connection could be HTTP/2.

ioquatix commented 3 years ago

All things considered, I think the best solution would be something like:

internet = Async::HTTP::Internet.new(connection_limit: 8)

I think this is reasonable enough to solve your problem without over-complicating the implementation. I'm not sure there is a significant advantage to having protocol specific limits.

judofyr commented 3 years ago

All things considered, I think the best solution would be something like:

internet = Async::HTTP::Internet.new(connection_limit: 8)

I think this is reasonable enough to solve your problem without over-complicating the implementation. I'm not sure there is a significant advantage to having protocol specific limits.

Is this the total limit of number of open connections allowed? I would prefer it a per-host (per-protocol is not so important), but I can always accomplish that by creating my own Internet-instances per-host.

ioquatix commented 3 years ago

It's per-client/host.