internetarchive / warcprox

WARC writing MITM HTTP/S proxy
378 stars 54 forks source link

Increase remote_connection_pool maxsize #138

Closed vbanos closed 4 years ago

vbanos commented 4 years ago

We noticed a lot of log entries like this in production:

WARNING:urllib3.connectionpool:Connection pool is full, discarding
connection: static.xx.fbcdn.net

this happens because we use a PoolManager and create a number of pools (param num_pools) but the number of connections each pool can have is just 1 by default (param maxsize is 1 by default).

urllib3 docs say: maxsize – Number of connections to save that can be reused. More than 1 is useful in multithreaded situations. Ref: https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool

I suggest to use maxsize=10 and re-evaluate after some time if its big enough. (We'll see if we still get "Connection pool is full" warnings in the logs).

This improvement will boost performance as we'll reuse more connections to remote hosts.

nlevitt commented 4 years ago

I think leaving maxsize at the default 1 was just an oversight. The // 6 in calculation of num_pools suggests that the intention was to have the max number of open connections == max_threads. Which also suggests to me that maxsize=6 is the appropriate value, again following the browser convention of max 6 connections per host.

-            num_pools=max((options.max_threads or 0) // 6, 400))
+            num_pools=max((options.max_threads or 0) // 6, 400), maxsize=10)
vbanos commented 4 years ago

@nlevitt its possible that multiple browsers will use one warcprox instance at the same time. In fact, in SPN2 production 5 browsers use the same warcprox instance. In that case, there could be more than 6 concurrent connections to the same host.

nlevitt commented 4 years ago

I still think 6 is a good number. max open connections = max threads seems like a sensible convention. What is the sense of // 6 in the num_pools calculation otherwise? 6 is already a lot of connections to maintain to one host. Making a connection is not all that expensive. Increasing maxsize uses up more filehandles. It's not at all obvious what the optimal numbers are for num_pools and maxsize. Maybe a larger value for num_pools (more hosts) and smaller for maxsize (fewer connections per host) would turn out to work better. Of course it depends on the workload. We're groping in the dark here. With that in mind, there is at least logic behind choosing 6.

vbanos commented 4 years ago

OK, I like your reasoning. I set it to 6.

nlevitt commented 4 years ago

Thanks!