internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Cache bad target hostname:port to avoid reconnection attempts #131

Closed: vbanos closed this 5 years ago

vbanos commented 5 years ago

If a connection to a hostname:port fails, add it to a TTLCache with a 60-second expiration time. Subsequent requests to the same hostname:port return almost immediately, because we check the cache first and skip the new network connection attempt.

The short expiration time means that if a host recovers, we will be able to connect to it again within 60 seconds.

Adding the cachetools dependency was necessary because the stdlib does not provide an expiring in-memory cache. The library has no dependencies of its own, has good test coverage, and seems to be maintained. It also supports Python 3.7.
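For readers following along, the idea described above can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the class name `BadTargetCache`, the `opener` and `timer` parameters, and `maxsize=1024` are assumptions made for the example. Only `cachetools.TTLCache` and the 60-second TTL come from the description.

```python
import socket
import threading
import time

from cachetools import TTLCache  # third-party: pip install cachetools


class BadTargetCache:
    """Hypothetical helper; not warcprox's actual class or attribute names."""

    def __init__(self, ttl=60, maxsize=1024, timer=time.monotonic):
        # Entries disappear on their own `ttl` seconds after insertion.
        self._cache = TTLCache(maxsize=maxsize, ttl=ttl, timer=timer)
        # cachetools caches are not thread-safe, so guard every access.
        self._lock = threading.Lock()

    def connect(self, hostname, port, timeout=5.0,
                opener=socket.create_connection):
        key = "%s:%s" % (hostname, port)
        with self._lock:
            if key in self._cache:
                # Fail fast: this target failed less than `ttl` seconds ago.
                raise ConnectionError("%s is cached as unreachable" % key)
        try:
            return opener((hostname, port), timeout=timeout)
        except OSError:
            # Covers DNS failures, refused connections and timeouts.
            with self._lock:
                self._cache[key] = True
            raise
```

The network call happens outside the lock, so one slow target does not block lookups for other targets. Once the entry expires, the next request attempts a real connection again.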

vbanos commented 5 years ago

Performance improvement examples: the 1st run of: time curl --proxy http://localhost:8000/ http://invalid123.com/ takes 0.1 sec (a DNS request is required); the 2nd run takes 0.05 sec.

The 1st run of: time curl --proxy http://localhost:8000/ http://vbanos.gr/slow.php takes 3.8 sec (the URL is slow to respond, so the request runs until the timeout); the 2nd run takes 0.01 sec.

vbanos commented 5 years ago

A minor limitation of the current implementation is that the cached response is always status=502, message timed out. The initial error could be anything in the 5xx range; e.g. for a DNS failure it is status=500, Name or service not known. Caching the exact error status and message as well would make the implementation a bit more complex and use a bit more memory. @nlevitt, if you think it's necessary, I can do it.

vbanos commented 5 years ago

I have improved caching to keep the status code and message. I have also tightened the use of locks everywhere.
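As a rough illustration of what keeping the status code and message could look like, here is a minimal sketch. The names `CachedError`, `BadTargetErrorCache`, `remember` and `lookup`, and the `maxsize` value, are hypothetical and chosen for this example; the actual change in the PR may be structured differently.

```python
import threading
import time
from collections import namedtuple

from cachetools import TTLCache  # third-party: pip install cachetools

# Hypothetical record type holding what the proxy would send back.
CachedError = namedtuple("CachedError", ["status", "message"])


class BadTargetErrorCache:
    """Hypothetical helper; not warcprox's actual implementation."""

    def __init__(self, ttl=60, maxsize=1024, timer=time.monotonic):
        self._cache = TTLCache(maxsize=maxsize, ttl=ttl, timer=timer)
        # A single lock guards both reads and writes of the cache.
        self._lock = threading.Lock()

    def remember(self, hostname, port, status, message):
        """Record the exact error seen for hostname:port."""
        key = "%s:%s" % (hostname, port)
        with self._lock:
            self._cache[key] = CachedError(status, message)

    def lookup(self, hostname, port):
        """Return the cached CachedError, or None if absent or expired."""
        key = "%s:%s" % (hostname, port)
        with self._lock:
            return self._cache.get(key)
```

With something like this, a request handler could call `lookup()` before connecting and, on a hit, reply with the stored status and message (for example 500, Name or service not known) instead of a generic 502.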

nlevitt commented 5 years ago

Thanks @vbanos!