internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Cache digest_str result in memory to improve performance #127

Closed vbanos closed 5 years ago

vbanos commented 5 years ago

We call warcprox.digest_str in two places while handling a single HTTP request: 1) in all warcprox.dedup methods, to build the key used to look up duplicates, and 2) in warcprox.warc, to produce the WARC record.

We use lru_cache to avoid recalculating it.

The cached result is also reused when the same URL is requested again.
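To illustrate the idea, here is a minimal sketch of memoizing a digest-string helper with `functools.lru_cache`. The body of `digest_str` below is an assumption made for illustration (algorithm name plus hex digest), not a copy of the warcprox source, and `cached_digest_str` and the `maxsize` value are hypothetical.

```python
import hashlib
from functools import lru_cache


def digest_str(hash_obj):
    # Assumed output shape for this sketch: b"<algorithm>:<hex digest>"
    return hash_obj.name.encode('utf-8') + b':' + hash_obj.hexdigest().encode('ascii')


@lru_cache(maxsize=1024)  # hypothetical cache size
def cached_digest_str(hash_obj):
    # lru_cache keys on the argument. hashlib objects hash by identity,
    # so a cache hit requires passing the very same object again.
    return digest_str(hash_obj)


payload_digest = hashlib.sha1(b'example payload')

dedup_key = cached_digest_str(payload_digest)   # first call: computed (cache miss)
warc_value = cached_digest_str(payload_digest)  # second call: served from the cache

print(cached_digest_str.cache_info())
```

In this sketch the cache only helps when the same hash object is passed twice, as in the dedup lookup followed by the WARC write. If the hash object were updated after the first call, the cached value would be stale, so the wrapper is only safe once the digest is final.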

nlevitt commented 5 years ago

I like the simplicity of the change, but really this function is a very cheap calculation. On my laptop it takes on the order of 1 microsecond. The cache management incurred by using lru_cache could easily outweigh the improvement.

>>> timeit.timeit('''hash.hexdigest().encode('ascii')''', number=1000000, globals=globals())
0.7279171659999975
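For anyone who wants to repeat the comparison, the following is a rough, self-contained sketch that times the plain call against an `lru_cache`-wrapped one. The SHA-1 object, the payload, the helper names and the iteration count are all assumptions chosen for illustration; absolute numbers will vary by machine and Python version.

```python
import hashlib
import timeit
from functools import lru_cache

# Assumed stand-in for the hash object used in the measurement above.
hash_obj = hashlib.sha1(b'example payload')


def uncached():
    # The bare computation being measured.
    return hash_obj.hexdigest().encode('ascii')


@lru_cache(maxsize=1024)
def _cached(h):
    return h.hexdigest().encode('ascii')


def cached():
    # Same computation, but looked up through lru_cache after the first call.
    return _cached(hash_obj)


n = 100_000  # fewer iterations than the original run, to keep the sketch quick
t_uncached = timeit.timeit(uncached, number=n)
t_cached = timeit.timeit(cached, number=n)

print(f"uncached: {t_uncached / n * 1e6:.3f} us per call")
print(f"cached:   {t_cached / n * 1e6:.3f} us per call")
```

Both timings include the overhead of one extra Python function call, so the difference between them is more meaningful than either number alone.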

vbanos commented 5 years ago

I didn't time that function. I just presumed caching would benefit us. Thank you for taking the time to check this.