gurgeous / httpdisk

MIT License

IDEA: alternative backends #7

Open rickychilcott opened 3 years ago

rickychilcott commented 3 years ago

I was reading about Redis Streams the other day and it got me thinking about the power of Redis in general. In particular, I thought about an army of Sidekiq workers crawling some sites for weeks or months, sharing caches via Redis.

Would you be interested in an extracted interface for cache backends? Something similar to (or the same as) https://api.rubyonrails.org/classes/ActiveSupport/Cache/Store.html?

gurgeous commented 3 years ago

I think a storage interface could be useful. Redis would work nicely for small datasets. For large crawls I've had good luck using sinew and the parallel gem to speed things along. My most recent crawl produced a 30 GB cache :)

One wrinkle is cache expiration. With Redis and Memcache, you set up cache expiration as each key is written, as in set(key, value, 86400). With httpdisk, on the other hand, cache entries can be expired at any time. For example, you might decide to recrawl and discard pages that are more than an hour old. Or maybe three days old.

httpdisk uses File.mtime to figure out if a value should be discarded. With other cache stores you'd have to store the key creation time to achieve the same functionality.
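
To make that concrete, here is a rough sketch of what storing the creation time could look like. This is not httpdisk code: `TimestampedStore`, its `write`/`read` methods, and the `max_age:` and `clock:` parameters are all hypothetical names made up for illustration, and a plain Hash stands in for Redis or Memcache. The idea is to keep a timestamp next to each value and decide at read time whether the entry is too old, much like comparing File.mtime against a cutoff.

```ruby
# Hypothetical sketch, not part of httpdisk: a store that records when each
# key was written so the expiration policy can be chosen at read time.
class TimestampedStore
  # Each entry keeps the cached value plus the time it was written.
  Entry = Struct.new(:value, :created_at)

  # backend: anything Hash-like (responds to [] and []=). A plain Hash is an
  # assumption standing in for a real Redis or Memcache client.
  # clock: injectable time source, handy for testing.
  def initialize(backend = {}, clock: -> { Time.now })
    @backend = backend
    @clock = clock
  end

  # Store the value along with its creation time.
  def write(key, value)
    @backend[key] = Entry.new(value, @clock.call)
    value
  end

  # Return the value, or nil if the key is missing or the entry is older
  # than max_age seconds. With max_age: nil, entries never expire.
  def read(key, max_age: nil)
    entry = @backend[key]
    return nil if entry.nil?
    return nil if max_age && (@clock.call - entry.created_at) > max_age

    entry.value
  end
end

store = TimestampedStore.new
store.write("https://example.com/", "<html>...</html>")
store.read("https://example.com/", max_age: 3600)       # only accept pages under an hour old
store.read("https://example.com/", max_age: 3 * 86_400) # or under three days old
```

With a real Redis backend the value and timestamp would need to be serialized together (or kept under two keys), and stale entries would pile up unless something eventually deletes them. The sketch skips both.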