pipes-digital / pipes

Repository for Pipes
https://pipes.digital
GNU Affero General Public License v3.0

Cache expiration #99

Open · anewuser opened this issue 1 year ago

anewuser commented 1 year ago

When pipes have download blocks or are too slow to process, consider caching the output feed for a (much) longer time. You could also add these tags to the output to suggest that feed readers not update it too often.
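
Assuming the tags meant here are the RSS syndication module hints plus ttl, the output feed could declare something like this (illustrative fragment, not existing Pipes output):

<rss version="2.0" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel>
    <!-- suggest to feed readers: refresh at most once per day -->
    <sy:updatePeriod>daily</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <!-- ttl is in minutes; 4320 minutes = 3 days -->
    <ttl>4320</ttl>
  </channel>
</rss>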

I'd be fine with it if feeds of mine that fit this description were automatically cached for three days or even longer, to save everyone's bandwidth and server resources.

This could also be offered as an option, letting us manually mark pipes that don't need to be updated for a long time. I have some pipes with download blocks that really only need to be checked once a month.

onli commented 1 year ago

Hm, I like the idea. We would need another caching layer (the third), for a specific download block (so not per URL, as in https://github.com/pipes-digital/pipes/blob/master/downloader.rb). And then a way to invalidate that cache.
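
Roughly like this, as a sketch (BlockCache and block_id are just illustrative names, nothing that exists yet):

class BlockCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # Return the cached output for this block, or run the block's
  # download and cache the result for a long time (default: three days).
  def fetch(block_id, ttl: 3 * 24 * 60 * 60)
    entry = @store[block_id]
    return entry.value if entry && entry.expires_at > Time.now
    value = yield
    @store[block_id] = Entry.new(value, Time.now + ttl)
    value
  end

  # The invalidation hook, e.g. for when the pipe is edited.
  def invalidate(block_id)
    @store.delete(block_id)
  end
end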

Or indeed as a per-pipe option, changing the cache logic in https://github.com/pipes-digital/pipes/blob/ea379d5b613da2fc8906ba17a2295d23c7b3890e/pipe.rb#L85.
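
The per-pipe variant could be as small as consulting a user-set interval in that cache check; a sketch (refresh_interval and DEFAULT_TTL are assumed here, not existing code):

DEFAULT_TTL = 600  # stand-in for whatever the current cache duration is

def cache_fresh?(pipe, cached_at)
  # fall back to the normal cache duration if the pipe has no custom interval
  ttl = pipe.refresh_interval || DEFAULT_TTL
  Time.now - cached_at < ttl
end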

anewuser commented 1 year ago

Something else related to this: when a pipe is configured to download two or more URLs from the same domain in a row (as in a ForEach + Download block), it seems that all connections are made as fast as possible, which can automatically trigger DoS protections. It would be useful to add a delay between connections to avoid that. The requests would look even more organic if the delay varied randomly, as in this Tab Reloader option:

[screenshot: Tab Reloader's randomized reload-interval option]
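
Purely as a sketch of what I mean (download, handle and the delay values are made up):

urls.each do |url|
  result = download(url)  # stand-in for the actual download call
  handle(result)          # stand-in for further processing
  # always wait at least 2 seconds, plus up to 3 random extra seconds,
  # so consecutive requests to the same host don't arrive in a burst
  sleep(2.0 + rand * 3.0)
end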

onli commented 1 year ago

It's not supposed to be as fast as possible. The downloads are put into a ThrottleQueue, divided by domain. See https://github.com/pipes-digital/pipes/blob/master/downloader.rb#L21-L25:

# one ThrottleQueue per host, so requests to the same domain are rate limited
@@limiters[url.host] = ThrottleQueue.new 0.4 if ! @@limiters[url.host]
result = ""
# the download itself runs through the host's queue; rand just supplies a job id
@@limiters[url.host].foreground(rand) {
    result = _get(url, js)
}

If that still results in requests being made as fast as possible inside a ForEach, that would be a bug :/

anewuser commented 1 year ago

This is one of the blogs that started blocking my pipe with captchas while I was trying different combinations of Pipes blocks: https://pastebin.mozilla.org/kN5HgRk6

The problematic pipe was downloading the latest 3 or 4 posts. As I kept making changes and previewing them, the blog rightfully detected the pipe as a bot.

Another note on caching everything more aggressively: this doesn't need to be done for pipes that only connect to feedburner.com or youtube.com, since Google is unlikely to rate limit Pipes or suffer because of it. I've also started creating FeedBurner proxies for all of my feeds that go through download blocks, as a way to reduce the number of Pipes requests to their domains.

This site, on the other hand, can barely handle its human visitors, so the less often Pipes downloads its front page with the pipe I have for it, the better:

[screenshot: error page from the site]