Open · anewuser opened this issue 1 year ago
Hm, I like the idea. We would need another caching layer (the third one), specific to a `Download` block (so not per URL, as in https://github.com/pipes-digital/pipes/blob/master/downloader.rb), and then a way to invalidate that cache. Or indeed a per-pipe option, changing the cache logic in https://github.com/pipes-digital/pipes/blob/ea379d5b613da2fc8906ba17a2295d23c7b3890e/pipe.rb#L85.
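A per-block cache with TTL-based invalidation could look roughly like this (a sketch only; `BlockCache`, `block_id`, and the TTL semantics are illustrative assumptions, not actual Pipes code):

```ruby
# Illustrative sketch of a per-block download cache with TTL-based
# invalidation. `block_id` and the TTL handling are assumptions for
# this example, not part of the actual Pipes codebase.
class BlockCache
  Entry = Struct.new(:value, :stored_at)

  def initialize(ttl)
    @ttl = ttl
    @entries = {}
  end

  # Returns the cached value for this block, or recomputes it via the
  # given block if the entry is missing or older than the TTL.
  def fetch(block_id)
    entry = @entries[block_id]
    if entry.nil? || Time.now - entry.stored_at > @ttl
      entry = Entry.new(yield, Time.now)
      @entries[block_id] = entry
    end
    entry.value
  end

  # Explicit invalidation, e.g. when the pipe is edited.
  def invalidate(block_id)
    @entries.delete(block_id)
  end
end

cache = BlockCache.new(3600)
calls = 0
first  = cache.fetch('download-1') { calls += 1; 'page body' }
second = cache.fetch('download-1') { calls += 1; 'page body' }
# The second fetch is served from the cache, so calls stays at 1.
```

Keying by the block's identity instead of the URL would make the per-pipe invalidation mentioned above straightforward: editing a pipe could simply call `invalidate` for its blocks.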
Something else related to this: when a pipe is configured to download two or more URLs from the same domain in a row (as in a `ForEach` + `Download` block), it seems that all connections are made as fast as possible, which can automatically trigger DoS protections. It'd be interesting to add a timer between each connection to avoid that. The pipe requests would look even more organic if the timer changed randomly, as in Tab Reloader's random-interval option.
It's not supposed to be as fast as possible. The downloads are put into a ThrottleQueue, divided by domain. See https://github.com/pipes-digital/pipes/blob/master/downloader.rb#L21-L25:
```ruby
@@limiters[url.host] = ThrottleQueue.new 0.4 if ! @@limiters[url.host]
result = ""
@@limiters[url.host].foreground(rand) {
  result = _get(url, js)
}
```
If that leads to "as fast as possible" in a `ForEach`, that would be a bug :/
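For illustration, the per-host throttling described above can be approximated with a minimal limiter (a sketch under assumed semantics; Pipes itself uses the ThrottleQueue gem, whose internals may differ): each host is allowed one request per interval, and callers block until their slot comes up.

```ruby
# Minimal per-host rate limiter sketch (illustrative only; not the
# ThrottleQueue gem's actual implementation). Each host gets at most
# one request per `interval` seconds; callers sleep until their turn.
require 'monitor'

class HostLimiter
  def initialize(interval)
    @interval = interval
    @last_run = {}       # host => monotonic time its last slot was granted
    @lock = Monitor.new
  end

  # Runs the block, sleeping first if the host was hit too recently.
  def throttle(host)
    wait = 0
    @lock.synchronize do
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      earliest = (@last_run[host] || 0) + @interval
      wait = [earliest - now, 0].max
      @last_run[host] = now + wait
    end
    sleep(wait) if wait > 0
    yield
  end
end

limiter = HostLimiter.new(0.4)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
3.times { limiter.throttle('example.com') { } }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
# Three back-to-back requests to one host take at least ~0.8 s in total.
```

With a scheme like this, a `ForEach` + `Download` over the same domain should naturally space its requests out, which is what the `ThrottleQueue.new 0.4` above is meant to guarantee.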
This is one of the blogs that stopped my pipe with captchas when I was trying different combinations of Pipes blocks: https://pastebin.mozilla.org/kN5HgRk6
The problematic pipe was downloading the latest 3 or 4 posts. As I kept making changes and previewing them, the blog rightfully detected the pipe as a bot.
Another note on caching everything more aggressively: this doesn't need to be done with pipes that only connect to feedburner.com or youtube.com, since Google is unlikely to limit Pipes or suffer because of it. I've also started creating FeedBurner proxies for all of my feeds that go through `Download` blocks as a way to lower the number of Pipes requests to their domains.
This site, on the other hand, can barely handle its human visitors, so the less often Pipes downloads its front page with the pipe I have for it, the better.

When pipes have `Download` blocks or are too slow to process, consider caching the output feed for a (much) longer time. You can also add these tags for them to suggest feed readers not to update them too often. I'd be fine with it if my feeds that fit this description were automatically cached for three days or even longer to save everyone's bandwidth and server resources.
This could also be added as an option for us to manually mark pipes that don't need to be updated for a long time. I have some pipes with `Download` blocks that only really need to be checked once a month.
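Such an option could boil down to picking a refresh interval per pipe, along these lines (all names and intervals here are hypothetical, not actual Pipes behavior):

```ruby
# Hypothetical sketch: choose a cache/refresh interval per pipe.
# Manually marked slow-moving pipes get ~a month, pipes with Download
# blocks get three days, everything else keeps a default.
DEFAULT_TTL  = 60 * 60            # 1 hour
DOWNLOAD_TTL = 3 * 24 * 60 * 60   # 3 days
MANUAL_TTL   = 30 * 24 * 60 * 60  # ~1 month

def refresh_interval(pipe)
  return MANUAL_TTL if pipe[:manual_slow]
  return DOWNLOAD_TTL if pipe[:has_download_block]
  DEFAULT_TTL
end

interval = refresh_interval({ manual_slow: false, has_download_block: true })
# A pipe with a Download block gets the three-day interval.
```

The manual flag would cover the once-a-month pipes mentioned above, while the `Download`-block default matches the three-day suggestion.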