martinrotter / rssguard

Feed reader (and podcast player) which supports RSS/ATOM/JSON and many web-based feed services.
GNU General Public License v3.0

[FR]: Throttle queued retrieval of feeds #1154

Open God-damnit-all opened 10 months ago

God-damnit-all commented 10 months ago

Brief description of the feature request

I know RSS Guard painstakingly implemented multi-threaded feed retrieval earlier this year or maybe last year (time flies), but there are a couple of problems I've been having with it retrieving feeds as fast as possible.

  1. Many of the public hosts for RSS Bridge have recently been implementing "rate-limiting", and too many requests in a short period of time will cause them to start throwing 429 Too Many Requests errors
     - I'll also get IP-blocked for about half an hour, so a simple retry doesn't suffice
  2. Tons of simultaneous exe launches in the background from script-sourced feeds and post-processing scripts, no matter how minimal their task, will cause my computer to lag if it's in the middle of something demanding (games in particular)
     - I pre-compiled my Python scripts into .pyc files and I have RSS Guard invoke those instead of the .py files, but finishing faster also means the next one starts up faster, so it didn't help much

I do have some ideas on how this could be remedied. If you pursue this, I hope they'll be of some help.

Possible Short-term Solutions

  1. New Feeds & articles setting for how many feeds to retrieve simultaneously
     - I'll refer to the number of simultaneous feed retrievals specified here as "queue slots"
  2. New Feeds & articles setting which keeps the "queue slot" closed for the specified number of milliseconds after any feed is fetched
     - Sub-option to randomize this delay each time it's triggered by a specified percentage, i.e. setting this to 15% means that 1000ms would be randomized to a value between 850ms and 1150ms
       - This would help "stagger" the overall workload
  3. New Feeds & articles setting to add a delay if a script was invoked during feed retrieval (either as a source or for post-processing) – the delay from the previous setting, if set, still applies
     - Sub-option to randomize this delay each time it's triggered by a specified percentage, just like the previous setting
  4. New individual feed setting to set a delay after the feed is fetched – this makes it ignore any delay set by the two previous settings
     - Sub-option to randomize this delay each time it's triggered by a specified percentage, just like the two previous settings
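
To make the proposal above concrete, here is a rough Python sketch of "queue slots" with a randomized post-fetch delay. It is only an illustration of the idea, not how RSS Guard works internally: `fetch_all`, `jittered_delay_ms`, and all parameter names are hypothetical, and `fetch` stands in for whatever actually retrieves a feed.

```python
import random
import threading
import time


def jittered_delay_ms(base_ms, jitter_percent):
    """Randomize base_ms by +/- jitter_percent (15% of 1000 ms gives 850..1150 ms)."""
    spread = base_ms * jitter_percent / 100.0
    return random.uniform(base_ms - spread, base_ms + spread)


def fetch_all(feeds, fetch, slots=2, delay_ms=1000.0, jitter_percent=15.0,
              per_feed_delay_ms=None):
    """Fetch feeds with at most `slots` retrievals running at once.

    After each fetch the slot stays closed for a jittered delay. A value in
    per_feed_delay_ms (hypothetical per-feed setting) overrides the global delay.
    """
    per_feed_delay_ms = per_feed_delay_ms or {}
    gate = threading.Semaphore(slots)  # the "queue slots"
    results = {}
    results_lock = threading.Lock()

    def worker(feed):
        with gate:
            data = fetch(feed)
            with results_lock:
                results[feed] = data
            # Keep the slot closed so the next retrieval cannot start right away.
            base = per_feed_delay_ms.get(feed, delay_ms)
            time.sleep(jittered_delay_ms(base, jitter_percent) / 1000.0)

    threads = [threading.Thread(target=worker, args=(feed,)) for feed in feeds]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The delay is held inside the slot on purpose: a slot that finishes quickly still waits before the next feed can take it, which is the "finishing faster means the next one starts faster" problem described earlier.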

Possible Long-term Solutions

  1. Intelligently order the queue to evenly mix in feeds that do and don't involve scripts
  2. Hostname-specific configuration to prevent too many requests in too short a period of time
     - This could also involve the intelligent queue order so it tries to space them apart
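
As a rough illustration of the first idea, the sketch below spreads script-based feeds evenly through the rest of the queue. The function name and the two input lists are hypothetical, and it only decides the order in which feeds are queued; it says nothing about when worker threads actually pick them up.

```python
def interleave_evenly(script_feeds, plain_feeds):
    """Merge two lists so each group is spread evenly across the combined queue."""
    keyed = []
    # Give every feed a fractional position in (0, 1) within its own group.
    for i, feed in enumerate(script_feeds):
        keyed.append(((i + 0.5) / len(script_feeds), 0, feed))
    for i, feed in enumerate(plain_feeds):
        keyed.append(((i + 0.5) / len(plain_feeds), 1, feed))
    # Sorting by that position mixes the groups while keeping each group's own order.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [feed for _, _, feed in keyed]
```

With two script feeds and four plain feeds, the script feeds end up in the second and fifth positions rather than back to back.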

martinrotter commented 10 months ago

  1. Can't you just make the feed auto-fetch interval bigger?
  2. I see. Yes, naturally, if you use a script as the source, then its interpreter will launch each time.

Short-term

  1. This option is already there. Run "rssguard.exe --help" in console to see details. The option is called "--threads". You can set it to 1 to have fully sequential fetching. For your use-case, I feel like setting it to 2 might be "best".
  2. How exactly should this work/behave? Elaborate.

Long-term

  1. Not easy to do since there are quite a number of steps involved in "fetching" each feed - downloading of remote content, article filtering, scraping, saving to DB etc. And even if some ordering were there, the rather "random" nature of threading in this case would still hamper this quite a bit, I think.
  2. Well, it would rather make sense to recognize HTTP/429 and apply a "retry-after" cooldown to the individual feed. Note that this situation might now be GREATLY avoided because I added a new feature to RSS Guard (4.5.2) called conditional HTTP requests, which is automatically used on RSS/ATOM/JSON accounts and uses the "ETag" HTTP header. Now RSS Guard will actually download RSS content only if there has been some change since last time. Therefore, a well-configured server which provides some feed might avoid sending 429 if this new approach is used, and the whole workflow will also be faster. Not sure if rss.hub/bridge implements "ETag" but many RSS/ATOM sources I tested do.
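
For anyone following along, the two mechanisms mentioned above can be sketched roughly like this in Python. This is not RSS Guard's code: the function names and the `feed_state` dictionary are hypothetical, and the 1800-second fallback is just an assumption taken from the "about half an hour" block reported earlier in this thread. The HTTP parts are standard: a conditional GET sends the cached ETag in "If-None-Match" and may get 304 Not Modified back, and "Retry-After" carries either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_headers(cached_etag):
    """Request headers for a conditional GET; empty if no ETag is cached yet."""
    return {"If-None-Match": cached_etag} if cached_etag else {}


def retry_after_seconds(header_value, now=None):
    """Parse Retry-After (delta-seconds or HTTP-date); None if missing or unparseable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def handle_response(status, headers, feed_state, now=None):
    """Update hypothetical per-feed state from one response and report what happened."""
    if status == 304:
        return "unchanged"  # nothing new, no body was downloaded
    if status == 429:
        wait = retry_after_seconds(headers.get("Retry-After"), now)
        # Assumed fallback when the server gives no usable Retry-After value.
        feed_state["cooldown_s"] = wait if wait is not None else 1800.0
        return "cooldown"
    if status == 200:
        feed_state["etag"] = headers.get("ETag")
        return "updated"
    return "error"
```

The cooldown is stored per feed here, matching the suggestion to apply it to the individual feed rather than to the whole fetch run.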

atomGit commented 8 months ago

just wanted to add that i'm seeing this problem as well (v4.6.3 minimal on Linux) - in my case with feeds from bitchute.com, but this could happen with any server that decides to throttle requests and i would submit that it is courteous to have a delay of at least 1 second when querying a domain consecutively

in the case of bitchute, it seems that the first one, or perhaps the first few feeds are retrieved, then all the rest fail (and i have many of them, so i've been missing a lot of news because i've not been expanding the folders to look at the status of the feeds)

example : https://www.bitchute.com/feeds/rss/channel/oldskoolhunter/

i thought this would be a problem when i first started using RSSG - this same problem was experienced with another reader and the solution was to add a delay when fetching feeds from the same domain

i had a similar issue with a broken link checker script and my solution was to a) randomize the links in an effort to avoid consecutive queries to a domain and b) to keep a rolling list (associative array) of 'n' recently queried domains along with a timestamp in seconds and add a variable delay, based on the time stamp, if the same domain was queried within 'x' time

with bitchute, it seems a 1-2 second delay is sufficient
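
for reference, the approach from the link checker described above (shuffle the links, keep a rolling list of recently queried domains with timestamps, delay if the same domain comes up again too soon) could look roughly like this in Python. this is a sketch and not the original script: the class, its parameters and the 1.5 second default are made up for illustration, the default just sits inside the 1-2 second range mentioned above.

```python
import random
import time
from collections import OrderedDict
from urllib.parse import urlsplit


class DomainThrottle:
    """Remember when each of the last max_domains domains was queried."""

    def __init__(self, min_gap_s=1.5, max_domains=50,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_gap_s = min_gap_s
        self.max_domains = max_domains
        self._clock = clock
        self._sleep = sleep
        self._last_seen = OrderedDict()  # domain -> time of last query

    def wait_time(self, url):
        """Seconds still to wait before this URL's domain may be queried again."""
        last = self._last_seen.get(urlsplit(url).hostname or "")
        if last is None:
            return 0.0
        return max(0.0, self.min_gap_s - (self._clock() - last))

    def acquire(self, url):
        """Sleep if the domain was queried too recently, then record this query."""
        delay = self.wait_time(url)
        if delay > 0:
            self._sleep(delay)
        domain = urlsplit(url).hostname or ""
        self._last_seen[domain] = self._clock()
        self._last_seen.move_to_end(domain)
        while len(self._last_seen) > self.max_domains:
            self._last_seen.popitem(last=False)  # forget the oldest domain
        return delay


def fetch_politely(urls, fetch, throttle):
    """Shuffle a copy of urls to break up same-domain runs, then fetch with delays."""
    order = list(urls)
    random.shuffle(order)
    return {url: (throttle.acquire(url), fetch(url))[1] for url in order}
```

shuffling only reduces the chance of consecutive hits on one domain, so the timestamp check is what actually guarantees the gap.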