skx / rss2email

Convert RSS feeds to emails
GNU General Public License v2.0

No longer recognizing reddit RSS feeds #83

Open duckunix opened 2 years ago

duckunix commented 2 years ago

This started in the last day or two for me. I am using the latest release, release-2.4. When trying to run, I get:

error processing https://www.reddit.com/r/swaywm/.rss - error parsing https://www.reddit.com/r/swaywm/.rss contents: Failed to detect feed type
error processing https://www.reddit.com/r/OPNsenseFirewall/.rss - error parsing https://www.reddit.com/r/OPNsenseFirewall/.rss contents: Failed to detect feed type

And so on for all my reddit entries.

Any thoughts?

skx commented 2 years ago

I remember when I first put the application together that Reddit didn't like it unless I set up a custom user-agent. They're a bit strict about blocking access.

So there's an obvious suspicion that the feed-request is just getting blocked/filtered/broken at their side. Can you successfully download the feed(s) with curl?

If it's broken for everything then it's clearly a problem at their end. If you can download via curl, but not via the app, then it might be something I can fix.

For what it's worth my own feed (of "private inbox" messages) continues to work so it might not necessarily be something that is globally broken.

duckunix commented 2 years ago

Oddly, wget works just fine, but when I use curl, I get:


<!doctype html>
<html>
  <head>
    <title>Too Many Requests</title>
    <style>
      body {
          font: small verdana, arial, helvetica, sans-serif;
          width: 600px;
          margin: 0 auto;
      }

      h1 {
          height: 40px;
          background: transparent url(//www.redditstatic.com/reddit.com.header.png) no-repeat scroll top right;
      }
    </style>
  </head>
  <body>
    <h1>whoa there, pardner!</h1>

<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>

<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>

<p>please wait 8 second(s) and try again.</p>

    <p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
  </body>
</html>

So, is there some way for me to put a sleep before/after a call to reddit?

Thanks!

duckunix commented 2 years ago

BTW:

grep -c reddit.com ~/.rss2email/feeds.txt
19
skx commented 2 years ago

Oddly, wget works just fine,

Then I'd probably suggest they're using the User-Agent header to differentiate the two requests. You might try changing your local agent. Something like this in your feed-list:

https://reddit.com/....
  - user-agent: my-safe-bot/1.0
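To show how an override like that would be applied in practice, here's a minimal Go sketch. This is my own illustration, not rss2email's actual fetch code; the `buildRequest` helper and the `my-safe-bot/1.0` agent string are made up for the example:

```go
package main

import (
	"fmt"
	"net/http"
)

// buildRequest builds a GET request for a feed URL and applies a
// per-feed User-Agent override when one is configured. Hypothetical
// helper, not the project's real code.
func buildRequest(feedURL, agent string) (*http.Request, error) {
	req, err := http.NewRequest("GET", feedURL, nil)
	if err != nil {
		return nil, err
	}
	if agent != "" {
		req.Header.Set("User-Agent", agent)
	}
	return req, nil
}

func main() {
	req, err := buildRequest("https://www.reddit.com/r/swaywm/.rss", "my-safe-bot/1.0")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("User-Agent")) // prints "my-safe-bot/1.0"
}
```

Building the request doesn't touch the network, so you can check which header would be sent without actually hitting reddit.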

As for sleeping between feed-requests? I'm afraid not, though it does seem like something that could be added. I could add:

http://example.com/foo
  - sleep: 10
http://example.net/blah.rss
  - sleep: 20

That would give a ten second sleep before fetching the first feed, and a twenty-second delay before the second.

Added that in #84 - along with a simple heuristic that adds a delay automatically if the feed being fetched is from the same hostname as the previous request. So assuming your feed contains:

reddit...
reddit...
reddit..
example.com...
example.com..

you won't need to make any config-file changes, it'll delay automatically.
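The same-host heuristic above can be sketched roughly like this in Go. This is my own illustration of the idea, not the code from #84; the `sameHost` name is made up:

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether two feed URLs share a hostname; the fetch
// loop can then insert a delay before fetching the second of two
// consecutive feeds from the same server.
func sameHost(prev, cur string) bool {
	p, errP := url.Parse(prev)
	c, errC := url.Parse(cur)
	if errP != nil || errC != nil {
		return false
	}
	return p.Hostname() == c.Hostname()
}

func main() {
	fmt.Println(sameHost(
		"https://www.reddit.com/r/swaywm/.rss",
		"https://www.reddit.com/r/OPNsenseFirewall/.rss")) // true
	fmt.Println(sameHost(
		"https://www.reddit.com/r/swaywm/.rss",
		"https://example.com/blah.rss")) // false
}
```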

duckunix commented 2 years ago

So, using version 2.5, it is still not working for me on reddit. :( This is my test feeds.txt:

https://www.reddit.com/r/swaywm/.rss
 - template:reddit.tmpl
 - user-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1
https://www.reddit.com/r/OPNsenseFirewall/.rss
 - template:reddit.tmpl
 - user-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1

Which hopefully would be working with the delay and the user-agent string, but still no joy:

time rss2email cron -verbose <email@rededicated>
Fetching feed: https://www.reddit.com/r/swaywm/.rss

Fetching from same host as previous feed, www.reddit.com, adding 5s delay
Fetching feed: https://www.reddit.com/r/OPNsenseFirewall/.rss

Skipping the prune-step because we saw errors processing our feed(s)

error processing https://www.reddit.com/r/swaywm/.rss - error parsing https://www.reddit.com/r/swaywm/.rss contents: Failed to detect feed type
error processing https://www.reddit.com/r/OPNsenseFirewall/.rss - error parsing https://www.reddit.com/r/OPNsenseFirewall/.rss contents: Failed to detect feed type

real    0m5.173s
user    0m0.038s
sys     0m0.018s

Any thoughts, or should I go look for something to build custom RSS feeds for my reddit entries?

Thanks, d

skx commented 2 years ago

I'm sorry to hear that the recent delay didn't help, nor the user-agent switch.

Using some other wrapper to fetch the feeds from reddit and re-present them as something you can then fetch locally should work - but I admit I'm not really sure what options are out there, or how likely they are to get blocked in the future either. (Feedburner?)

But for this project I'm not sure there are any more useful changes I can make - I could add our version number to the default user-agent, but nothing else comes to mind.