martinrotter / rssguard

Feed reader (and podcast player) which supports RSS/ATOM/JSON and many web-based feed services.
GNU General Public License v3.0

[FR]: Make feed retrieval use same cookies and useragent as the internal browser #556

Closed: God-damnit-all closed this issue 2 years ago

God-damnit-all commented 2 years ago

Brief description of the feature request

For similar reasons to #555, I'd like feed retrieval to use the same cookies and user agent as the internal browser. This would make it so that, even if Cloudflare becomes an obstacle, a page could be opened in the internal browser and then the user could just fetch the feed again.

martinrotter commented 2 years ago

Makes sense, but of course only cookies for the URL should get applied, not all cookies, right? So, for example, if your cache has cookies for "www.abc.com/*", then only those cookies will be used when you load the feed "www.abc.com/feed.xml", right?

God-damnit-all commented 2 years ago

> Makes sense, but of course only cookies for the URL should get applied, not all cookies, right? So, for example, if your cache has cookies for "www.abc.com/*", then only those cookies will be used when you load the feed "www.abc.com/feed.xml", right?

Of course, only applicable cookies.
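The per-URL scoping agreed on above is exactly what standard cookie matching (RFC 6265 domain/path rules) gives you. As a hedged illustration only — not RSS Guard's actual code — Python's standard-library cookie jar shows the behavior: a cookie stored for one host is attached to a feed request for that host, while cookies for unrelated hosts are left out.

```python
# Sketch: only cookies applicable to the request URL get sent.
# This mirrors standard cookie scoping (RFC 6265), not RSS Guard's code.
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request

def make_cookie(name, value, domain, path="/"):
    """Build a minimal version-0 cookie for the jar (illustrative helper)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={}, rfc2109=False,
    )

jar = CookieJar()
jar.set_cookie(make_cookie("session", "abc", "www.abc.com"))    # matches the feed host
jar.set_cookie(make_cookie("other", "xyz", "www.example.org"))  # unrelated host

req = Request("http://www.abc.com/feed.xml")
jar.add_cookie_header(req)        # attaches only the applicable cookies
print(req.get_header("Cookie"))   # -> "session=abc"
```

The unrelated cookie never reaches the request, which is the behavior both sides of the thread expect from the feature.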

martinrotter commented 2 years ago

Do you perhaps have any "feed" which could be tested, one that only works with "cookies" etc.?

God-damnit-all commented 2 years ago

> Do you perhaps have any "feed" which could be tested, one that only works with "cookies" etc.?

Not for any site that is currently accepting new registrations. But why not do something simple, like setting a repo to private and then trying to get it to return something other than the 404 page you see when you're not logged in?

martinrotter commented 2 years ago

That could be a viable solution!

martinrotter commented 2 years ago

Well, no: https://stackoverflow.com/a/17419321

Private repos do not seem to provide feeds that are accessible with cookies.

martinrotter commented 2 years ago

But I guess I will find another way.

God-damnit-all commented 2 years ago

> But I guess I will find another way.

Does it have to be a feed? If you'd like, I can just make a post-processor script that'll spit out a page's source as the first JSON Feed entry.

God-damnit-all commented 2 years ago

Here, use this post-processor script.

The post-process script line is:

```
powershell -NoP -c $Input | &(rvpa([string]{PATHgoesHEREwithoutQUOTES}))
```

Or, if you're on Linux/macOS, change `powershell` to `pwsh` and make sure PowerShell 7 is installed.

It'll spit out a code block of the page, followed by the page itself. If necessary, edit the `content_html=$code+$raw` line: `$code` is the HTML-escaped page source and `$raw` is the normal page content.
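For readers who don't run PowerShell, the idea can be sketched in Python. This is a rough analogue of what the script described above does — not the author's actual script — wrapping a fetched page's source into a one-item JSON Feed whose `content_html` is the escaped code block followed by the raw page; the function name and placeholder URL are illustrative.

```python
# Rough Python analogue of the PowerShell post-processor (illustrative,
# not the author's script): read the fetched page on stdin and emit a
# JSON Feed whose single entry carries the page source.
import html
import json
import sys

def page_to_jsonfeed(raw: str, url: str = "http://example.invalid/") -> str:
    # Escaped source first, wrapped as a code block...
    code = "<pre><code>" + html.escape(raw) + "</code></pre>"
    feed = {
        "version": "https://jsonfeed.org/version/1.1",
        "title": "Post-processed page",
        "items": [{
            "id": url,
            "url": url,
            # ...then the raw page appended, matching content_html=$code+$raw
            "content_html": code + raw,
        }],
    }
    return json.dumps(feed)

if __name__ == "__main__":
    sys.stdout.write(page_to_jsonfeed(sys.stdin.read()))
```

Pointing a feed reader at this output lets you inspect exactly what HTML the fetcher received, which is what makes it useful for debugging cookie handling.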

martinrotter commented 2 years ago

Wow, what a cute script. Thanks, I will use it to debug this ticket.

martinrotter commented 2 years ago

OK, I implemented it.

If a cookie is added or deleted in the web engine, the change is propagated to RSS Guard's internal cookie store, and vice versa.

I just tested it with your script on a website where I logged in and then fetched the feed, and it really showed the correct "logged-in" message. I guess this is done, then.
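The bidirectional propagation described above can be pictured as two stores mirroring each other's changes. The toy sketch below is a hypothetical design in Python — the real implementation is C++/Qt and wires the web engine's cookie-change notifications instead — showing why a no-change guard is needed to stop the two mirrors from echoing forever.

```python
# Toy sketch of two cookie stores kept in sync (hypothetical design;
# the real implementation uses Qt's web-engine cookie notifications).
class CookieStore:
    def __init__(self, name):
        self.name = name
        self.cookies = {}        # (domain, cookie name) -> value
        self.listeners = []      # callbacks notified on every add/delete

    def _notify(self, event, key, value):
        for callback in self.listeners:
            callback(event, key, value)

    def add(self, key, value):
        if self.cookies.get(key) == value:
            return               # no change; this guard breaks echo loops
        self.cookies[key] = value
        self._notify("add", key, value)

    def delete(self, key):
        if key in self.cookies:
            del self.cookies[key]
            self._notify("delete", key, None)

def link(a, b):
    """Propagate changes both ways; the no-change guard stops recursion."""
    def mirror(dst):
        def on_change(event, key, value):
            if event == "add":
                dst.add(key, value)
            else:
                dst.delete(key)
        return on_change
    a.listeners.append(mirror(b))
    b.listeners.append(mirror(a))

webengine = CookieStore("webengine")
fetcher = CookieStore("network-layer")
link(webengine, fetcher)
webengine.add(("www.abc.com", "cf_clearance"), "token123")
print(fetcher.cookies)  # the fetcher now sees the browser's cookie
```

Logging in via the internal browser adds the session cookie to the browser store, the mirror pushes it to the fetcher's store, and the next background fetch sends it — which is the behavior the test above confirmed.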

God-damnit-all commented 1 year ago

@martinrotter I've been having issues lately accessing a site's HTML data (see this post's edit history for the URL) via the background feed retrieval. They've been getting DDoS'd a lot recently. As of writing, they don't have their Cloudflare protection up, but when it is up:

  1. I go to their site using RSS Guard's internal browser.
  2. I click the "verify human" button.
     a. The site loads up.
  3. I close the tab.
  4. I open a new tab in the internal browser to the same site.
     a. The site loads without my having to click a "verify human" button.
  5. I go back to my feeds and tell RSS Guard to fetch the feed associated with the site.
     a. It runs into the Cloudflare protection.

This seems to suggest it's not using the internal browser's cookies after all, or else the User-Agent used by feed retrieval differs from the internal browser's, which would invalidate the cookies.

I know that if I use curl with the same user agent and a cookie file exported from my browser, my post-processing script works on the retrieved data just fine.
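That curl-style check — replaying the request with the browser's User-Agent plus an exported Netscape-format cookies.txt — can also be sketched with Python's standard library, which parses the same file format curl does. Everything here is illustrative: the cookie value, the UA string, and the file contents are made up for the demonstration.

```python
# Sketch of the curl-style check: build a request carrying the browser's
# User-Agent and the matching cookies from an exported Netscape-format
# cookies.txt. Cookie value and UA string are illustrative.
import tempfile
from http.cookiejar import MozillaCookieJar
from urllib.request import Request

# A minimal Netscape-format cookie file, as browser exporters produce.
COOKIES_TXT = """# Netscape HTTP Cookie File
www.abc.com\tFALSE\t/\tFALSE\t2147483647\tcf_clearance\ttoken123
"""

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(COOKIES_TXT)
    path = f.name

jar = MozillaCookieJar(path)
jar.load()                       # parse the exported cookies

req = Request("http://www.abc.com/feed.xml",
              headers={"User-Agent": "Mozilla/5.0 (same as the browser)"})
jar.add_cookie_header(req)       # attach the cookies that match the URL
print(req.get_header("Cookie"))  # -> "cf_clearance=token123"
```

If this replay gets past the protection while the built-in fetcher does not, the difference has to lie in the fetcher's cookies or User-Agent, which is the diagnosis above.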

This only happens when Cloudflare is in DDoS-protection mode on their site; all other times, my feed for it works just fine.

This has been an issue for quite a long time now, and I'm currently on the latest nightly. I've just been lazy about raising a fuss and was hoping it'd eventually get fixed on its own, but no such luck.