nextcloud / news

:newspaper: RSS/Atom feed reader
https://apps.nextcloud.com/apps/news
GNU Affero General Public License v3.0

Some RSS feeds block access based on user-agent #2049

Open lbdroid opened 1 year ago

lbdroid commented 1 year ago

Explain the Problem

Certain RSS feeds fail to load through Nextcloud News but work perfectly well from the console or a browser. For example, https://nationalpost.com/category/news/feed.xml returns 403 Forbidden in Nextcloud News but loads correctly with wget, Firefox, etc.

Steps to Reproduce

  1. Open Nextcloud News
  2. Click "+ Subscribe"
  3. Add feed URL such as https://nationalpost.com/category/news/feed.xml
  4. Press "Subscribe"

System Information

These details are not relevant since the specific issue has been detected, see below.

Issue details

I have determined that the feed in question is actually blocking access based on the User-Agent being used to make the query.

$ wget --user-agent="NextCloud-News/1.0" https://nationalpost.com/category/news/feed.xml
--2023-01-05 09:37:26--  https://nationalpost.com/category/news/feed.xml
Resolving nationalpost.com (nationalpost.com)... 34.111.249.109
Connecting to nationalpost.com (nationalpost.com)|34.111.249.109|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-01-05 09:37:26 ERROR 403: Forbidden.
$ wget https://nationalpost.com/category/news/feed.xml
--2023-01-05 09:39:03--  https://nationalpost.com/category/news/feed.xml
Resolving nationalpost.com (nationalpost.com)... 34.111.249.109
Connecting to nationalpost.com (nationalpost.com)|34.111.249.109|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8408 (8.2K) [application/rss+xml]
Saving to: ‘feed.xml’

feed.xml            100%[===================>]   8.21K  --.-KB/s    in 0.001s  

2023-01-05 09:39:03 (9.29 MB/s) - ‘feed.xml’ saved [8408/8408]

The only difference between the failure and the success is the User-Agent. Similarly, changing every occurrence of the string "NextCloud-News/1.0" in the source code to "NotCloud-News/1.0" solves the problem and allows the feed to be retrieved.
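
The comparison above can also be scripted. The following is a minimal Python sketch, not code from Nextcloud News. `fetch_status` and `compare_user_agents` are hypothetical helper names, and the `opener` parameter is an assumption added only so the check can be run against a stub instead of the live feed.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we compare.
        return err.code


def compare_user_agents(url, user_agents, opener=urllib.request.urlopen):
    """Map each User-Agent string to the status code it receives."""
    return {ua: fetch_status(url, ua, opener) for ua in user_agents}


# Example call (performs real network requests when uncommented):
# compare_user_agents(
#     "https://nationalpost.com/category/news/feed.xml",
#     ["NextCloud-News/1.0", "NotCloud-News/1.0"],
# )
```

If the feed still applies the same rule, the example call would be expected to show the same 403 versus 200 split as the wget runs.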

The rationale for the feed source to work in this manner is quite simple: 10,000 servers pulling the feed with the same user-agent may look like a DDoS attack.

SOLUTION

NextCloud News needs to be able to use alternative user-agents. Either automatically pick something that would be unique, like the server's domain name, or add a field to settings to set a custom user agent.
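
As a rough illustration of the two options proposed here, the sketch below picks a User-Agent from either a custom setting or the instance's domain name. It is written in Python for brevity and is not how Nextcloud News is implemented. `build_user_agent`, its parameters, and the `(+https://...)` suffix format are all assumptions made for this example.

```python
DEFAULT_USER_AGENT = "NextCloud-News/1.0"  # string quoted in this issue


def build_user_agent(custom=None, instance_domain=None, base=DEFAULT_USER_AGENT):
    """Choose a User-Agent: admin override, then per-instance, then default.

    All names here are hypothetical; this only sketches the proposal.
    """
    if custom:
        candidate = custom.strip()
    elif instance_domain:
        # Append the instance's domain so each server looks distinct.
        candidate = f"{base} (+https://{instance_domain.strip()})"
    else:
        candidate = base
    # Reject line breaks so a setting cannot inject extra HTTP headers.
    if "\r" in candidate or "\n" in candidate:
        raise ValueError("User-Agent must not contain line breaks")
    return candidate
```

With no arguments the function returns the existing default, so behaviour would only change for administrators who opt in.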

SMillerDev commented 1 year ago

> NextCloud News needs to be able to use alternative user-agents. Either automatically pick something that would be unique, like the server's domain name, or add a field to settings to set a custom user agent.

No, if RSS feeds feel the need to block news they have a reason for that. I'm not starting an arms race with RSS authors.

lbdroid commented 1 year ago

> > NextCloud News needs to be able to use alternative user-agents. Either automatically pick something that would be unique, like the server's domain name, or add a field to settings to set a custom user agent.
>
> No, if RSS feeds feel the need to block news they have a reason for that. I'm not starting an arms race with RSS authors.

Because they don't realize that these are self-hosted instances of Nextcloud. It's not about a war or an arms race. It's about clearly differentiating each instance of Nextcloud News from all the others.

SMillerDev commented 1 year ago

But that's not what user agents are. User agents are meant to identify the software doing the request. Everyone with the same Google Chrome version has the same user agent and that doesn't cause any problems.

lbdroid commented 1 year ago

You're obviously right about that, but that doesn't mean that these media companies are SMART.

SMillerDev commented 1 year ago

So educate them about why they should not block this user agent.

lbdroid commented 1 year ago

That is an impossibility, which you are well aware of. The only viable option here is to alter the program, and since there is no justification to NOT adjust the program, that is where the change should be.

SMillerDev commented 1 year ago

The justification is this: Nextcloud News should not try to circumvent restrictions by misrepresenting the user agent. It's the wrong use of a user agent, and your suggestion would allow for extensive fingerprinting of users. All for a temporary benefit until the authors of these feeds find a better regex to block Nextcloud.
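
To make the "better regex" point concrete, here is a small Python sketch of how a server-side block rule might behave. Both patterns are invented for illustration; the rule the feed actually uses is not known from this thread.

```python
import re

# Hypothetical block rules; the feed's real rule is unknown.
NARROW_RULE = re.compile(r"^NextCloud-News/")
BROAD_RULE = re.compile(r"cloud-news/", re.IGNORECASE)


def is_blocked(user_agent, rule):
    """Return True if the rule matches anywhere in the User-Agent string."""
    return rule.search(user_agent) is not None


# The renamed agent slips past the narrow rule but not the broader one.
renamed = "NotCloud-News/1.0"
print(is_blocked(renamed, NARROW_RULE), is_blocked(renamed, BROAD_RULE))
```

Under these assumed rules, renaming the agent only helps until the pattern is widened, which is the temporary benefit described above.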

lbdroid commented 1 year ago

The user/administrator should always have control over things like user agent.

SMillerDev commented 1 year ago

I've looked through the documentation online and they all seem to agree that whatever you make your connection with should set it. Do you have some documentation to support your claim that the user should always control this?

harlows commented 1 year ago

I'm having the same problem. For me, it presents when I try to add many WordPress-generated feeds. If I rinse the feeds through a service like RSS-proxy they import.

IgorA100 commented 1 year ago

I thought about this problem for a long time. @SMillerDev (as a developer) and @lbdroid (as a user) are both right. There are probably maniacal admins of RSS feeds, but you still want to read their feeds! In any case, the final word remains with the developer. But I would like to be able to change the User-Agent in the settings.