miniflux / v2

Minimalist and opinionated feed reader
https://miniflux.app
Apache License 2.0

Can't subscribe to Hacker News feed #117

Status: Closed (somini closed this issue 5 years ago)

somini commented 6 years ago

Website: https://news.ycombinator.com/news, feed URL: https://news.ycombinator.com/rss

Subscribing gives an "Unable to fetch feed (statusCode=503)" error.

I tested by putting the feed in an OPML file and importing that, and it succeeded.

etiennecrb commented 6 years ago

I have the same issue. But these feeds work.

fguillot commented 6 years ago

The problem is related to Cloudflare; this is probably their anti-bot system. The first HTTP request, which discovers the website/icon, works, but the second one, which fetches the feed, does not.

fguillot commented 6 years ago

It looks like Cloudflare is rate limiting the Hacker News website: if you make more than one HTTP request within one second, the extra request is blocked.

You could try with curl:

$ curl -I https://news.ycombinator.com/rss && curl -I https://news.ycombinator.com/rss
HTTP/1.1 200 OK
Date: Wed, 09 May 2018 04:53:49 GMT
Content-Type: application/rss+xml
Connection: keep-alive
Set-Cookie: __cfduid=d361ca5058ea776bc63102d05db4f123f1525841629; expires=Thu, 09-May-19 04:53:49 GMT; path=/; domain=.ycombinator.com; HttpOnly
Cache-Control: private; max-age=0
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Referrer-Policy: origin
Strict-Transport-Security: max-age=31556900
Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/; frame-src 'self' https://www.google.com/recaptcha/; style-src 'self' 'unsafe-inline'
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 4181908598a23b20-XXX

HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 09 May 2018 04:53:49 GMT
Content-Type: text/html
Content-Length: 537
Connection: keep-alive
Set-Cookie: __cfduid=dc270715d50b8423d7a9be5269c3acd6e1525841629; expires=Thu, 09-May-19 04:53:49 GMT; path=/; domain=.ycombinator.com; HttpOnly
ETag: "5a2ce78d-219"
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 418190871cdb3b38-XXX

But if you wait one second between requests, it works:

$ curl -I https://news.ycombinator.com/rss && sleep 1 && curl -I https://news.ycombinator.com/rss
HTTP/1.1 200 OK
Date: Wed, 09 May 2018 04:55:22 GMT
Content-Type: application/rss+xml
Connection: keep-alive
Set-Cookie: __cfduid=d75e2370c388bec136c7925edfb226e8d1525841721; expires=Thu, 09-May-19 04:55:21 GMT; path=/; domain=.ycombinator.com; HttpOnly
Cache-Control: private; max-age=0
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Referrer-Policy: origin
Strict-Transport-Security: max-age=31556900
Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/; frame-src 'self' https://www.google.com/recaptcha/; style-src 'self' 'unsafe-inline'
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 418192c4cabd3b20-XXX

HTTP/1.1 200 OK
Date: Wed, 09 May 2018 04:55:23 GMT
Content-Type: application/rss+xml
Connection: keep-alive
Set-Cookie: __cfduid=de951e2f9a929c59df8ff36796932437e1525841723; expires=Thu, 09-May-19 04:55:23 GMT; path=/; domain=.ycombinator.com; HttpOnly
Cache-Control: private; max-age=0
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Referrer-Policy: origin
Strict-Transport-Security: max-age=31556900
Content-Security-Policy: default-src 'self'; script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/; frame-src 'self' https://www.google.com/recaptcha/; style-src 'self' 'unsafe-inline'
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 418192d31c5a3b50-XXX
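
The behavior shown in the two curl runs above suggests one possible client-side workaround: space out requests to the same host by at least one second. The following is only a minimal sketch of that idea in Python, not what Miniflux actually does. The class name `HostThrottle`, the `min_interval` parameter, and the one-second default are assumptions for illustration; the real threshold applied by Cloudflare is not documented in this thread.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: keep requests to one host min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time the most recent request was allowed

    def wait(self, url):
        """Block until a request to url's host is allowed; return the seconds slept."""
        host = urlsplit(url).hostname
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Sleep only for the part of the interval that has not elapsed yet.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

A caller would invoke `throttle.wait(url)` immediately before each HTTP request, so that the discovery request and the feed request to the same host never land within the same second. Requests to different hosts are not delayed.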

w2ak commented 6 years ago

Hello! I also have a similar problem but I wasn't able to debug it. Feed URL: https://www.rts.ch/la-1ere/programmes/les-beaux-parleurs/podcast/?flux=rss

At first I imported it with OPML, so it was in my subscriptions, but Refresh said "error code 403" (which does not match what I get when opening the feed directly). Then I removed the imported subscription, and when creating a new one with this URL I get the error "Unable to find any subscription".

isavegas commented 6 years ago

@w2ak Looks like it's an invalid RSS feed, according to the W3C validator.

fguillot commented 6 years ago

@w2ak Your issue is not exactly the same as the original one. In your case, Akamai is blocking Miniflux based on the headers sent by the HTTP client. You can reproduce this behavior with curl:

$ curl -v -H "User-Agent: Mozilla/5.0 (compatible; Miniflux/2.0.7; +https://miniflux.net)" -H "Accept: */*" "https://www.rts.ch/la-1ere/programmes/les-beaux-parleurs/podcast/?flux=rss"
*   Trying 104.64.112.165...
* TCP_NODELAY set
* Connected to www.rts.ch (104.64.112.165) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate: *.rts.ch
* Server certificate: DigiCert SHA2 High Assurance Server CA
* Server certificate: DigiCert High Assurance EV Root CA
> GET /la-1ere/programmes/les-beaux-parleurs/podcast/?flux=rss HTTP/1.1
> Host: www.rts.ch
> User-Agent: Mozilla/5.0 (compatible; Miniflux/2.0.7; +https://miniflux.net)
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Server: AkamaiGHost
< Mime-Version: 1.0
< Content-Type: text/html
< Content-Length: 338
< Expires: Fri, 18 May 2018 03:44:09 GMT
< Date: Fri, 18 May 2018 03:44:09 GMT
< Connection: close
<
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;www&#46;rts&#46;ch&#47;la&#45;1ere&#47;programmes&#47;les&#45;beaux&#45;parleurs&#47;podcast&#47;&#63;" on this server.<P>
Reference&#32;&#35;18&#46;d5eafea5&#46;1526615049&#46;ae5d7d6
</BODY>
</HTML>

Removing either the User-Agent or the Accept header makes it work, even though both headers are valid. The web is a nasty place.
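
The header experiment above can be automated: send the full header set, then the same set with one header removed at a time, and compare status codes. The sketch below is a hypothetical diagnostic in Python, not part of Miniflux; the names `header_variants`, `http_status`, and `probe` are made up for illustration. Note one assumption that affects the result: when no User-Agent is supplied, `urllib` sends its own default one, much as curl does, so "removed" here really means "replaced by the client default".

```python
import urllib.error
import urllib.request

# Headers copied from the curl command in this thread.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; Miniflux/2.0.7; +https://miniflux.net)",
    "Accept": "*/*",
}


def header_variants(headers):
    """Yield the full header set, then each set with exactly one header removed."""
    yield dict(headers)
    for name in headers:
        yield {k: v for k, v in headers.items() if k != name}


def http_status(url, headers):
    """Return the HTTP status code of a GET request (performs network I/O)."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code


def probe(url, headers=DEFAULT_HEADERS, status_of=http_status):
    """Return a list of (sorted header names, status code), one per variant."""
    return [(sorted(h), status_of(url, h)) for h in header_variants(headers)]
```

Calling `probe(feed_url)` against a feed that is blocked this way would be expected to show a 403 for the full header set and a different status for at least one reduced set, which points to header-based filtering rather than rate limiting.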

allanbreyes commented 6 years ago

FWIW, another workaround is to use https://github.com/edavis/hnrss, self-hosted or via hnrss.org. I use it to get more granular control of HN feeds, but it might be worth seeing how they avoid rate-limiting.

somini commented 6 years ago

Cloudflare strikes again. Thanks for debugging this, @fguillot.

I'm running Miniflux on a RPi3 for myself, so I wanted to avoid having to install many dependencies on it. At least it's not PHP...

somini commented 6 years ago

It seems hnrss was rewritten in Go: https://github.com/edavis/go-hnrss

I forked it on GitLab and set up CI so that binaries are built automatically: https://gitlab.com/somini/go-hnrss. Anyone can reuse the binaries; you can confirm that the only commits I made configure the CI. I run this on a separate port, configure nginx to proxy it, and subscribe to the feeds as usual.

This is good enough for me, so @fguillot can close this, or add it to the docs.