Any Google Alerts RSS, not fetching and crawling the source page.

doorkey commented 6 years ago

Any idea how we can fix that? Maybe Google does this on purpose, but is there a workaround using the scraper rules? Thanks a lot for any help.

fguillot commented 6 years ago

You should try to dig into it to understand what happens. Download the page via curl and check what you got. Then try to adjust the content scraper by using a custom rule.

doorkey commented 6 years ago

I ran curl on the RSS feed that Google provides. Here is the result.

Response body example (showing only the relevant part of the response):

<title>Google Alert - work+visa</title> <link href="https://www.google.com/alerts/feeds/16468676419519574922/6359205768005350255" rel="self"></link>  <entry>  <title type="html">Malaysian former minister ready to face probe over Nepali &lt;b&gt;workers&lt;/b&gt;&amp;#39; scam</title> 
<link href="https://www.google.com/url?rct=j&amp;sa=t&amp;url=http://kathmandupost.ekantipur.com/news/2018-07-24/malaysian-former-minister-ready-to-face-probe-over-nepali-workers-scam.html</link></entry>

So apparently the response contains the source link in the <link></link> tag on the second line above, inside the url parameter. Can you help me write a custom rule to extract that URL? Or, what methods can we use in custom rules? I only see an input box there. Any help would be highly appreciated.

Cheers. I deeply thank you for this effort sir.

fguillot commented 6 years ago

You posted the XML feed content, not the entry URL.

In your example, the first entry URL is https://www.google.com/url?rct=j&sa=t&url=http://kathmandupost.ekantipur.com/news/2018-07-24/malaysian-former-minister-ready-to-face-probe-over-nepali-workers-scam.html&ct=ga&cd=CAIyGmI0Yjc3ZjZhYTIxYTZmNGM6Y29tOmVuOlVT&usg=AFQjCNFMgA66XKXPMS5Dt0SMG4iBmdshyA.

If you run curl -v "https://www.google.com/url?rct=j&sa=t&url=http://kathmandupost.ekantipur.com/news/2018-07-24/malaysian-former-minister-ready-to-face-probe-over-nepali-workers-scam.html&ct=ga&cd=CAIyGmI0Yjc3ZjZhYTIxYTZmNGM6Y29tOmVuOlVT&usg=AFQjCNFMgA66XKXPMS5Dt0SMG4iBmdshyA" (quoted, so the shell does not split the URL at each &), you get:

> GET /url?rct=j&sa=t&url=http://kathmandupost.ekantipur.com/news/2018-07-24/malaysian-former-minister-ready-to-face-probe-over-nepali-workers-scam.html&ct=ga&cd=CAIyGmI0Yjc3ZjZhYTIxYTZmNGM6Y29tOmVuOlVT&usg=AFQjCNFMgA66XKXPMS5Dt0SMG4iBmdshyA HTTP/1.1
> Host: www.google.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 25 Jul 2018 00:35:09 GMT
< Pragma: no-cache
< Expires: Fri, 01 Jan 1990 00:00:00 GMT
< Cache-Control: no-cache, must-revalidate
< Content-Type: text/html; charset=ISO-8859-1
< P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
< Server: gws
< X-XSS-Protection: 1; mode=block
< Set-Cookie: NID=135=i6-BTuN1Ho5zwt8GOImVJShxKntqM3sb3Qu_YOryr7SAgc0eV7mBo96q0UnyUyFsFUfhNCanC8F7l1qYmmN3VakqSlzOi0H9Pe4F-42fwXgcTahmCAT2BuwBYh4iDAi6; expires=Thu, 24-Jan-2019 00:35:09 GMT; path=/; domain=.google.com; HttpOnly
< Alt-Svc: quic=":443"; ma=2592000; v="44,43,39,35"
< Accept-Ranges: none
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
<
<script>window.googleJavaScriptRedirect=1</script><script>var n={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};n.navigateTo(window.parent,window,"http://kathmandupost.ekantipur.com/news/2018-07-24/malaysian-former-minister-ready-to-face-probe-over-nepali-workers-scam.html");
</script><noscript><META http-equiv="refresh" content="0;URL='http://kathmandupost.ekantipur.com/news/2018-07-24/malaysian-former-minister-ready-to-face-probe-over-nepali-workers-scam.html'"></noscript>

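Google is not using a standard HTTP redirect: the response is a 200 OK rather than a 3xx with a Location header. Instead, it redirects with some JavaScript, with a meta refresh fallback inside the noscript tag. Miniflux won't be able to fetch the linked page.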

What could be done is to develop a specific rewrite mechanism that extracts the final URL from the Google Alerts feed and uses it as the entry URL. But that requires some changes in the code base.
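A minimal sketch of that idea, written in Python purely for illustration (Miniflux itself is written in Go, and this is not Miniflux code). It assumes the final URL is always carried in the `url` query parameter of a `www.google.com/url` link, as in the example above. The function name `unwrap_google_redirect` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(entry_url: str) -> str:
    """Return the target of a www.google.com/url?...&url=<target> link.

    Hypothetical helper: anything that does not look like a Google
    redirect link is returned unchanged.
    """
    parts = urlsplit(entry_url)
    if parts.netloc != "www.google.com" or parts.path != "/url":
        return entry_url

    # parse_qs maps each query parameter to a list of decoded values.
    targets = parse_qs(parts.query).get("url")
    return targets[0] if targets else entry_url
```

One caveat with this approach: if the target URL carries its own unencoded query string, everything after its first `&` would be read as a parameter of the Google link and lost, so a real implementation would need to check how Google encodes such targets.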

doorkey commented 6 years ago

Bro, I really appreciate the effort and response.

So I have a dev working with me. Any pointers on where she should look in the codebase to change how this entry URL is obtained?

Again, I really appreciate it.

Cheers.

doorkey commented 6 years ago

We could share the code with the community so that anyone looking into Google Alerts could find our solution useful :)

somini commented 6 years ago

Judging from that output, I think the best bet would be to create a Scraper Rule or Rewrite Rule that parses it and changes the entry's URL. The final URL is available in the noscript tag, which seems easy enough to parse.
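To illustrate how small that parsing step is, here is a rough sketch in Python (not Miniflux code, which is written in Go). It assumes the response body keeps the `<META http-equiv="refresh" content="0;URL='...'">` shape shown in the curl output above. The names `find_meta_refresh_target` and `_MetaRefreshFinder` are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Matches a refresh value such as: 0;URL='http://example.org/page.html'
_REFRESH_URL = re.compile(r"^\s*\d+\s*;\s*url\s*=\s*(.+?)\s*$", re.IGNORECASE)


class _MetaRefreshFinder(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.target: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        match = _REFRESH_URL.match(attributes.get("content") or "")
        if match:
            # Drop the optional quotes around the URL.
            self.target = match.group(1).strip("'\"")


def find_meta_refresh_target(html: str) -> Optional[str]:
    """Return the meta refresh URL found in html, or None if there is none."""
    finder = _MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.target
```

Parsing the page this way costs an extra HTTP request per entry, whereas reading the `url` parameter of the feed link needs none. The noscript route is mainly useful as a fallback if the link format changes.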

fguillot commented 5 years ago

If you guys still need this feature, then send a pull request.

miniflux / v2

Any Google Alerts RSS, not fetching and crawling the source page. #186