thefranke / rss-librarian

A read-it-later service for RSS purists
https://alternator.hstn.me/librarian.php
MIT License
13 stars, 2 forks

Dealing with paywalled/not easily extractable articles #1

Open faximan opened 12 hours ago

faximan commented 12 hours ago

Hey there, I stumbled upon this project when also looking to self-host an RSS feed of random articles that I want to read in NetNewsWire. Love it!

It works great. The only minor problem is that I haven't found a way to deal with articles like https://www.outsideonline.com/culture/essays-culture/instagram-travel-influencers-yosemite/. When extracting this URL, you only get the beginning of the article (the same happens in things like built-in browser "reader modes"), not the full content.

Previously, I solved this by going through e.g. archive.is to get a URL I could throw into Instapaper (https://archive.is/IRYN9), but this URL fails completely in rss-librarian: I get [unable to retrieve full-text content]. Maybe this is a FiveFilters issue?

Anyways, I was curious if you have any strategy to deal with such URLs. Another one (properly paywalled) would be https://www.nytimes.com/2024/09/13/technology/elon-musk-security.html.

faximan commented 11 hours ago

I worked around the archive.is-specific issue by using the trick in https://stackoverflow.com/questions/11680709/file-get-contents-give-me-403-forbidden.

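        if (file_exists($autoload))
        {
            require $autoload;

            // Fetch the page with cURL and a browser-like User-Agent,
            // since a plain file_get_contents() request gets a 403 here.
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
            curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
            $html = curl_exec($ch); // false if the request fails
            curl_close($ch);
        }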

And then remove the second

$html = file_get_contents($url);

further below (seems like a bug?)

Now I get the full text content.
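
For reference, the same idea can be wrapped in a small helper that also handles a failed request. This is only a sketch under assumptions: `fetch_html`, its default User-Agent string, the redirect handling, and the 15-second timeout are hypothetical choices for illustration and are not part of rss-librarian or FiveFilters.

```php
<?php
// Hypothetical helper (not part of rss-librarian): fetch a URL while sending
// a browser-like User-Agent, and return null instead of false on failure.
function fetch_html(
    string $url,
    string $userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'
): ?string {
    if (function_exists('curl_init')) {
        // Preferred path: cURL with an explicit User-Agent.
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // assumption: follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumption: 15 s is enough
        $html = curl_exec($ch);                         // false on failure
    } else {
        // Fallback when the cURL extension is missing: file_get_contents()
        // with a stream context that sets the same User-Agent.
        $context = stream_context_create([
            'http' => ['user_agent' => $userAgent, 'timeout' => 15],
        ]);
        $html = @file_get_contents($url, false, $context);
    }

    // Treat a failed or empty response as "nothing retrieved".
    return ($html === false || $html === '') ? null : $html;
}
```

A caller could then use `$html = fetch_html($url);` and skip extraction when the result is `null`, rather than passing `false` on to the extractor.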

faximan commented 9 hours ago

Proposed some changes in https://github.com/thefranke/rss-librarian/pull/2.