Dealing with paywalled/not easily extractable articles

faximan commented 1 month ago

Hey there, I stumbled upon this project when also looking to self-host an RSS feed of random articles that I want to read in NetNewsWire. Love it!

It works great; the only minor problem is that I haven't found a way to deal with articles like https://www.outsideonline.com/culture/essays-culture/instagram-travel-influencers-yosemite/. When extracting this URL, you only get the beginning of the article (same as in things like built-in browser "reader modes"), not the full content.

Previously, I solved this by going through archive.is, for example, to get a URL I can throw into Instapaper (https://archive.is/IRYN9), but this URL fails completely in rss-librarian: I get [unable to retrieve full-text content]. Maybe this is a FiveFilters issue?

Anyways, I was curious if you have any strategy to deal with such URLs. Another one (properly paywalled) would be https://www.nytimes.com/2024/09/13/technology/elon-musk-security.html.

faximan commented 1 month ago

I worked around the archive.is-specific issue by using the trick in https://stackoverflow.com/questions/11680709/file-get-contents-give-me-403-forbidden.

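        if (file_exists($autoload))
        {
            require $autoload;

            // Fetch the page with cURL and a browser-like User-Agent instead of
            // the default file_get_contents() request, which was being rejected.
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
            curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
            $html = curl_exec($ch);
            curl_close($ch);
        }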

And then remove the second

$html = file_get_contents($url);

further below (seems like a bug?)

Now I get the full text content.
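For anyone who wants the same workaround as a reusable helper, here is a minimal sketch. The function names `fetch_html()` and `is_usable_response()`, the redirect handling, and the 20-second timeout are my own assumptions for illustration; none of them exist in rss-librarian, and the actual change in the project may look different.

```php
<?php
// Sketch only: fetch_html() and is_usable_response() are hypothetical helpers,
// not functions that exist in rss-librarian.

// True when the fetch produced a non-empty body and a non-error HTTP status.
function is_usable_response($body, int $status): bool
{
    return is_string($body) && $body !== '' && $status >= 200 && $status < 400;
}

// Fetch a page with cURL while sending a browser-like User-Agent.
// Returns the body on success, or null if the request failed or was refused.
function fetch_html(string $url, string $userAgent): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects to the final page
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_TIMEOUT        => 20,    // seconds; an arbitrary assumption
    ]);
    $body   = curl_exec($ch);            // false on transport errors
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return is_usable_response($body, $status) ? $body : null;
}
```

It would be called as `$html = fetch_html($url, $userAgent);` with the User-Agent string from the snippet above, and a `null` result can be mapped to the existing "unable to retrieve" path.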

faximan commented 1 month ago

Proposed some changes in https://github.com/thefranke/rss-librarian/pull/2.

thefranke commented 3 weeks ago

Hey @faximan, saw your posts just now! Thank you very much, I'll take a look at it shortly!

thefranke commented 3 weeks ago

Did a quick test and it seems to work fine, thank you for the fixes.

I plan to add one more step when creating a first feed so that users do not easily create and abandon feeds. Let me know if you have any additional ideas as well.
