pictuga / morss

Get full text RSS feeds
https://morss.it/
GNU Affero General Public License v3.0

Full text can't be fetched from certain news sites (example included) #128

Open BryanWall opened 3 months ago

BryanWall commented 3 months ago

I tried to create a feed using the complex CSS selector in RSS-bridge but couldn't get it to fetch the full text. The full text of each article is viewable in a browser, but this is one of those sites that lets you view only a certain number of articles before putting up a paywall. It seems to rely only on cookies for that, though, so a private browsing session gets around it.

The way the site is formatted still prevents RSS-bridge from getting the full text. The "hidden" content is just stored in divs with class="subscriber-only", which I assume are hidden with CSS once you've exceeded the article limit. The text itself is never removed from the page.
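To illustrate what "not removed from the page" means in practice, here is a minimal sketch that collects the text inside divs carrying the subscriber-only class, using only Python's standard library. The sample markup is invented for the example and only assumes the structure described above; the class `SubscriberOnlyText` and the other names are hypothetical, and this is not how morss or RSS-bridge extract content.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are ignored for depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SubscriberOnlyText(HTMLParser):
    """Collect text found inside <div class="subscriber-only"> elements.

    Assumes reasonably well-formed markup (every non-void tag is closed).
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a subscriber-only div
        self.chunks = []  # text fragments found inside such divs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif tag == "div" and "subscriber-only" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


# Invented sample page; only the class name comes from the description above.
sample = """
<article>
  <p>Free intro paragraph.</p>
  <div class="subscriber-only"><p>Hidden paragraph one.</p></div>
  <div class="subscriber-only hide"><p>Hidden paragraph two.</p></div>
</article>
"""

parser = SubscriberOnlyText()
parser.feed(sample)
parser.close()
print(parser.chunks)
```

Only the two paragraphs inside the subscriber-only divs end up in `parser.chunks`; the free intro paragraph is skipped.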

When I couldn't figure it out in RSS-bridge, I googled for solutions and found morss. I tried creating a feed that just fetches the article links and then passing that to morss, but morss runs into the same issue getting the full text. Here's a sample of the RSS-bridge feed I made, which you can try with morss to see the issue.

https://rss-bridge.org/bridge01/?action=display&bridge=XPathBridge&url=https%3A%2F%2Fwww.paducahsun.com%2Fsearch%2F%3Fk%3D%2522mccracken%2520county%2520public%2520schools%2522%23tncms-source%3Dkeyword&item=%2F%2Farticle&title=.%2F%2Fa%2F%40aria-label&content=.%2F%2Fp%2Ftext%28%29&uri=.%2F%2Fa%5B%40class%3D%22tnt-asset-link%22%5D%2F%40href&author=&timestamp=.%2F%2Ftime%2F%40datetime&enclosures=.%2F%2Fimg%2F%40data-srcset%5B1%5D&categories=&format=Json
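For readability, the XPath expressions packed into the feed URL above can be recovered by decoding its query string. A minimal sketch using only Python's standard library; the variable names are illustrative:

```python
from urllib.parse import urlsplit, parse_qsl

# The feed URL from this post, split across lines for readability.
feed_url = (
    "https://rss-bridge.org/bridge01/?action=display&bridge=XPathBridge"
    "&url=https%3A%2F%2Fwww.paducahsun.com%2Fsearch%2F%3Fk%3D%2522mccracken"
    "%2520county%2520public%2520schools%2522%23tncms-source%3Dkeyword"
    "&item=%2F%2Farticle"
    "&title=.%2F%2Fa%2F%40aria-label"
    "&content=.%2F%2Fp%2Ftext%28%29"
    "&uri=.%2F%2Fa%5B%40class%3D%22tnt-asset-link%22%5D%2F%40href"
    "&author="
    "&timestamp=.%2F%2Ftime%2F%40datetime"
    "&enclosures=.%2F%2Fimg%2F%40data-srcset%5B1%5D"
    "&categories=&format=Json"
)

# keep_blank_values=True preserves the empty author and categories parameters.
params = dict(parse_qsl(urlsplit(feed_url).query, keep_blank_values=True))
for name, value in params.items():
    print(f"{name:>10} = {value}")
```

Decoded, the `item` parameter is `//article` and the `content` parameter is `.//p/text()`.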

The CMS being used is Blox CMS (https://www.help.bloxdigital.com/blox_cms/community/access_control/). It is used by hundreds of newspaper and TV station websites, so figuring out a way to fetch the full text from one of them would probably fix every site that uses the same CMS.

BryanWall commented 3 months ago

This app has code for bypassing the paywalls on this site to load the full text:

https://github.com/bpc-clone?tab=repositories