palewire / news-homepages

An open-source archive that gathers, saves, shares and analyzes news homepages
https://homepages.news
GNU General Public License v3.0

[Fix site]: WSJ gets a captcha since January; no links #486

Open jeremybmerrill opened 1 month ago

jeremybmerrill commented 1 month ago

Screenshot

Screenshot via https://palewi.re/docs/news-homepages/sites/wsj.html. AFAIK this has been going on since 2024-01-16 16:11:00 (based on the last non-empty links JSON). I wonder if this is something that could be rectified (and backfilled) by scraping Internet Archive screenshots, or by asking WSJ to allowlist you. I also wonder if the captcha is part of WSJ's anti-AI-training-data-scraping efforts.

Have you circumvented captchas in other sites?

A solution is often found by adding JavaScript or CSS via a site-specific include, as covered in our documentation.

retaining this boilerplate from the template :)

palewire commented 1 month ago

I'm aware of this bug, but I haven't even begun to think about how to fix it. I'm open to patches, and to connections with people at WSJ I could consult.

jeremybmerrill commented 1 month ago

Fair enough. I'm curious why you're gathering the pages yourself rather than getting them from the Internet Archive (which appears to have circumvented the WSJ's limitations).

palewire commented 1 month ago

It's probably a longer story than anyone wants to hear. I launched the site in 2012, independently of archive.org, as a self-hosted service funded by a Kickstarter campaign. At that time the Wayback Machine was not archiving the homepages of major sites with much frequency. There have been several evolutions since, with the current one hosting assets for free through IA's generous "collections" system.

It would be possible to re-engineer the site to act as a supplement to Wayback's page captures. And perhaps I should move towards such a system. In the 12 years since I started, IA has ramped up its capture rate for big sites, though that's not always the case for lower-traffic sites.
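For anyone who wants to explore the backfill idea, here is a minimal sketch of listing Wayback Machine captures through its public CDX API. It is not part of this project's pipeline; the function names and the one-capture-per-day collapse are assumptions for illustration, and the start date is the one reported above for WSJ. Whether any given capture is itself free of the captcha would still need to be checked.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url: str, since: datetime, limit: int = 50) -> str:
    """Build a CDX query for HTTP 200 captures of `url` on or after `since`."""
    params = {
        "url": url,
        "from": since.strftime("%Y%m%d%H%M%S"),
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:8",  # assumption: at most one capture per day
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(rows: list[list[str]]) -> list[str]:
    """Turn CDX JSON rows (the first row is the header) into snapshot URLs."""
    if not rows:
        return []
    header, *captures = rows
    ts = header.index("timestamp")
    original = header.index("original")
    return [
        f"https://web.archive.org/web/{row[ts]}/{row[original]}"
        for row in captures
    ]


def fetch_captures(url: str, since: datetime) -> list[str]:
    """Hypothetical helper: query the CDX API over the network."""
    with urlopen(build_cdx_query(url, since), timeout=30) as resp:
        return snapshot_urls(json.load(resp))


if __name__ == "__main__":
    # Captures since the last non-empty links JSON reported for WSJ.
    for snapshot in fetch_captures("wsj.com", datetime(2024, 1, 16, 16, 11)):
        print(snapshot)
```

Each returned URL points at an archived copy of the homepage, which could then be screenshotted and parsed for links in place of a live fetch.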

jeremybmerrill commented 1 month ago

No, that's very useful context, thank you! It feels like the idea I'd be most open to implementing, scraping IA, would be a larger re-engineering of the project than I can really take on.

jeremybmerrill commented 3 weeks ago

Just flagging that Reuters has the same issue as of 2023-12-04 03:59:00.