WSJ downloads failing because of anti-scraping mechanism

thisisparker / xword-dl

⬛⬜⬛ Command line tool to scrape crosswords from online solvers and save them as .puz files ⬛⬜⬛

MIT License

149 stars, 32 forks

WSJ downloads failing because of anti-scraping mechanism #178

Status: Open. Opened by thisisparker 10 months ago

thisisparker commented 10 months ago

WSJ is returning 401/403 errors to requests made with the Python requests library, including those from xword-dl. My guess is that this is a response to traffic patterns they're seeing and that they will turn it off again in due course, but that's a waiting game.

In the meantime, the error message should probably differentiate between this kind of connection error and a parsing error (which is what every failure looks like right now).
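A minimal sketch of that split, not the actual xword-dl code: the exception names and the `check_status` and `parse_puzzle_json` helpers are hypothetical, and it assumes the puzzle payload is JSON. The idea is just to raise one exception type when the server refuses the request and a different one when the response arrives but can't be parsed.

```python
import json


class ConnectionProblem(Exception):
    """Hypothetical: the server refused or failed the HTTP request."""


class ParsingProblem(Exception):
    """Hypothetical: a response arrived but no puzzle could be read from it."""


def check_status(status_code, url):
    """Raise ConnectionProblem for HTTP error statuses, with a hint on 401/403."""
    if status_code in (401, 403):
        raise ConnectionProblem(
            f"{url} returned HTTP {status_code}; "
            "the site may be blocking automated requests."
        )
    if status_code >= 400:
        raise ConnectionProblem(f"{url} returned HTTP {status_code}.")


def parse_puzzle_json(text):
    """Decode the response body, reporting failures as ParsingProblem."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ParsingProblem(f"Could not parse puzzle data: {err}") from err
```

With something like this, a blocked request would surface as "may be blocking automated requests" rather than as a parse failure.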

thisisparker commented 10 months ago

Looking into it: the blocking appears to be operated by a company called DataDome, which sets and checks a cookie called datadome with a long token value. Theoretically we could send that value with our requests, much like an auth token, but I'd rather not have to do that. Still hoping this is temporary!
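For illustration only, here is roughly what sending that cookie would look like. This is a sketch using the standard library's `urllib.request` so it stays dependency-free. The `build_request` helper and the placeholder URL and token are hypothetical, and it assumes the token was copied by hand from a browser session. With requests, the equivalent would be passing a `cookies={"datadome": token}` argument.

```python
import urllib.request


def build_request(url, datadome_token):
    """Build a request that carries the datadome cookie (hypothetical helper).

    Assumption: datadome_token was copied from a browser session that
    already loaded the site; nothing here obtains or refreshes it.
    """
    headers = {"Cookie": f"datadome={datadome_token}"}
    return urllib.request.Request(url, headers=headers)


# Placeholder values, not a real endpoint or token.
req = build_request("https://example.com/puzzle", "TOKEN_FROM_BROWSER")
print(req.get_header("Cookie"))
```

Nothing is sent until the request is passed to `urllib.request.urlopen`, and the value has to be replaced by hand whenever the token expires.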

thisisparker commented 9 months ago

Maybe fixed this with #183, though I'm not thrilled with maintaining a list of random cookies that are required for each site and I don't know how long datadome cookies last anyway. Leaving open for now :roll_eyes:

thisisparker commented 9 months ago

Unsurprisingly, datadome tokens turn out to be very short-lived, on the order of hours, I guess? Maybe back to the drawing board here.

crosswordnexus commented 4 months ago

You don't want to just pull from Martin Herbach's site? http://herbach.dnsalias.com/wsj/wsj240720.puz

thisisparker commented 4 months ago

Nope, not in xword-dl itself. Obviously that's a good option for end users who want it, but I've made the design decision that this tool only uses first-party sources and does its own scraping and parsing.