thisisparker / xword-dl

⬛⬜⬛ Command line tool to scrape crosswords from online solvers and save them as .puz files ⬛⬜⬛
MIT License

WSJ downloads failing because of anti-scraping mechanism #178

Open thisisparker opened 5 months ago

thisisparker commented 5 months ago

WSJ is returning 401/403 errors to requests made with the Python requests library, including those from xword-dl. My guess is that this is a response to traffic patterns they're seeing and that they will turn it off again in due course, but that's a waiting game.

In the meantime, the error message should probably differentiate between this kind of connection error and a parsing error (right now every failure reads like a parsing error).
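One way that distinction could look is sketched below. This is only an illustration, not xword-dl's actual code: the exception names, `check_response`, and `describe_failure` are all hypothetical, and the status code is passed in as a plain integer so the example runs without any network access.

```python
class DownloadError(Exception):
    """Base class for failures while fetching a puzzle (hypothetical name)."""


class AccessBlockedError(DownloadError):
    """The server refused the request, e.g. with HTTP 401 or 403."""


class PuzzleParseError(DownloadError):
    """The page was fetched, but the puzzle data could not be read."""


# Status codes treated as "the site is refusing us" rather than a bad page.
BLOCKED_STATUSES = {401, 403}


def check_response(status_code, url):
    """Raise a specific error for a failed HTTP status; return None on success."""
    if status_code in BLOCKED_STATUSES:
        raise AccessBlockedError(
            f"{url} returned HTTP {status_code}; "
            "the site may be blocking automated requests."
        )
    if status_code >= 400:
        raise DownloadError(f"{url} returned HTTP {status_code}.")


def describe_failure(exc):
    """Turn an exception into a user-facing message that names the real cause."""
    if isinstance(exc, AccessBlockedError):
        return f"Connection refused by the site: {exc}"
    if isinstance(exc, PuzzleParseError):
        return f"Unable to parse puzzle data: {exc}"
    return f"Download failed: {exc}"
```

With something like this in place, a 401/403 would surface as a connection problem, and only a genuinely unreadable page would be reported as a parsing problem.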

thisisparker commented 5 months ago

Looking into it: the blocking appears to be operated by a company called Datadome, and they're setting and checking a cookie called datadome with a long token value. Theoretically we could send that value with our requests, much like an auth token, but I'd rather not have to do that. Still hoping this is temporary!
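For reference, sending the cookie along would look roughly like the sketch below. It uses the standard library's `urllib.request` only so the example runs without extra dependencies and without touching the network; the function name, URL, and token value are placeholders, not anything from xword-dl or WSJ. With the requests library, the equivalent would be passing `cookies={"datadome": token}` to `requests.get`.

```python
from urllib.request import Request


def build_request_with_datadome(url, token):
    """Return an unsent request that carries the datadome cookie.

    Hypothetical helper: the token would have to be copied from a real
    browser session, which is the maintenance burden described above.
    """
    return Request(url, headers={"Cookie": f"datadome={token}"})


# Placeholder values for illustration only; nothing is sent here.
req = build_request_with_datadome("https://example.com/puzzle", "PLACEHOLDER_TOKEN")
```

The obvious drawback is that the token has to come from somewhere, and it stops working once the site decides it has expired.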

thisisparker commented 4 months ago

Maybe fixed this with #183, though I'm not thrilled with maintaining a list of random cookies that are required for each site and I don't know how long datadome cookies last anyway. Leaving open for now :roll_eyes:

thisisparker commented 3 months ago

Unsurprisingly, datadome tokens turn out to be very short-lived: on the order of hours, I guess? Maybe back to the drawing board here.