kebernet / shortyz

Shortyz Crosswords
GNU General Public License v3.0

Chronicle of Higher Education not downloading #120

Open reubengann opened 6 years ago

reubengann commented 6 years ago

The puzzles at the Chronicle of Higher Education are still freely available (https://www.chronicle.com/section/Crosswords/43) but my Shortyz isn't downloading them and hasn't for several months. I can manually download them and move them into my crosswords folder, so there's nothing wrong with the puz files.

I'm on the latest version on a Pixel 1, Android 8.0.0

ThatGitJer commented 6 years ago

Same issue for me - last successful in-game download was Oct 27.

The URLs haven't changed, so I suspect the server is blocking "hot-linking" (directly accessing the puz file without visiting the website). It might be fixed by adding a "Referer" header to the HTTP request (the HTTP spec spells it "Referer", not "Referrer"), set to either "chronicle.com" or the full "www.chronicle.com/section/Crosswords/43". It could also be the "User-Agent" value that's causing the problem.
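A minimal sketch of the header idea, assuming the plain JDK `HttpURLConnection` rather than whatever HTTP client Shortyz actually uses. The class name `BrowserLikeRequest` and the method `open` are hypothetical, and the puzzle URL in the usage note is a placeholder, not the real Chronicle download URL. Nothing is sent over the network until the caller reads the response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: prepares a GET request that carries the two headers
// discussed above. It only configures the connection; it does not connect.
final class BrowserLikeRequest {

    static HttpURLConnection open(String url, String referer, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            // "Referer" is the header name as spelled in the HTTP specification.
            conn.setRequestProperty("Referer", referer);
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Usage would look something like `BrowserLikeRequest.open(puzzleUrl, "https://www.chronicle.com/section/Crosswords/43", someBrowserUserAgent)` followed by `getResponseCode()`. Whether the server actually checks either header is a guess at this point.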

dro2 commented 6 years ago

Is there a similar site for manually downloading the Wall Street Journal puzzles? Shortyz often doesn't download them for me. Manually downloading the Chronicle's puzzles works great.

ThatGitJer commented 6 years ago

There's this: https://blogs.wsj.com/puzzle/
As far as I can tell, it only allows you to download as PDF or play online.

You could also try directly accessing the URL that Shortyz uses: http://herbach.dnsalias.com/wsj/wsj180119.puz
This is for the 2018/01/19 puzzle; edit the URL for other dates. Plus, as I just discovered, it works for days other than Friday & Saturday! (WSJ has been publishing Mon-Sat since mid-2016.)
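For anyone scripting this, here is a small sketch of the date-to-URL pattern, inferred only from the single example above (`wsj180119.puz` for 2018/01/19, so two-digit year, month, day). The class name `WsjPuzzleUrl` is made up for illustration, and it makes no claim about which dates the server actually has.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: builds the download URL for a given puzzle date,
// assuming the file name is "wsj" + yyMMdd + ".puz" as in the example URL.
final class WsjPuzzleUrl {

    private static final String BASE = "http://herbach.dnsalias.com/wsj/wsj";
    private static final DateTimeFormatter YYMMDD =
            DateTimeFormatter.ofPattern("yyMMdd", Locale.ROOT);

    static String forDate(LocalDate date) {
        return BASE + date.format(YYMMDD) + ".puz";
    }
}
```

`WsjPuzzleUrl.forDate(LocalDate.of(2018, 1, 19))` reproduces the URL quoted above.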

There is an open issue for this problem: https://github.com/kebernet/shortyz/issues/74 (FWIW, I personally can't recall ever having any problems)

ThatGitJer commented 6 years ago

Regarding the Chronicle - I'm now thinking it's to do with the switch from HTTP to HTTPS. The web browser automatically redirects, but Shortyz doesn't seem to handle this properly.

mr-salty commented 6 years ago

I took a crack at this and the problem appears to be they're actively blocking the requests from shortyz (well, probably not shortyz specifically, but automated downloads).

The response code you get back is 416 "Requested Range Not Satisfiable", despite the fact we're not sending a Range in the request.

I changed the User-Agent to match Chrome, and possibly copied some other headers:

headers.put("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36");

That got rid of the 416 and you get a successful response, but the body contains JavaScript that redirects you to this page saying you're blocked: http://www.chronicle.com/distil_r_blocked.html
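Since a 200 response can still be the block page, a downloader probably wants to sanity-check the body before saving it. A rough sketch, with two assumptions: that a .puz file carries the ASCII marker "ACROSS&DOWN" starting at byte offset 2 (per the commonly circulated description of the format), and that the block response mentions "distil_r_blocked" somewhere in its text. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check: distinguishes a real puzzle from the block page.
final class DownloadCheck {

    // Assumed .puz file marker and its offset (after a 2-byte checksum).
    private static final byte[] PUZ_MAGIC =
            "ACROSS&DOWN".getBytes(StandardCharsets.US_ASCII);
    private static final int PUZ_MAGIC_OFFSET = 2;

    static boolean looksLikePuz(byte[] body) {
        if (body == null || body.length < PUZ_MAGIC_OFFSET + PUZ_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PUZ_MAGIC.length; i++) {
            if (body[PUZ_MAGIC_OFFSET + i] != PUZ_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Assumption: the JavaScript response names the block page it redirects to.
    static boolean looksLikeBlockPage(byte[] body) {
        return body != null
                && new String(body, StandardCharsets.ISO_8859_1).contains("distil_r_blocked");
    }
}
```

A check like this would at least keep Shortyz from storing the JavaScript as if it were a puzzle.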

(FWIW 'wget' also gets the 416, while 'curl' gets the javascript/block page)

I copied all of the Chrome headers exactly except for Accept-Encoding and Cookie, and still had no luck.

I was briefly excited when I changed Accept-Encoding to 'gzip, deflate' like Chrome - but although I got back gzip, if you set that header yourself it completely inhibits okhttp's built-in gunzip. So, I figured out how to decompress it myself and realized it was just a gzipped version of the same javascript, rather than the puzzle.
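For reference, the manual decompression can be sketched with the JDK's `GZIPInputStream`. This is not the exact code I used; the class name `BodyDecoder` is hypothetical, and it decides whether to decompress by looking at the gzip magic bytes (0x1f 0x8b) rather than at the Content-Encoding response header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: returns the body unchanged unless it starts with the
// gzip magic bytes, in which case it returns the decompressed bytes.
final class BodyDecoder {

    static byte[] gunzipIfNeeded(byte[] body) {
        boolean gzipped = body.length >= 2
                && (body[0] & 0xff) == 0x1f
                && (body[1] & 0xff) == 0x8b;
        if (!gzipped) {
            return body;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The simpler option is to not set Accept-Encoding at all and let the HTTP library negotiate and decompress on its own.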

So, as a last ditch effort I sent the cookies that Chrome was sending - and lo and behold the download worked. But, I don't think hard-coding a fixed set of cookies is a viable solution, and I don't know how to make shortyz get a "legit" set of cookies.
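One untested direction would be a cookie jar: request the crosswords page first, keep whatever cookies come back, and replay them on the puzzle download. A sketch using the JDK's `java.net.CookieManager`, with a hypothetical wrapper class `SessionCookies`. It only helps if the site hands out its cookies through ordinary Set-Cookie response headers; if they are set by the JavaScript, a plain HTTP client will never see them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around an in-memory cookie jar.
final class SessionCookies {

    private final CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

    // Store any Set-Cookie headers from a response received for `url`.
    void remember(String url, Map<String, List<String>> responseHeaders) {
        try {
            jar.put(URI.create(url), responseHeaders);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Value to send as the Cookie header for `url`, or "" if nothing matches.
    String cookieHeaderFor(String url) {
        try {
            Map<String, List<String>> headers =
                    jar.get(URI.create(url), Collections.<String, List<String>>emptyMap());
            List<String> cookies = headers.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The flow would be: fetch the section page, call `remember` with its response headers, then set `cookieHeaderFor(puzzleUrl)` on the download request.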

ThatGitJer commented 6 years ago

That's odd - you are using HTTPS with port 443? And did you try adding a Referer line to the request header?

My revised theory was based on info gleaned from my VPN firewall app (NetGuard). Shortyz makes 2 requests: the first to chronicle.com:80, the second to www.chronicle.com:80. In Opera, the first request is to www.chronicle.com:80, but the second is to www.chronicle.com:443 - an HTTPS request.

NetGuard also has pcap ability, but it's a bit broken so all I get are the start of packets. But I can see that Shortyz's first request gets a "301 Moved Permanently" response, which presumably would include a forwarding address. My hunch is that this would include either https:// or specify port 443, which Opera handles correctly. However, Shortyz fails to recognize this and uses regular http with port 80 in the second attempt again, which results in the 416 code for some reason (this shows up in the pcap as "HTTP/1.1 416 Requested R" [sic]).
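To test the redirect hunch outside the app, the redirects can be followed by hand so each hop is visible. A sketch using the JDK's `HttpURLConnection` with automatic redirects switched off (on its own, that class does not follow a redirect that changes protocol from http to https). The class name `RedirectTracer` and its methods are hypothetical, and this says nothing about how Shortyz's own HTTP library behaves.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: follows 3xx responses manually, resolving the
// Location header against the URL that produced it.
final class RedirectTracer {

    // Location may be absolute ("https://www...") or relative ("/path").
    static URI resolveLocation(URI requested, String location) {
        return requested.resolve(location);
    }

    // Returns the final URL reached after at most maxHops redirects.
    static URI follow(String startUrl, int maxHops) throws IOException {
        URI current = URI.create(startUrl);
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (status / 100 != 3 || location == null) {
                return current; // not a redirect: this is the final URL
            }
            System.out.println(status + " " + current + " -> " + location);
            current = resolveLocation(current, location);
        }
        throw new IOException("Too many redirects starting from " + startUrl);
    }
}
```

If the printed hops show an http URL redirecting to https, that would confirm what the 301 is pointing at.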

However, you might already be a few steps ahead of this. But I thought it was worth mentioning that the 416 result code shows up in this sequence as well.

mr-salty commented 6 years ago

Yeah, I had changed the code to use https:// (and also added a "www." prefix to avoid the redirect, although the okhttp library is supposed to handle that).

I was just looking at data in the debugger (and Chrome's console to see what it was sending), because part of the point of this for me is learning Java; I've been writing C/C++ for 25+ years but working on Android apps makes me feel like a complete n00b. I might try Wireshark or something if I pick it up again.

But, this distil stuff is expressly for "bot detection" so what they're doing is definitely deliberate. So, I guess I should be happy that I was able to bypass it, but I'm sure if I hardcoded my current cookies into shortyz they would get blocked soon enough. I'll try clearing the cookies in Chrome to see what happens when it makes a request with no cookies and maybe we can emulate that in shortyz. Maybe one of the other downloaders already does? I didn't look at all of them.

Maybe shortyz could just invoke chrome to download the file; I'm not sure if that's viable (can it be done "seamlessly"?)

Really, I took a look at this bug hoping it was going to be a 10 minute fix but it turned out not to be... unless I'm missing something.