thisisparker / xword-dl

⬛⬜⬛ Command line tool to scrape crosswords from online solvers and save them as .puz files ⬛⬜⬛
MIT License

Can xword-dl download NYT variety puzzles? #59

Closed eigenfoo closed 1 year ago

eigenfoo commented 1 year ago

(I've manually edited my xword-dl.yaml to circumvent #58)

# python xword-dl/xword_dl.py nyt --latest
Puzzle downloaded and saved as NY Times - 20221011.puz.

# python xword-dl/xword_dl.py https://www.nytimes.com/crosswords/game/variety/2022/10/02
Unable to find a puzzle at https://www.nytimes.com/crosswords/game/variety/2022/10/02.

I've determined that this is likely because NewYorkTimesDownloader isn't in supported_sites: https://github.com/thisisparker/xword-dl/blob/bbb4877300be6e25d25a92aefba728fbacabca18/xword_dl.py#L76-L78
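
For readers following along, here is a minimal sketch of how a domain-to-downloader lookup like supported_sites might behave. The class bodies, the example.com entry, and the find_downloader helper are hypothetical stand-ins, not the code in xword_dl.py; only the ('nytimes.com', NewYorkTimesDownloader) tuple shape comes from this thread.

```python
from urllib.parse import urlparse


class OtherDownloader:  # hypothetical stand-in for an already-supported site
    pass


class NewYorkTimesDownloader:  # hypothetical stand-in, not the real class
    pass


# Assumed shape: (domain fragment, downloader class) pairs.
supported_sites = [
    ("example.com", OtherDownloader),
]


def find_downloader(url, sites):
    """Return the first downloader class whose domain fragment appears in the URL's host."""
    host = urlparse(url).netloc
    for fragment, downloader in sites:
        if fragment in host:
            return downloader
    return None
```

With a lookup like this, an NYT URL finds no match until a ('nytimes.com', NewYorkTimesDownloader) entry is present, which would be consistent with the "Unable to find a puzzle" message above.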

However, adding ('nytimes.com', NewYorkTimesDownloader) to the list produces a JSON error, which I don't think I'm well-equipped to make sense of:

# python xword-dl/xword_dl.py https://www.nytimes.com/crosswords/game/variety/2022/10/02
Traceback (most recent call last):
  File "/home/george/pandas/xword-dl/xword_dl.py", line 1162, in <module>
    main()
  File "/home/george/pandas/xword-dl/xword_dl.py", line 1145, in main
    puzzle, filename = by_url(args.source,
  File "/home/george/pandas/xword-dl/xword_dl.py", line 104, in by_url
    puzzle = dl.download(puzzle_url)
  File "/home/george/pandas/xword-dl/xword_dl.py", line 267, in download
    xword_data = self.fetch_data(solver_url)
  File "/home/george/pandas/xword-dl/xword_dl.py", line 971, in fetch_data
    return res.json()['results'][0]
  File "/home/george/miniconda3/lib/python3.9/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/george/miniconda3/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/george/miniconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/george/miniconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
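
For what it's worth, "Expecting value: line 1 column 1 (char 0)" is the message Python's json module gives when the text does not begin with a JSON value at all, for example an empty body or an HTML page. Which of those the server actually returned here is not established in this thread. A small sketch that reproduces the message (describe_decode_failure is a hypothetical helper name):

```python
import json


def describe_decode_failure(body):
    """Return the JSONDecodeError message for a body, or None if it parses."""
    try:
        json.loads(body)
    except json.JSONDecodeError as err:
        return str(err)
    return None


# Neither an empty body nor an HTML page starts with a JSON value,
# so decoding fails at the very first character in both cases.
for body in ("", "<!DOCTYPE html><html></html>"):
    print(repr(body[:15]), "->", describe_decode_failure(body))
```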

thisisparker commented 1 year ago

It is possible in theory (for the subset of puzzles that can be represented in .puz files) but would require a little more code — namely, some glue to map from the URL to the underlying puzzle data. As it stands, when xword-dl is downloading an NYT puzzle it goes through a mostly undocumented oracle and downloads a JSON file directly.

That oracle step is necessary because there is a behind-the-scenes mapping of dates to puzzle IDs for the daily crossword, and the puzzle ID is needed to request the right JSON, but fortunately for Variety it looks like there is a more direct transformation of the URL to the underlying data (e.g., it looks like the JSON data for the puzzle you were seeking is at https://www.nytimes.com/svc/crosswords/v6/puzzle/variety/2022-10-02.json). I haven't looked to see if that file is structured the same as the daily crossword JSON, which would determine whether you could use the existing parsing function.
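
A rough sketch of that URL transformation, assuming the pattern inferred from the single example above holds for other Variety dates (that has not been verified here). The function name variety_json_url and the regular expression are illustrative, not part of xword-dl.

```python
import re

# Matches solver URLs of the form .../crosswords/game/variety/YYYY/MM/DD
VARIETY_URL = re.compile(
    r"nytimes\.com/crosswords/game/variety/(\d{4})/(\d{2})/(\d{2})/?$"
)


def variety_json_url(solver_url):
    """Map a Variety solver URL to the JSON URL pattern guessed at in this thread."""
    match = VARIETY_URL.search(solver_url)
    if match is None:
        raise ValueError(f"Not a recognized Variety puzzle URL: {solver_url}")
    year, month, day = match.groups()
    return (
        "https://www.nytimes.com/svc/crosswords/v6/puzzle/variety/"
        f"{year}-{month}-{day}.json"
    )
```

Because the date is embedded in the URL, no date-to-ID lookup is needed for this case, unlike the daily puzzle.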

So! It would require a little more code and I think I'd probably want to structure it as a subclass of the existing NewYorkTimesDownloader instead of expanding that class much more, but that seems doable. If you're interested I could add that, and you'd probably get by-url downloading of daily NYT puzzles "for free," which would be nice.
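
To illustrate the subclass idea, here is a sketch under stated assumptions. Both classes are hypothetical stand-ins (the subclass name, the url_fragment attribute, matches_url, and pick_downloader are invented for illustration), and the daily URL path is assumed by analogy with the Variety one. Whether parse_xword would also need overriding depends on how the Variety JSON is structured.

```python
class NewYorkTimesDownloader:
    """Hypothetical stand-in for the existing class, not the real implementation."""

    # Assumed path fragment for daily puzzle URLs.
    url_fragment = "nytimes.com/crosswords/game/daily/"

    @classmethod
    def matches_url(cls, url):
        """Report whether this downloader claims the given solver URL."""
        return cls.url_fragment in url

    def parse_xword(self, xword_data):
        # Placeholder for the existing parsing logic, inherited by subclasses.
        return xword_data


class NewYorkTimesVarietyDownloader(NewYorkTimesDownloader):
    """Hypothetical subclass that overrides only what differs for Variety puzzles."""

    url_fragment = "nytimes.com/crosswords/game/variety/"


def pick_downloader(url, candidates):
    """Instantiate the first candidate class that claims the URL, else None."""
    for cls in candidates:
        if cls.matches_url(url):
            return cls()
    return None
```

Keeping the Variety logic in its own subclass leaves the parent untouched, and giving the parent a URL matcher is what would make by-url downloading of daily puzzles come along with it.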

eigenfoo commented 1 year ago

Thank you for explaining!

> I haven't looked to see if that file is structured the same as the daily crossword JSON, which would determine whether you could use the existing parsing function.

Looking further, I see that this particular JSON (for a puns and anagrams puzzle, which theoretically could be represented as a .puz file) doesn't have the same structure that NewYorkTimesDownloader.parse_xword is expecting: for example, it has neither puzzle_meta nor puzzle_data fields. So, a new parsing function would be necessary.
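
A quick way to run that kind of structural check, sketched under the assumption that puzzle_meta and puzzle_data are top-level keys of the dict handed to the parser (missing_daily_keys is a hypothetical helper, and the sample dicts below are made up rather than real NYT data):

```python
# Keys the daily-format parser is described as expecting in this thread.
EXPECTED_KEYS = ("puzzle_meta", "puzzle_data")


def missing_daily_keys(xword_data):
    """Return the expected daily-format keys that are absent from the data."""
    return [key for key in EXPECTED_KEYS if key not in xword_data]


# Made-up examples: one dict shaped like the daily format, one that is not.
daily_like = {"puzzle_meta": {}, "puzzle_data": {}}
other = {"body": [], "constructors": []}

print(missing_daily_keys(daily_like))
print(missing_daily_keys(other))
```

An empty result would suggest the existing parser has a chance of working; any missing key points toward a separate parsing function.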

Would you be willing to take on this additional work, beyond the simple glue to translate the date to the JSON URL? I don't want to burden you with a much larger feature request.

For more background: I am interested in puns and anagrams and cryptic crossword downloads for https://github.com/eigenfoo/cryptics (both puzzle types can be represented as .puz files). Now that I know how to look up the JSONs though, I'm happy to just curl and sit on them until I get a chance to write parsing code (which, to be transparent, won't be anytime soon, since the NYT constitutes a very low volume of cryptic crosswords). Completely your call!

thisisparker commented 1 year ago

I think the answer is "possibly," although if you wrote the parsing code first I'd probably be just as happy to incorporate it. The longer answer is that there's a big refactor I've been meaning to do on xword-dl for like... many months now. It will make working with additional Downloaders easier, so I hesitate to do anything around those before then. But at some point soon it's going to happen, and then it will be an easy call to say yes.

thisisparker commented 1 year ago

Just a heads up: I recently completed the refactor described above and I think I will get a chance to add NYT Variety support this week.

thisisparker commented 1 year ago

Closed in v2022.11.16 🎉