thisisparker / xword-dl

⬛⬜⬛ Command line tool to scrape crosswords from online solvers and save them as .puz files ⬛⬜⬛
MIT License
139 stars 30 forks

Error downloading 15 Dec 2023 Crossword Club crossword #155

Closed fncll closed 6 months ago

fncll commented 6 months ago

I am unable to download the 15 Dec 2023 Crossword Club puzzle. I'm guessing because there's an emoji in a clue (15D)? But I'm not a coder :)

(xword-dl-env) bash-5.2$ xword-dl --version
2023.12.2
(xword-dl-env) bash-5.2$ xword-dl club -d 12/15/23
Traceback (most recent call last):
  File "/Users/chris/Library/CloudStorage/Dropbox/crosswords/xword-dl-env/bin/xword-dl", line 33, in <module>
    sys.exit(load_entry_point('xword-dl==2023.12.2', 'console_scripts', 'xword-dl')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/Library/CloudStorage/Dropbox/crosswords/xword-dl-env/lib/python3.11/site-packages/xword_dl-2023.12.2-py3.11.egg/xword_dl/xword_dl.py", line 243, in main
    save_puzzle(puzzle, filename)
  File "/Users/chris/Library/CloudStorage/Dropbox/crosswords/xword-dl-env/lib/python3.11/site-packages/xword_dl-2023.12.2-py3.11.egg/xword_dl/util/utils.py", line 28, in save_puzzle
    puzzle.save(filename)
  File "/Users/chris/Library/CloudStorage/Dropbox/crosswords/xword-dl-env/lib/python3.11/site-packages/puzpy-0.2.5-py3.11.egg/puz.py", line 225, in save
    puzzle_bytes = self.tobytes()
                   ^^^^^^^^^^^^^^
  File "/Users/chris/Library/CloudStorage/Dropbox/crosswords/xword-dl-env/lib/python3.11/site-packages/puzpy-0.2.5-py3.11.egg/puz.py", line 240, in tobytes
    self.global_cksum(), ACROSSDOWN.encode(ENCODING),
    ^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/Library/CloudStorage/Dropbox/crosswords/xword-dl-env/lib/python3.11/site-packages/puzpy-0.2.5-py3.11.egg/puz.py", line 369, in global_cksum
    cksum = self.text_cksum(cksum)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/Library/CloudStorage/Dropbox/crosswords/xword-dl-env/lib/python3.11/site-packages/puzpy-0.2.5-py3.11.egg/puz.py", line 357, in text_cksum
    cksum = data_cksum(clue.encode(ENCODING), cksum)
                       ^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f525' in position 0: ordinal not in range(256)

TIA!

thisisparker commented 6 months ago

Similar issue on today's (1/5/24) New Yorker. It is definitely related to the emoji, but I thought it had been able to handle emoji in some way before? In any case, it should probably replace emoji with something, and it can do that in the unidecode step, but I have to figure out what makes sense.

Maybe square brackets, all-caps, EMOJI at the end? [FIRE EMOJI] [FACE WITH TEARS OF JOY EMOJI]

thisisparker commented 6 months ago

OK actually: I hunted down the error and it's a little more "fun" than I'd previously imagined. The issue is that the emoji in Amuse solvers are properly escaped as HTML entities, so the 🔥 is encoded as &#x1f525;. I was then running that text through unidecode (which leaves the escaped entity intact) and then through html2text, which has the surprising but correct result of converting it into 🔥.

One simple way to solve this would be to reverse the processing order, and do the html2text step and then the unidecode step. (I don't remember why I did it the other way, but I think I had a good reason...) That would produce a valid puz file, I think, but would result in the character just getting stripped out.
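
For anyone following along, here is a minimal standard-library sketch of that ordering problem. It is not xword-dl's actual code: `decode_entities` uses `html.unescape` as a stand-in for the entity decoding that html2text does, and the hypothetical `to_latin1_safe` helper stands in for the unidecode step by simply dropping anything Latin-1 can't encode (unidecode transliterates rather than drops, so this is only an approximation). The clue text is made up.

```python
import html


def decode_entities(text: str) -> str:
    """Stand-in for html2text's entity handling (assumption, not the real call)."""
    return html.unescape(text)


def to_latin1_safe(text: str) -> str:
    """Hypothetical stand-in for the unidecode step: drop non-Latin-1 characters."""
    return text.encode("latin-1", errors="ignore").decode("latin-1")


def can_save(clue: str) -> bool:
    """True if the clue survives the Latin-1 encoding used when writing the .puz."""
    try:
        clue.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


raw_clue = "&#x1f525; emoji"  # made-up clue with the emoji escaped as an entity

# Original order: sanitize while the emoji is still plain ASCII entity text,
# then decode entities, which reintroduces the emoji after sanitizing.
sanitize_first = decode_entities(to_latin1_safe(raw_clue))

# Reversed order: decode entities first, so the sanitizer actually sees the emoji.
decode_first = to_latin1_safe(decode_entities(raw_clue))

print(repr(sanitize_first), can_save(sanitize_first))
print(repr(decode_first), can_save(decode_first))
```

In this sketch the first ordering still contains the emoji and cannot be encoded, while the second can be saved but has lost the emoji entirely.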

I think maybe the better solution is to introduce a new dependency that translates emoji into text. Since sending the above comment I realized that the obvious right way of making that conversion is to put it between colons, like :fire:. There is a crossword-specific problem with that, though, which is that the emoji name may dupe the clued word. Thinking about that...
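
A rough sketch of the colon-name idea without a new dependency, using the standard library's `unicodedata` names. The helper names here are hypothetical, and the `:name:` strings are derived from Unicode character names, so they are not guaranteed to match any particular shortcode set. It only touches characters in the Unicode "Symbol, other" category and treats each code point separately, so multi-code-point emoji (skin tones, ZWJ sequences) would need more care.

```python
import unicodedata


def emoji_to_name(char: str) -> str:
    """Build a ':name:' placeholder from the character's Unicode name."""
    name = unicodedata.name(char, "unknown symbol")
    return ":" + name.lower().replace(" ", "_") + ":"


def name_emoji(text: str) -> str:
    """Replace 'Symbol, other' characters that Latin-1 cannot encode with names.

    Everything else is left alone, on the assumption that a later
    transliteration step deals with accents, curly quotes, and so on.
    """
    out = []
    for ch in text:
        if ord(ch) > 0xFF and unicodedata.category(ch) == "So":
            out.append(emoji_to_name(ch))
        else:
            out.append(ch)
    return "".join(out)


print(name_emoji("\U0001f525 sign"))
```

Swapping the return value of `emoji_to_name` for something like `"[" + name + " EMOJI]"` would give the bracketed all-caps style floated earlier in the thread. Either way, the name can still dupe the answer.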

aanker commented 6 months ago

I don’t know if this helps your thought process but I have definitely seen situations where a puzzle has a PDF version with an emoji clue and then a PUZ version with the name of the emoji as the clue (which is usually also the answer to said clue). The first few times I saw it I was completely confused because I had no idea why someone would clue the answer as a clue until I happened to see a PDF version of one of the puzzles and realized it was an emoji conversion issue.

Which is a long-winded way of saying that, short of creating a map of emojis to new clues that you write yourself, I think you’re going to be stuck converting emojis to their canonical names, which will often result in them being clued as their answers. Until we change PUZ, emoji clues are unfortunately a hack that doesn’t translate.