thisisparker / xword-dl

⬛⬜⬛ Command line tool to scrape crosswords from online solvers and save them as .puz files ⬛⬜⬛
MIT License

Support for derstandard.at crossword puzzles #39

Closed by drauch 2 years ago

drauch commented 2 years ago

It's an AmuseLabs-based crossword puzzle; you can see an example here: https://www.derstandard.at/story/2000128976430/kreuzwortraetsel-d-9870

However, your AmuseLabs detector doesn't detect a puzzle, possibly because you have to confirm a "yes, watch page with ads" dialog before seeing the actual puzzle.

Is there a good way to use your tool to download those puzzles? If necessary, and if you know how, I can also provide a pull request if you point me in the right direction. However, I'm not sure whether or how you can accept such a "view with ads" dialog.

Best regards, D.R.

thisisparker commented 2 years ago

The easiest way to do it is to get the URL of the iframe itself (in the case of the example you've given, it's https://cdn-eu1.amuselabs.com/pmm/crossword?id=fdb23f25&set=phoenixen&embed=1) and then pass that to xword-dl directly. Two notes on that: you'll almost definitely need to put the URL in quotes, and you'll need to install the very latest version from this repo, as I wasn't properly sanitizing the puzzle's author or copyright information until bd3e5ec11779aae3dc7985e7f1e72689dea90608. I'm not sure off the top of my head how you'd bypass the modal, but if this works for you then maybe that's enough!

drauch commented 2 years ago

Thanks so far. I installed the latest version and used the IFRAME address as you told me; however, I ran into the following problem:

xword-dl "https://cdn-eu1.amuselabs.com/pmm/crossword?id=5a760e1e&set=phoenixen&embed=1"
Traceback (most recent call last):
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\Scripts\xword-dl.exe\__main__.py", line 7, in <module>
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\xword_dl.py", line 1091, in main
    save_puzzle(puzzle, filename)
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\xword_dl.py", line 119, in save_puzzle
    puzzle.save(filename)
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 225, in save
    puzzle_bytes = self.tobytes()
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 240, in tobytes
    self.global_cksum(), ACROSSDOWN.encode(ENCODING),
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 369, in global_cksum
    cksum = self.text_cksum(cksum)
  File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 353, in text_cksum
    cksum = data_cksum(self.copyright.encode(ENCODING) + b'\0', cksum)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 114: ordinal not in range(256)

It looks like the crossword uses a special dash character (U+2013, the en dash) that cannot be encoded in Latin-1. Is there a reason why we don't use UTF-8/Unicode for everything?

Best regards, D.R.

thisisparker commented 2 years ago

Can you confirm you've installed from this GitHub repository directly? I believe the error you're encountering is the one I mentioned fixing in my last message, but that fix is not yet in the most recent "shipped" version on PyPI.

The .puz format specifies ISO 8859-1 for strings.
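To illustrate why the traceback above ends in a UnicodeEncodeError, here is a minimal sketch using only the Python standard library. The helper name first_unencodable is hypothetical and is not part of xword-dl or puz.py; it only shows that the en dash (U+2013) has no ISO 8859-1 code point, while German umlauts do.

```python
def first_unencodable(text, encoding="latin-1"):
    """Return (index, character) for the first character that cannot be
    encoded with the given codec, or None if the whole string encodes."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        # err.start is the index of the first offending character.
        return err.start, text[err.start]
    return None


# The en dash is outside ISO 8859-1, so it is reported as unencodable.
print(first_unencodable("Kreuzwortr\u00e4tsel \u2013 D 9870"))

# Umlauts are inside ISO 8859-1, so this string encodes without loss.
print(first_unencodable("Kreuzwortr\u00e4tsel"))
```

Any string that returns None here can be stored in a .puz text field as-is; anything else needs some kind of replacement or transliteration before saving.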

drauch commented 2 years ago

Oh, no, I hadn't. I have now, and it works. Thank you very much!

I will combine:

  1. A Selenium web driver project to scrape all the IFRAME links
  2. Your tool to download them and convert them to puz files
  3. A tool to convert puz files to pdf

To obtain the results.

Thank you again for your great project and immediate help!

Best regards, D.R.

drauch commented 2 years ago

Oh, I'm still puzzled by one thing, though. German umlauts (e.g., 'ä' or 'ü') are converted to their non-umlaut counterparts in the puz files, although ISO 8859-1 should contain those characters. Is it possible that there is still an encoding problem somewhere?

Best regards, D.R.

thisisparker commented 2 years ago

Yeah, this is a known upstream bug I will have to think about. The short version is that I use a library which actually targets not 8859-1 exactly but rather ASCII. It degrades most things "correctly" but the German umlaut in particular is a special case, which the library maintainers have opted not to break into ae, ue, or oe because that's not an accepted transliteration in other languages which use the same characters. (I've known this was a bug that could bite German speakers, but you may be my first actual case of it.)

Obviously the ideal solution would be if I could just use Unicode throughout! It's a very frustrating .puz limitation. Given that the conversion from UTF-8 to 8859-1 or whatever is lossy and requires some judgment, it is way preferable for me to use a library, but there are a few cases where the distinction between ASCII and 8859-1 is meaningful and I should accommodate those.

drauch commented 2 years ago

Okay, thanks for the clarification. That's been really helpful to me. I replaced your call to unidecode with a call to my own fixup routine:

def fixup(text):
    # Replace the en dash (U+2013), which has no ISO 8859-1 equivalent, with a hyphen.
    return text.replace("\u2013", "-")

That handles all the non-Latin-1 characters in crosswords from derstandard.at. Of course, this is only helpful to me, so I haven't made a pull request; your solution is the more general and cleaner one.

Thanks again!

Best regards, D.R.

thisisparker commented 2 years ago

Good to know this works at least in place though! I am currently thinking the more general approach will be a new lossy_latin1() function that calls unidecode on a per-character basis, along the lines of:

''.join(c if c <= '\xff' else unidecode(c) for c in unicode_string)

Performance-wise it's much worse, but at a scale that shouldn't matter. I'll update and close this issue when I've landed something like that, and you can probably plan to just get mainline updates from here out and have them work for you.
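The per-character idea above could be sketched as follows. This is not the code that landed in xword-dl. It is a self-contained illustration in which ascii_fallback is a hypothetical standard-library stand-in for unidecode (a small punctuation table plus NFKD decomposition), so the block runs without third-party packages.

```python
import unicodedata

# Hypothetical, deliberately small table of common typographic characters.
# unidecode covers far more; this only keeps the sketch dependency-free.
_PUNCTUATION = {
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}


def ascii_fallback(char):
    """Stand-in transliterator: map one character to an ASCII approximation."""
    if char in _PUNCTUATION:
        return _PUNCTUATION[char]
    # Decompose (e.g. a letter plus combining accent) and drop non-ASCII parts.
    decomposed = unicodedata.normalize("NFKD", char)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def lossy_latin1(text, transliterate=ascii_fallback):
    """Keep characters that fit in ISO 8859-1; transliterate everything else."""
    return "".join(c if c <= "\xff" else transliterate(c) for c in text)


title = "Standardr\u00e4tsel D 9947 \u2013 \u201cPh\u0151nix\u201d"
cleaned = lossy_latin1(title)
print(cleaned)
cleaned.encode("latin-1")  # does not raise: every character now fits
```

If unidecode is installed, passing transliterate=unidecode should give broader coverage than the stand-in. Either way, the umlaut survives because it is never handed to the transliterator.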

drauch commented 2 years ago

Oh, that's a great idea. I also think that, for the amount of data contained in a puz file, it should not matter much performance-wise.

But don't feel obligated: with your nice setup.py script, I can adjust the code myself for my purposes.

Really enjoying your script, thank you!

Best regards, D.R.

thisisparker commented 2 years ago

@drauch would you mind testing with the most recent commit in this repo? I think I've figured out a way to minimally invasively patch unidecode so it ignores latin1 characters. It works on my machine but I'd love a second set of eyes before shipping!

drauch commented 2 years ago

Yep, will do, but it could take until the weekend. I hope you don't mind.

drauch commented 2 years ago

It worked for the contents; however, the title hasn't worked out :-)

thisisparker commented 2 years ago

Thank you for testing this! Could you explain what you mean when you say the title hasn't worked out? (On my end, it scrapes the title successfully and stores it as, e.g., Standardrätsel D 9947.) The script applies exactly the same processing to the title as to the contents, so this is a fun one. Is it possible the title is just being rendered differently by whatever software you use to open .puz files?

drauch commented 2 years ago

Yeah, I think you're right.