Closed drauch closed 2 years ago
The easiest way to do it is to get the URL of the iframe itself (in the case of the example you've given, it's https://cdn-eu1.amuselabs.com/pmm/crossword?id=fdb23f25&set=phoenixen&embed=1) and then pass that to xword-dl directly. Two notes on that: you'll almost definitely need to put the URL in quotes, and you'll need to install the very latest version from this repo, as I wasn't properly sanitizing the puzzle's author or copyright information until bd3e5ec11779aae3dc7985e7f1e72689dea90608. I'm not sure off the top of my head how you'd bypass the modal, but if this works for you then maybe that's enough!
Thanks so far. I installed the latest version and used the IFRAME address as you told me, however, I run into the following problem:
xword-dl "https://cdn-eu1.amuselabs.com/pmm/crossword?id=5a760e1e&set=phoenixen&embed=1"
Traceback (most recent call last):
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\Scripts\xword-dl.exe\__main__.py", line 7, in <module>
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\xword_dl.py", line 1091, in main
save_puzzle(puzzle, filename)
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\xword_dl.py", line 119, in save_puzzle
puzzle.save(filename)
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 225, in save
puzzle_bytes = self.tobytes()
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 240, in tobytes
self.global_cksum(), ACROSSDOWN.encode(ENCODING),
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 369, in global_cksum
cksum = self.text_cksum(cksum)
File "C:\Users\Dominik\AppData\Local\Programs\Python\Python310\lib\site-packages\puz.py", line 353, in text_cksum
cksum = data_cksum(self.copyright.encode(ENCODING) + b'\0', cksum)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 114: ordinal not in range(256)
It looks like the crossword is using some kind of special dash character which cannot be encoded in latin1. Is there a reason why we don't use UTF8/Unicode for everything?
Best regards, D.R.
Can you confirm you've installed from this GitHub repository directly? I believe the error you're encountering is the one I mentioned fixing in my last message, but that fix is not yet in the most recent "shipped" version on PyPI.
The .puz format specifies ISO 8859-1 for strings.
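A minimal sketch of why the traceback above ends in a UnicodeEncodeError: Python's `latin-1` codec covers only code points up to U+00FF, so the en dash (U+2013) in the copyright string cannot be encoded, while German umlauts can. The helper name `fits_latin1` is made up for illustration and is not part of xword-dl or puz.py.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded as ISO 8859-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


print(fits_latin1("Standardrätsel"))    # True: ä is U+00E4, inside Latin-1
print(fits_latin1("2021 \u2013 2022"))  # False: the en dash is U+2013
```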
Oh, no, I hadn't. I have now, and it works. Thank you very much!
I will combine:
To obtain the results.
Thank you again for your great project and immediate help!
Best regards, D.R.
Oh, I'm still puzzled by one thing though. German umlauts (e.g., 'ä' or 'ü') are converted to their non-umlaut counterparts in the puz files, although ISO 8859-1 should contain those characters. Is it possible that there is still an encoding problem somewhere?
Best regards, D.R.
Yeah, this is a known upstream bug I will have to think about. The short version is that I use a library which actually targets not 8859-1 exactly but rather ASCII. It degrades most things "correctly" but the German umlaut in particular is a special case, which the library maintainers have opted not to break into ae, ue, or oe because that's not an accepted transliteration in other languages which use the same characters. (I've known this was a bug that could bite German speakers, but you may be my first actual case of it.)
Obviously the ideal solution would be if I could just use Unicode throughout! It's a very frustrating .puz limitation. Given that the conversion from UTF-8 to 8859-1 or whatever is lossy and requires some judgment, it is way preferable for me to use a library, but there are a few cases where the distinction between ASCII and 8859-1 is meaningful and I should accommodate those.
Okay, thanks for the clarification. That's been really helpful to me. I replaced your call to unidecode with a call to my own fixup routine:
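def fixup(text):
    # Replace the en dash (U+2013), which Latin-1 cannot encode, with a hyphen.
    # The parameter is named text so it does not shadow the built-in str.
    return text.replace("\u2013", "-")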
That handles all the non-Latin-1 characters in crosswords from derstandard.at. Of course this is only helpful to me, so I haven't made a pull request; your solution is the more general and cleaner one.
Thanks again!
Best regards, D.R.
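For a site-specific routine like this, it can help to list which characters in a scraped string actually fall outside ISO 8859-1 before deciding what to replace. The following is only a sketch; `non_latin1_chars` and `describe` are hypothetical helper names, not functions from xword-dl.

```python
import unicodedata


def non_latin1_chars(text: str) -> list[str]:
    """Return the distinct characters in text that ISO 8859-1 cannot encode."""
    return sorted({c for c in text if ord(c) > 0xFF})


def describe(chars: list[str]) -> list[str]:
    """Label each character with its code point and Unicode name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, 'UNKNOWN')}" for c in chars]


sample = "Standardrätsel \u2013 D 9947"
print(describe(non_latin1_chars(sample)))  # ['U+2013 EN DASH']
```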
Good to know this works at least in place though! I am currently thinking the more general approach will be a new lossy_latin1() function that calls unidecode on a per-character basis, along the lines of:
''.join(c if c <= '\xff' else unidecode(c) for c in unicode_string)
Performance-wise it's much worse, but at a scale that shouldn't matter. I'll update and close this issue when I've landed something like that, and you can probably plan to just get mainline updates from here out and have them work for you.
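The per-character idea could look roughly like the sketch below. To keep it runnable with only the standard library, `ascii_fallback` is a hypothetical stand-in for unidecode that handles a handful of punctuation marks and strips combining accents; it is not how unidecode works internally, and it is not the code that eventually landed in the repository.

```python
import unicodedata

# Hypothetical stand-in for unidecode so the sketch needs no third-party
# packages. The real library covers far more characters than this table.
_PUNCTUATION = {
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2026": "...", # horizontal ellipsis
}


def ascii_fallback(char: str) -> str:
    """Return a rough ASCII approximation of one character."""
    if char in _PUNCTUATION:
        return _PUNCTUATION[char]
    # Decompose into base letter plus combining marks, then drop non-ASCII.
    decomposed = unicodedata.normalize("NFKD", char)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def lossy_latin1(text: str, fallback=ascii_fallback) -> str:
    """Keep characters that fit in ISO 8859-1 and transliterate the rest."""
    return "".join(c if c <= "\xff" else fallback(c) for c in text)


title = lossy_latin1("Standardrätsel \u2013 D 9947")
print(title)                   # Standardrätsel - D 9947
print(title.encode("latin-1")) # encodes without raising
```

With the unidecode package installed, passing `fallback=unidecode` (after `from unidecode import unidecode`) would swap the real library in for the stand-in.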
Oh, that's a great idea. I also think that for the amount of data contained in a puz file it should not matter much performance-wise.
But don't feel obligated, with your nice setup.py script I can adjust the code myself for my purposes.
Really enjoying your script, thank you!
Best regards, D.R.
@drauch would you mind testing with the most recent commit in this repo? I think I've figured out a way to minimally invasively patch unidecode so it ignores latin1 characters. It works on my machine but I'd love a second set of eyes before shipping!
Yep, will do, but could take until the weekend, hope you don't mind.
Worked for the contents, however, the title hasn't worked out :-)
Thank you for testing this! Could you explain more what you mean that it hasn't worked out? (On my end, it scrapes the title successfully and stores it as, e.g., Standardrätsel D 9947.) The script applies exactly the same processing to the title as the contents, so this is a fun one. Is it possible the title is just being rendered differently by whatever software you use to open .puz files?
Yeah, I think you're right.
It's an AmuseLabs based crossword puzzle, e.g., you can see one here: https://www.derstandard.at/story/2000128976430/kreuzwortraetsel-d-9870
However, your AmuseLabs detector doesn't detect a puzzle, possibly because you have to confirm a "yes, watch page with ads" dialog before seeing the actual puzzle.
Is there a good way to use your tool to download those puzzles? If necessary, and you know how, I can also provide a pull request if you point me in the right direction. However, I'm not sure whether and how you can accept such a "view with ads" dialog.
Best regards, D.R.