jpd236 / CrosswordScraper

Browser extension which downloads crosswords from crossword applets for offline solving.
Apache License 2.0
28 stars, 1 fork

Enumerations don't convert when scraping #26

Closed boisvert42 closed 1 year ago

boisvert42 commented 1 year ago

See e.g. https://amuselabs.com/pmm/crossword?id=dc1ff54d&set=5e421773feaecbd5fb370299e24ff504d0c65f5bf7294f8157dd130c486d3cca&embed=1

I believe what's happening here is that the puzzle data has a header of "wordLengthsEnabled": true, and each clue has a wordLens key and value, like "wordLens": [1,6], which corresponds on display to (1, 6) at the end of the clue. In theory you could use JPZ's "format" attribute for this, but I think it makes more sense to just tack it on to the end of each clue.
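
A minimal sketch of the "tack it on" approach, in Python for illustration only. The `wordLengthsEnabled` flag and `wordLens` key come from the puzzle data described above; the `text` key and both function names are hypothetical and are not taken from CrosswordScraper or kotwords.

```python
def format_enumeration(word_lens):
    """Render a list of word lengths such as [1, 6] as "(1, 6)"."""
    return "(" + ", ".join(str(n) for n in word_lens) + ")"


def clue_with_enumeration(clue, word_lengths_enabled):
    """Return the clue text, with the enumeration appended when available.

    `clue` is assumed to be a dict with a hypothetical "text" key and an
    optional "wordLens" list; the real puzzle JSON may be shaped differently.
    """
    text = clue["text"]
    word_lens = clue.get("wordLens")
    # Only append when the puzzle opts in and this clue carries lengths.
    if word_lengths_enabled and word_lens:
        return f"{text} {format_enumeration(word_lens)}"
    return text


example = {"text": "Hypothetical clue", "wordLens": [1, 6]}
print(clue_with_enumeration(example, True))
```

Clues without a `wordLens` value, or puzzles where the flag is off, pass through unchanged in this sketch.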

boisvert42 commented 1 year ago

Looking at the source, I guess this should be an issue in the kotwords repo?

jpd236 commented 1 year ago

Thanks for the details! It is a kotwords issue, but it's fine to report it here as well.

I went with the format attribute, as that seems like the most appropriate place for it, and since we should be reading it from JPZ files as well when it exists. The main issue is that some JPZs in the wild set format in strange ways: some files set it for some but not all clues, and others set it to just the total number of squares, even for multi-word answers. We wouldn't want to start propagating those values, even if they're technically in the original source data, if the applet they're embedded in wasn't showing them. So I've done my best to identify any such sources still in the wild and configure them to strip the format even if it's present, while propagating it by default everywhere else.
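
As a rough illustration of that policy (propagate the format by default, strip it for sources known to set it unreliably), here is a hedged Python sketch. It is not the kotwords implementation: the `Clue` class, the `resolve_formats` function, the source identifier, and the `"1,6"` string shape of the format value are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Clue:
    number: str
    text: str
    format: Optional[str] = None  # assumed shape, e.g. "1,6"


# Hypothetical identifiers for sources whose format values should be dropped.
SOURCES_WITH_UNRELIABLE_FORMAT = {"example-unreliable-source"}


def resolve_formats(clues: List[Clue], source_id: str) -> List[Clue]:
    """Keep each clue's format unless the source is configured to strip it."""
    if source_id in SOURCES_WITH_UNRELIABLE_FORMAT:
        # Drop the format even though it is present in the source data.
        return [replace(clue, format=None) for clue in clues]
    # Default: propagate whatever format the source provided.
    return list(clues)


clues = [Clue(number="1", text="Hypothetical clue", format="1,6")]
print(resolve_formats(clues, "example-unreliable-source"))
print(resolve_formats(clues, "some-other-source"))
```

Keeping the strip list as explicit per-source configuration means a newly discovered misbehaving source can be handled without changing the default.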

I also note that http://crosswordnexus.com/solve doesn't respect the format attribute, but it probably should :)