jpd236 / CrosswordScraper

Browser extension which downloads crosswords from crossword applets for offline solving.
Apache License 2.0

Crossword Compiler: successful and non-successful scraping #10

Closed arelkin closed 2 years ago

arelkin commented 2 years ago

Two different sites featuring Crossword Compiler. One successful, one not.

I suspect it could have to do with the version of CCW being used. The successful scrape gathers from CCW with copyright 2021, while the unsuccessful scrape is from CCW with copyright 2015.

Error: https://www.washingtonexaminer.com/crossword-mind-games

Success: https://crosswordsbackwards.com/backward-crossword-puzzle-380/

arelkin commented 2 years ago

Actually, an amendment to this: it may be just that one puzzle that has some sort of error. Another, earlier puzzle from the Washington Examiner was scraped successfully:

https://www.washingtonexaminer.com/crossword-empty-shelves

Here is a list of all puzzles: https://www.washingtonexaminer.com/search-result?q=CROSSWORD

jpd236 commented 2 years ago

Yeah, this specific puzzle does something different with the JPZ that the parser didn't know how to handle:

      <word id="12" x="10-15" y="5" solution="[redacted]">
        <cells x="1-10" y="6"/>
      </word>

We need to support ranges in the cells tag as well. This should hopefully be rare (in years of downloading puzzles from various sources, I must not have come across a puzzle that did this).
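For illustration, here is a minimal sketch of range handling for both the `word` tag and its nested `cells` tags. It is not the extension's actual parser: the helper names `expand_range`, `cells_of`, and `word_cells` are hypothetical, and the sample omits any XML namespace a real JPZ file may declare.

```python
import xml.etree.ElementTree as ET

# Sample taken from the snippet above; no XML namespace is assumed here.
SAMPLE = """
<word id="12" x="10-15" y="5" solution="[redacted]">
  <cells x="1-10" y="6"/>
</word>
"""


def expand_range(value):
    """Expand "7" to [7] and "1-10" to [1, 2, ..., 10] (hypothetical helper)."""
    if "-" in value:
        start, end = (int(part) for part in value.split("-", 1))
        step = 1 if end >= start else -1
        return list(range(start, end + step, step))
    return [int(value)]


def cells_of(element):
    """Return the (x, y) cells described by an element's x and y attributes."""
    xs = expand_range(element.get("x"))
    ys = expand_range(element.get("y"))
    return [(x, y) for y in ys for x in xs]


def word_cells(word):
    """Collect cells from the word tag itself, then from each nested cells tag."""
    cells = []
    if word.get("x") is not None and word.get("y") is not None:
        cells.extend(cells_of(word))
    for child in word.findall("cells"):
        cells.extend(cells_of(child))
    return cells


if __name__ == "__main__":
    for cell in word_cells(ET.fromstring(SAMPLE)):
        print(cell)
```

Applying the same `expand_range` helper to both tags means a range in `cells` goes through the same code path as a range in `word`, so the two cannot drift apart.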

jpd236 commented 2 years ago

Should be fixed in the next release. Thanks for the report!

arelkin commented 2 years ago

Thanks for all your hard work on this plug-in.

arelkin commented 2 years ago

By the way, there are some unusual grids out there that make use of squares/lights that are larger than 1x1. Here are two examples:

https://www.xwordinfo.com/Crossword?date=4/4/2013

https://www.xwordinfo.com/Crossword?date=9/6/2012

jpd236 commented 2 years ago

PUZ/JPZ don't support large squares - and the NYT applet (that we scrape from) doesn't either. Not much we can do there except match what the applet is doing.

arelkin commented 2 years ago

Very true, but the PDF could still be successfully created!

jpd236 commented 2 years ago

I don't see a way to do this automatically, short of just pointing to the PDF provided by the NYT directly instead of generating our own. The problem is that the embedded puzzle data doesn't provide (to my knowledge) any signal that the squares are large, so the scraper has no way of knowing that there's anything special going on. The applet just circles them (or in some cases, does nothing at all) and puts a note that the printed version is different.

arelkin commented 2 years ago

I didn't mean to imply anything about scraping NYT puzzles. I understand they are behind a paywall.

I was only using those two examples because they also had an unusual grid structure, just like what you discovered with the scrape error I initially submitted.

I was also suggesting that, if some grids show a scraping error for PUZ or JPZ, the Scraper could still offer just the PDF option.
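A rough sketch of that fallback idea, assuming each output format has its own converter that raises an exception on failure. Everything here is hypothetical: `available_formats`, `to_puz`, and `to_pdf` are illustrative names, not functions from the extension.

```python
def available_formats(puzzle_data, converters):
    """Try every converter; return (successful outputs, error messages)."""
    outputs, errors = {}, {}
    for name, convert in converters.items():
        try:
            outputs[name] = convert(puzzle_data)
        except Exception as exc:  # one failing format should not block the rest
            errors[name] = str(exc)
    return outputs, errors


# Hypothetical converters, for illustration only.
def to_puz(puzzle_data):
    raise ValueError("unsupported grid structure")


def to_pdf(puzzle_data):
    return b"%PDF-placeholder"


if __name__ == "__main__":
    outputs, errors = available_formats({}, {"puz": to_puz, "pdf": to_pdf})
    print("offer:", sorted(outputs))
    print("failed:", errors)
```

With this shape, only the formats that converted cleanly would be offered for download, and the recorded errors could be shown alongside them.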