thisisparker / xword-dl

⬛⬜⬛ Command line tool to scrape crosswords from online solvers and save them as .puz files ⬛⬜⬛
MIT License

Rebus #118

Closed vcifello closed 7 months ago

vcifello commented 10 months ago

Hi Parker,

Thanks for this really amazing library! I ported @confuzzle into Python and added parsing from JSON and the unimplemented Rebus/GEXT sections. I wanted to use puzpy since it is so well tested, which brought me here.

The Rebus parsing for NYT does not work. Try to open 7/13/23, for example.

The problems are in `class NewYorkTimesDownloader(BaseDownloader)`, in `def parse_xword(self, xword_data)`, lines 130 to 167.

I don't know how to do a pull request (amateur :)), but here is the updated algorithm, which works on multiple test puzzles with rebus and/or circles:

I hope you find this useful.

```python
solution = ''
fill = ''
markup = b''
rebus_board = []
rebus_index = 1
rebus_table = ''
rebus_dict = {}

for idx, square in enumerate(xword_data['body'][0]['cells']):
    if not square:
        # Empty cell: black square
        solution += '.'
        fill += '.'
        rebus_board.append(0)
    elif square and len(square['answer']) == 1:
        # Ordinary single-letter square
        solution += square['answer']
        fill += '-'
        rebus_board.append(0)
    else:
        # Rebus square: the solution grid holds only the first letter
        solution += square['answer'][0]
        fill += '-'
        if square["answer"] not in rebus_dict.keys():
            rebus_table += '{:2d}:{};'.format(rebus_index, square['answer'])
            # Incremented before storing: the board value is the table key plus one
            rebus_index += 1
            rebus_dict[square["answer"]] = rebus_index
            rebus_board.append(rebus_index)
        else:
            rebus_board.append(rebus_dict[square["answer"]])

    # Circle only cells with type == 2
    markup += (b'\x80' if square.get('type') == 2 else b'\x00')

self.solution = solution
self.fill = fill

if any(rebus_board):
    self.extensions[b'GRBS'] = bytes(rebus_board)
    self.extensions[b'RTBL'] = rebus_table.encode(ENCODING_UTF8)
    self._extensions_order.extend([b'GRBS', b'RTBL'])
    #self.rebus()

if b'\x80' in markup:
    self.extensions[b'GEXT'] = markup
    self._extensions_order.append(b'GEXT')
    #self.markup()
```

thisisparker commented 10 months ago

Thank you for the nice words, and I hope I can be helpful here! Can you give me a little more detail on how it's not working for you though? If I run xword-dl nyt -d 7/13/23 right now, the resulting puz file appears to be well-formed with a working rebus. (I tested by opening in https://crosswordnexus.com/solve/, and hitting reveal. I also tested in https://squares.io.) Is it possible there's an implementation issue in your client?

thisisparker commented 10 months ago

Similarly just tested with xword-dl nyt -d 2/12/23 in case there was an issue with circles and rebuses, but that one appears to work just fine as well!

vcifello commented 10 months ago

Fascinating! These puzzles open correctly in https://crosswordnexus.com/solve/

Please try to open them in Acrosslite - disaster.

I have no idea how these binaries are being parsed by crosswordnexus. It really doesn't make sense, since they don't conform to the known puz format.

https://github.com/rjkat/confuzzle/blob/master/puz.md

GEXT should have b'\x80' written for circles which are identified as cell type==2 in nytJson format.

Yet, your code: markup += (b'\x00' if square.get('type') == 1 else b'\x80')

writes b'\x80' for cell types None, 2, and 3 !!! Shockingly, this somehow works in xwordnexus. How utterly bizarre.

Take a look at the Hex editor screenshot below. There are a lot of b'\x80' written in a puzzle WITHOUT circles. Why exactly are they present if there are no circles?

Oddly, xwordnexus doesn't show any circles WTF?!

RTBL: the rebus key somehow reverts to one digit (perhaps when it is run through the internal Rebus class in puz.py?). You can see this in the hex screenshot. The hex editor shows 0:DIE;1DIE;2DOC; 0:DIE; 1:DIE; 2:DOC; (the numbers should be 2 digits, as you know, since you coded: rebus_table += '{:2d}:{};'.format(rebus_index, square['answer']))

Yet, xwordnexus parses this fine.
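
One possible explanation for the lenient behaviour, offered only as an assumption about how clients might work: a parser that splits the table on `;` and `:` and converts each key with `int()` accepts one-character and space-padded two-character keys alike. The sketch below uses hypothetical helper names (`build_rtbl`, `parse_rtbl`) and is not taken from xword-dl, puzpy, or any solver.

```python
def build_rtbl(answers):
    """Build an RTBL-style string with space-padded, two-character keys."""
    return ''.join('{:2d}:{};'.format(key, answer)
                   for key, answer in enumerate(answers, start=1))


def parse_rtbl(table):
    """Parse 'key:answer;' entries, tolerating padded or unpadded keys."""
    entries = {}
    for item in table.split(';'):
        if not item.strip():
            continue  # skip the empty piece after the final ';'
        key, _, answer = item.partition(':')
        entries[int(key)] = answer  # int() ignores surrounding spaces
    return entries


padded = build_rtbl(['DIE', 'DOC'])
print(repr(padded))
print(parse_rtbl(padded) == parse_rtbl('1:DIE;2:DOC;'))
```

A stricter reader that slices fixed two-character keys would treat the two tables differently, which may be one source of the discrepancy between clients.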

You may want to compare your binaries to the binaries created by Crossword Scraper (Chrome extension). https://github.com/jpd236/CrosswordScraper

In any event, my updated algorithm writes exactly the same GRBS RTBL GEXT as crossword scraper and conforms to the above document.

See screenshots below.

[Screenshot 2023-08-27 at 2 21 50 PM] [Screenshot 2023-08-27 at 2 19 01 PM]

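
For comparing the GRBS, RTBL, and GEXT sections of two files without a hex editor, a rough standard-library sketch follows. It assumes the layout described in the reverse-engineered document linked above (a 0x34-byte header with width, height, and clue count at 0x2C to 0x2F, two grids, NUL-terminated strings, then sections of name, length, checksum, payload, NUL) and that the file starts directly at the header with no preamble. The function names are hypothetical, and checksums are not verified.

```python
import struct

HEADER_LEN = 0x34  # fixed header size, per the reverse-engineered layout


def read_extensions(data):
    """Return {section_name: payload} for the extension sections of .puz bytes."""
    width, height, n_clues = struct.unpack_from('<BBH', data, 0x2C)
    pos = HEADER_LEN + 2 * width * height  # skip the solution and fill grids
    # Skip title, author, copyright, every clue, and the notes string.
    for _ in range(3 + n_clues + 1):
        pos = data.index(b'\x00', pos) + 1
    sections = {}
    while pos + 8 <= len(data):
        name = data[pos:pos + 4]
        length, _checksum = struct.unpack_from('<HH', data, pos + 4)
        sections[name] = data[pos + 8:pos + 8 + length]
        pos += 8 + length + 1  # each payload is followed by one NUL byte
    return sections


def differing_sections(first, second):
    """Names of sections whose payloads differ between two .puz byte strings."""
    a, b = read_extensions(first), read_extensions(second)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Reading both files with `open(path, 'rb').read()` and passing the bytes to `differing_sections` would list which sections differ, which narrows down where two writers disagree.
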
thisisparker commented 10 months ago

I don't believe AcrossLite distributes a version for the operating system that I use, so I don't have a copy handy. I do try to conform to the "specification," but given that it's reverse-engineered and copied many times over, I don't think it makes sense to treat it as gospel. For example, crosswordnexus is maintained by one of the people cited in creating the original document, and clearly our understanding of the specification is somewhat aligned, even if it deviates from the reverse engineering doc.

My parser is indeed a little generous with its application of "circled squares" in terms of NYT puzzles. That is because the Times solver allows for other kinds of special markings such as shading; in lieu of .puz format support for shading and without any insight into how it's used in a particular puzzle, I've opted to circle those squares so the solver knows they are "special." Judgment call! You may disagree! I'll allow it's probably a bug that None type squares end up with the circle indication, though in practice as you can see clients tend to allow it. I don't really think it's "utterly bizarre." If you have documentation for what the various square types signify in NYT's json standard, that would be well-received.

It is a little hard to parse the other differences here, because no Pull Request means no automated diff, and no ability to comment on particular lines. I know Github can be tricky to navigate, but when used correctly it does provide some useful affordances for reading and editing code.

I like the Crossword Scraper extension and I appreciate the reverse-engineered spec. If the issue here is that xword-dl does not produce byte-for-byte matching binaries with them, then we should probably rename it. PRs as ever are welcome.

thisisparker commented 10 months ago

I'll note, I'm not super familiar with it, but another client that appears to handle the xword-dl generated puz file for 7/13/23 is Confuzzle.app. Is that the one you're porting?

edsantiago commented 10 months ago

FWIW xword-dl nyt -d 7/13/23 produces a .puz file that works perfectly with AcrossLite on my system (Linux; containerized AcrossLite image using a twenty-or-more-year-old 32-bit binary).

vcifello commented 10 months ago

Thanks for the info Parker! I appreciate your time and expertise. I meant that it is utterly bizarre that b'\x80' in GEXT results in a circle only sometimes. It seems non-deterministic.

I ported the confuzzle read/write libraries, but he did not implement the "sections" (GEXT etc.). I managed to do it. Maybe he implemented it, but only in his app. The read/write files say that he did not.

https://github.com/rjkat/confuzzle/tree/master/%40confuzzle/puz-crossword

"Limitations: The solution in the above clue is scrambled, unscrambling is not yet implemented. Other .puz features such as rebuses and timers have also not yet been implemented."

I felt it was overall better to just extend puzpy and use your json parser.

Your library is really amazing and I have learned a lot from it.

That's interesting, Ed. I wonder if all these differences are occurring because of the v1 vs v2 formats written in the binary (Parker is writing 1.3) and the program parsing it. Your Acrosslite is likely version 1.x.
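
To check which version string a given file carries, a small sketch is shown here. It assumes the reverse-engineered layout, in which a 4-byte NUL-terminated version string sits at offset 0x18 from the start of the header and the `ACROSS&DOWN` magic sits at offset 0x02. The helper name `puz_version` is hypothetical.

```python
MAGIC = b'ACROSS&DOWN\x00'


def puz_version(data):
    """Return the version string (for example '1.3') stored in .puz bytes."""
    # The magic sits 2 bytes into the header; locating it tolerates a preamble.
    header_start = data.index(MAGIC) - 2
    raw = data[header_start + 0x18:header_start + 0x1C]
    return raw.split(b'\x00')[0].decode('ascii')
```

Running this over files from different writers would show whether the version field actually differs between them.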

thisisparker commented 10 months ago

> I meant that it is utterly bizarre that b'\x80' in GEXT results in a circle only sometimes. It seems non-deterministic.

I think the ones you highlighted above correlate to black squares. (You can even spot the distribution in the hex editor shot you posted!) I imagine most clients do not even look at the markup for black squares. As you noted, my parser assigns b'\x80' to squares with no type, which is a bug that I'll fix.
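
A minimal sketch of what such a fix might look like, keeping the "circle anything special" policy while leaving black squares and untyped squares unmarked. This is an illustration, not the change actually made in xword-dl. It assumes that black squares arrive as empty dicts, that type 1 is a plain square, and that other values (2 for circles, per the discussion above, and 3 for some other marking) are "special". The helper name `gext_byte` is hypothetical.

```python
def gext_byte(square):
    """Return the GEXT byte for one NYT cell dict."""
    # An empty dict (black square) or a missing 'type' gets no markup.
    cell_type = square.get('type') if square else None
    if cell_type is None or cell_type == 1:
        return b'\x00'
    # Assumption: every other type is a special marking, shown as a circle.
    return b'\x80'


cells = [{}, {'answer': 'A', 'type': 1}, {'answer': 'B', 'type': 2},
         {'answer': 'C', 'type': 3}, {'answer': 'D'}]
markup = b''.join(gext_byte(cell) for cell in cells)
print(markup)
```

With this mapping, a puzzle with no special squares produces a markup string of all zero bytes, so the `if b'\x80' in markup` check in the algorithm posted earlier would skip writing GEXT entirely.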

> I wonder if all these differences are occurring because of the v1 vs v2 formats written in the binary

This could definitely be a factor. My impression of the v2 format is that it was rolled out pretty haphazardly and is even less documented than v1. It's a shame because it solves many of the issues I have with the puz spec generally, but that lack of documentation has meant that very little tooling for it exists. I don't think I would've tried building xword-dl if not for puzpy, and AFAICT nobody is building puzpy for v2.

vcifello commented 10 months ago

Not sure if this will be of interest, but there is a text format for puz that can be opened in Acrosslite and then saved as a puz file. This is one way of probing the format.

https://www.litsoft.com/across/docs/AcrossTextFormat.pdf

Here are the results for a 5x5 grid with circles and rebus: circles.txt

[Screenshot 2023-08-27 at 5 27 36 PM] [Screenshot 2023-08-27 at 5 28 10 PM] [Screenshot 2023-08-27 at 5 29 07 PM]

It would be nice to have an official spec.

Thanks again, Parker!