thisisparker / xword-dl

⬛⬜⬛ Command line tool to scrape crosswords from online solvers and save them as .puz files ⬛⬜⬛
MIT License

Add downloader for AVCX #202

Open afontenot opened 3 months ago

afontenot commented 3 months ago

Adds a downloader for AVCX (American Values Club Crosswords). These are popular crosswords from a variety of creators; see https://avxwords.com/about-us/.

This is a subscription-only crossword series, and requires authentication. This is handled in exactly the same way as NYT.

This downloader may not seem to serve an obvious purpose, given that AVCX emails subscribers an AcrossLite compatible .puz file for every new release. However, I'm thinking it will be useful for the following features:

afontenot commented 3 months ago

Just noticed there's sort of a JPZ parser already in compilerdownloader.py, but there are subtle differences. The compiler parser doesn't handle clue text nested inside a <span> element, as it is in AVCX, and it also appears to have no handling at all for barred crosswords. It would have to be extended to work for AVCX, but it's certainly a starting point.
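To illustrate the <span> point, here is a minimal sketch of pulling clue text out of an element whether or not the text is wrapped in child markup. The XML fragment and the clue_text helper are hypothetical and heavily simplified (real JPZ files use namespaces and more attributes); this is not the code from compilerdownloader.py or from this PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified JPZ-like fragment; real files are namespaced
# and carry more attributes than shown here.
SAMPLE = """
<clues>
  <clue number="1">Plain clue text</clue>
  <clue number="2"><span>Clue with <i>nested</i> markup</span></clue>
</clues>
"""


def clue_text(clue):
    """Return the clue's text, including text nested in child elements."""
    # clue.text alone is None for clue 2, because the text lives in <span>;
    # itertext() walks the element and all of its descendants.
    return "".join(clue.itertext()).strip()


root = ET.fromstring(SAMPLE)
clues = {c.get("number"): clue_text(c) for c in root.iter("clue")}
print(clues)
```

Reading only the element's own text attribute would miss the second clue entirely, which is the kind of subtle difference described above.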

thisisparker commented 3 months ago

At a glance, this is great! I want to poke around at it some and test it out, but as an AVCX subscriber I would totally use this.

afontenot commented 3 months ago

@thisisparker Question about using puzzle.notes in a downloader: the saved file lacks the newline characters of the original string. Is this something that xword_dl is stripping out (e.g. perhaps treating the notes field as HTML?), or do I need to chase down an issue in the puzpy library?

thisisparker commented 3 months ago

It's likely my cleanup function being a little overzealous. These are \n characters getting stripped? I will take a look and confirm.

afontenot commented 3 months ago

> It's likely my cleanup function being a little overzealous. These are \n characters getting stripped? I will take a look and confirm.

Yep, my AVCX code slaps several bits of metadata into the notes with "\n\n".join(self.descriptions), but there are no \n characters at all in the resulting file.

afontenot commented 3 months ago

I had a look myself: this is an issue with running html2text on the notes. Whitespace is not significant in HTML, so this is correct behavior from the html2text library, but we should probably only be calling it on fields that contain HTML.
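A rough sketch of why the newlines vanish, using a stand-in for HTML whitespace handling rather than html2text itself: when a string is treated as HTML, runs of whitespace (including \n\n) collapse to a single space. The collapse_html_whitespace helper and the sample descriptions are made up for illustration.

```python
import re


def collapse_html_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space, as HTML flow text does."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical metadata, joined the way the AVCX downloader builds its notes.
descriptions = ["By A. Constructor", "Difficulty: 3/5"]
notes = "\n\n".join(descriptions)

print(repr(notes))
print(repr(collapse_html_whitespace(notes)))
```

The paragraph break only exists as whitespace, so nothing survives to mark it once the text is read as HTML.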

Also, what's the intended purpose of using this library? It converts HTML to a Markdown equivalent, but does the AcrossLite PUZ specification support Markdown text representation? Are there specific programs that display it correctly? I tried putting the HTML markup directly in puzzle.notes but the resulting document contained a bunch of Markdown links which made the notes hard to read in Gnome Crosswords.

thisisparker commented 3 months ago

The intention behind html2text is to convert from something that looks "marked up" to something that doesn't, because some clients don't render html and e.g. <em>foreign phrase</em> probably looks worse than the same thing in _s. (In other words, I'm actually just looking for a "plaintext" representation of formatted text, and for formatting elements markdown is pretty good, but it's not great for links as you note.) This is kind of orthogonal to the puz spec itself, which is afaict silent on markup questions, though it's possible the "observed spec" has moved a bit in the direction of HTML if AcrossLite now supports it; I actually don't know whether that's the case.

That's all probably a matter of opinion! Which is why I added the --preserve-html flag, which should skip the invocation of html2text entirely. Again sorry, writing this quickly, but does that happen to do the right thing for you?

afontenot commented 3 months ago

> Again sorry, writing this quickly, but does that happen to do the right thing for you?

Yes, that fixes the issue with removing new lines.

thisisparker commented 3 months ago

> Yes, that fixes the issue with removing new lines.

Alright! Then one option is to pass it at runtime each time; another is to put a preserve_html line in your settings file (under the general section or a specific outlet). I'm not inclined to change this behavior in the short term because I personally use a client that doesn't render the HTML and I prefer the look of unformatted markdown, but I am aware that's probably increasingly idiosyncratic.

afontenot commented 3 months ago

Hmm, well me not liking the look of it is one thing, but it removing any new lines in the notes string is another. That seems like it should be avoided. Should downloaders that have plain text notes replace \n with <br> to get the correct output?

Seems like this ought to affect the Puzzle Society downloader too, although that one is currently disabled.

thisisparker commented 3 months ago

I think that using <br> in these notes instead of \n is the right solution. By default they'll be converted, and if you're saving for a context that will render HTML, you'll be using the preserve flag and you'll still get the newlines.

Semantically it's probably even a touch better to just wrap paragraphs in <p> tags, which should have the same effect after html2text, but which might not be quite as concise as just using '<br><br>'.join(). (I can contrive a scenario where <p> rendering is cleaner than <br>s, but that's fully speculative.)
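For reference, the two options could be sketched roughly like this. The function names and the sample descriptions are hypothetical and not part of xword-dl; the sketch also assumes the note text contains no characters such as < or & that would need escaping.

```python
def notes_with_br(paragraphs):
    """Join paragraphs with <br><br>, turning inner newlines into <br>."""
    return "<br><br>".join(p.replace("\n", "<br>") for p in paragraphs)


def notes_with_p(paragraphs):
    """Wrap each paragraph in <p> tags, turning inner newlines into <br>."""
    return "".join("<p>{}</p>".format(p.replace("\n", "<br>")) for p in paragraphs)


# Hypothetical metadata a downloader might collect for the notes field.
descriptions = ["By A. Constructor", "Edited by B. Editor\nDifficulty: 3/5"]

print(notes_with_br(descriptions))
print(notes_with_p(descriptions))
```

Either string marks the line breaks explicitly in markup, so they no longer depend on whitespace surviving the HTML-to-text step, and they remain as tags when the preserve flag is used.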