Open afontenot opened 3 months ago
Just noticed there's sort of a JPZ parser already in compilerdownloader.py, but there are subtle differences. The compiler parser doesn't handle clue text that sits inside a <span> element, as it does in AVCX, and it also appears to have no handling at all for barred crosswords. It would have to be extended if it were to work for AVCX, but it's certainly a starting point.
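For illustration, here is a minimal sketch of how a parser could tolerate clue text wrapped in a <span>. The XML fragment is a simplified, namespace-free stand-in rather than a real AVCX file, and clue_text is a hypothetical helper, not a function from compilerdownloader.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration; real JPZ files use XML namespaces
# and carry more attributes than shown here.
SAMPLE = """<clues>
  <clue word="1" number="1">Plain clue text</clue>
  <clue word="2" number="2"><span>Clue with <i>nested</i> markup</span></clue>
</clues>"""


def clue_text(clue):
    # itertext() yields the text of the element and all of its descendants
    # in document order, so it works with or without a wrapping <span>.
    return "".join(clue.itertext()).strip()


root = ET.fromstring(SAMPLE)
clues = {c.get("number"): clue_text(c) for c in root.iter("clue")}
```

Reading only the element's .text attribute would return None for the second clue, because all of its text lives inside child elements.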
At a glance, this is great! I want to poke around at it some and test it out, but as an AVCX subscriber I would totally use this.
@thisisparker Question about using puzzle.notes in a downloader: the saved file lacks the newline characters of the original string. Is this something that xword_dl is stripping out (e.g. perhaps treating the notes field as HTML?), or do I need to chase down an issue in the puzpy library?
It's likely my cleanup function being a little overzealous. These are \n characters getting stripped? I will take a look and confirm.
Yep, my AVCX code slaps several bits of metadata into the notes with "\n\n".join(self.descriptions), but there are no \n characters at all in the resulting file.
I had a look myself; this is an issue with using html2text on the notes. Whitespace is not significant in HTML, so this is correct behavior from the html2text library, but we should probably only be calling it on fields that contain HTML.
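One way to picture "only call it on fields that contain HTML" is a small guard in front of the converter. This is a sketch under assumptions: looks_like_html and clean_field are hypothetical names, the regular expression is a rough heuristic that can misfire on plain text containing angle brackets, and html_to_text stands in for whatever converter the project actually calls.

```python
import re

# Rough heuristic: an opening or closing tag, or a character entity.
_HTML_HINT = re.compile(r"</?[a-zA-Z][^>]*>|&[a-zA-Z]+;|&#\d+;")


def looks_like_html(text):
    return bool(_HTML_HINT.search(text))


def clean_field(text, html_to_text):
    """Apply html_to_text only when the field appears to contain markup."""
    if looks_like_html(text):
        return html_to_text(text)
    return text


def collapse_whitespace(text):
    # Stand-in converter that collapses whitespace, mimicking the effect
    # described in this thread. It is not the real html2text.
    return " ".join(text.split())


plain = clean_field("Title\n\nAuthor", collapse_whitespace)
marked = clean_field("<em>Title</em>\n\nAuthor", collapse_whitespace)
```

With this guard, plain-text notes keep their newlines, while fields that contain tags still go through the converter.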
Also, what's the intended purpose of using this library? It converts HTML to a Markdown equivalent, but does the AcrossLite PUZ specification support Markdown text representation? Are there specific programs that display it correctly? I tried putting the HTML markup directly in puzzle.notes, but the resulting document contained a bunch of Markdown links, which made the notes hard to read in Gnome Crosswords.
The intention behind html2text is to convert from something that looks "marked up" to something that doesn't, because some clients don't render HTML and e.g. <em>foreign phrase</em> probably looks worse than the same thing in _s. (In other words, I'm actually just looking for a "plaintext" representation of formatted text, and for formatting elements Markdown is pretty good, but it's not great for links, as you note.) This is kind of orthogonal to the puz spec itself, which is afaict silent on markup questions, though it's possible the "observed spec" has moved a bit in the direction of HTML if AcrossLite now supports it; I actually don't know whether that's the case.
That's all probably a matter of opinion! Which is why I added the --preserve-html flag, which should skip the invocation of html2text entirely. Again sorry, writing this quickly, but does that happen to do the right thing for you?
Yes, that fixes the issue with removing new lines.
Alright! Then one option is to pass it at runtime each time, or another would be to put a preserve_html line in your settings file (under the general section or a specific outlet). I'm not inclined to change this behavior in the short term because I personally use a client that doesn't render the HTML and I prefer the look of unformatted Markdown, but I am aware that's probably increasingly idiosyncratic.
Hmm, well me not liking the look of it is one thing, but it removing any new lines in the notes string is another. That seems like it should be avoided. Should downloaders that have plain text notes replace \n with <br> to get the correct output?
Seems like this ought to affect the Puzzle Society downloader too, although that one is currently disabled.
I think that using <br> in these notes instead of \n is the right solution. By default they'll be converted, and if you're saving for a context that will render HTML, you'll be using the preserve flag and you'll still get the newlines.
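A downloader with plain-text notes could do that conversion in one small step. The helper name plain_notes_to_html is hypothetical, and escaping the text first is an extra assumption on my part (so a literal & or < is not read as markup), not something decided in this thread.

```python
import html


def plain_notes_to_html(text):
    # Escape &, < and > so literal characters are not mistaken for markup,
    # then turn each newline into an explicit HTML line break.
    return html.escape(text, quote=False).replace("\n", "<br>")


notes = plain_notes_to_html("By A & B\n\nEdited by C")
```

Here a blank line between paragraphs becomes two consecutive <br> tags.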
Semantically it's probably even a touch better to just wrap paragraphs in <p> tags, which should have the same effect after html2text, but which might not be quite as concise as just using '<br><br>'.join(). (I can contrive a scenario where <p> rendering is cleaner than <br>s, but that's fully speculative.)
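Side by side, the two options look like this. The descriptions list is made-up sample data standing in for self.descriptions.

```python
# Hypothetical metadata lines standing in for self.descriptions.
descriptions = ["Title: Example", "Author: A. Setter"]

# Option 1: separate paragraphs with explicit line breaks.
notes_br = "<br><br>".join(descriptions)

# Option 2: wrap each paragraph in its own <p> element.
notes_p = "".join(f"<p>{d}</p>" for d in descriptions)
```

The <br> version is a single join call, while the <p> version needs a generator expression but marks each paragraph explicitly.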
Adds a downloader for AVCX (American Values Club Crosswords). These are popular crosswords from a variety of creators, see https://avxwords.com/about-us/.
This is a subscription-only crossword series, and requires authentication. This is handled in exactly the same way as NYT.
This downloader may not seem to serve an obvious purpose, given that AVCX emails subscribers an AcrossLite compatible .puz file for every new release. However, I'm thinking it will be useful for the following features: