miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License

use void tags as per HTML5 spec #145

Closed elandorr closed 2 years ago

elandorr commented 2 years ago

A small suggestion to be compliant with https://html.spec.whatwg.org/multipage/syntax.html#void-elements. Void elements don't need the closing slash.

pbodnar commented 2 years ago

@elandorr, thanks for your PR, but I'm afraid I cannot accept it as-is. The fact is that outputting for example <br /> is already HTML5-compliant, while it also has the practical effect of being XML-compliant. See for example https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br/1946442#1946442 for a quick recap.

So, I think that if we also wanted to support the <br> variant, we should state a reason for it (is it required by some widespread tool, for example?) AND we would have to somehow make this an optional switch in mistletoe (I have no concrete idea for this yet).
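
One way such an optional switch could look, as a rough sketch only: a post-processing step applied to the rendered HTML string, rather than a change inside mistletoe itself. The function name `strip_void_slashes` and the `VOID_ELEMENTS` tuple are hypothetical and not part of mistletoe's API; the element list is assumed to match the void elements named in the HTML spec linked above.

```python
import re

# Void elements as listed in the HTML Living Standard (assumed current list).
VOID_ELEMENTS = (
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
)

# Matches a void start tag that ends in optional whitespace plus "/>".
_VOID_SLASH = re.compile(
    r"<(" + "|".join(VOID_ELEMENTS) + r")\b([^<>]*?)\s*/>",
    re.IGNORECASE,
)


def strip_void_slashes(html: str) -> str:
    """Rewrite e.g. '<br />' to '<br>' for void elements only.

    Hypothetical helper: a regex pass over already-rendered HTML, so it
    assumes attribute values contain no raw '<' or '>' characters.
    """
    return _VOID_SLASH.sub(r"<\1\2>", html)


rendered = '<p>one<br />\ntwo <img src="a.png" alt="x" /></p>\n<hr />'
print(strip_void_slashes(rendered))
```

Keeping this outside the renderer means the default output stays XML-compatible, and only callers who opt in get the no-slash variant.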

elandorr commented 2 years ago

@pbodnar

Thanks for the SO link, but the spec is clear, it's just tolerated:

> if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS character (/). This character has no effect on void elements

Google does it according to spec: https://google.github.io/styleguide/htmlcssguide.html#Document_Type

cmark as well: https://github.com/commonmark/cmark

Not sure how useful XML-compliance as suggested in the SO post is for an explicit HTML renderer. The usual parsers are fine with HTML5. (I have to further edit the output to fix e.g. table aligns as they aren't compliant, and that works fine.)

A switch is always nice; choice.

Don't think any tool cares about it. Real world pages don't follow standards today. It's only a small detail. Not even buzzword bootstrap follows any standard; to use it you have to accept the classic hacks :). Last time someone gave a damn about standards was with xhtml probably, when people proudly put the w3c validator icon up.

Every other part of a current project follows the spec, and having it mixed felt icky. I'm obsessing over details, feel free to ignore. Trying to create something standards-conform that will last til doomsday. Figured I'd attempt to follow as close as possible. (Semantics are still very ill-defined for example, not even the w3c itself follows their standard!)

Have a great remaining Sunday :)

pbodnar commented 2 years ago

@elandorr, thanks for your feedback, maybe you won't like me, but I believe I'm as pragmatic as can be in this area:

> Thanks for the SO link, but the spec is clear, it's just tolerated:
>
> > if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS character (/). This character has no effect on void elements

Well, when something is optional, it doesn't mean it is not widely used or is planned to be deprecated, right? AFAIK, HTML5 was intentionally designed so that well-formed XML documents can be produced with it.

> Google does it according to spec: https://google.github.io/styleguide/htmlcssguide.html#Document_Type

Well, this seems to be a style guide for a large, yet limited group of people working on Google stuff. I would be cautious about taking it as a "reference" for how to write HTML (see this opinion, for example).

> cmark as well: https://github.com/commonmark/cmark

Commonmark clearly states in their tests that <br /> should be output. But maybe I have missed something?

> Not sure how useful XML-compliance as suggested in the SO post is for an explicit HTML renderer. The usual parsers are fine with HTML5. (I have to further edit the output to fix e.g. table aligns as they aren't compliant, and that works fine.)
>
> A switch is always nice; choice.

Sure, but the point is that someone might rely on a tool that further processes the HTML output from mistletoe, feeding it directly to an XSLT processor, for example. If that processor doesn't know anything about HTML "gotchas" like void tags without a slash, we would break their workflow.
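
The concern above can be reproduced with Python's standard library alone. The sketch below is only an illustration: the helper names `is_well_formed_xml` and `start_tags` are made up here, and `xml.etree.ElementTree` merely stands in for a strict XML consumer such as an XSLT processor. It checks whether each variant is accepted by a strict XML parser and which start tags a lenient HTML parser sees.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(fragment: str) -> bool:
    """Return True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


class _TagCollector(HTMLParser):
    """Records the start tags seen by the lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(fragment: str) -> list:
    """Return the start tags the HTML parser reports for the fragment."""
    collector = _TagCollector()
    collector.feed(fragment)
    collector.close()
    return collector.tags


with_slash = "<p>line one<br />line two</p>"
without_slash = "<p>line one<br>line two</p>"

for sample in (with_slash, without_slash):
    print(sample, is_well_formed_xml(sample), start_tags(sample))
```

The HTML parser reports the same tags for both inputs, while the XML parser only accepts the slash variant, which is the "best possible intersection" argument in a nutshell.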

> Don't think any tool cares about it. Real world pages don't follow standards today. It's only a small detail. Not even buzzword bootstrap follows any standard; to use it you have to accept the classic hacks :). Last time someone gave a damn about standards was with xhtml probably, when people proudly put the w3c validator icon up.

Yes, standards come and go, real-world scenarios & making our life simpler by selecting "the best possible intersection" should be our goal. For me, this is the slash variant for void tags (also required by SVG, BTW).

> Every other part of a current project follows the spec, and having it mixed felt icky. I'm obsessing over details, feel free to ignore. Trying to create something standards-conform that will last til doomsday. Figured I'd attempt to follow as close as possible. (Semantics are still very ill-defined for example, not even the w3c itself follows their standard!)

I hope it is clear now that we are following the standards here, and this should not cease to be true in the foreseeable future. OTOH, I'm not strictly against also giving users the choice to use the no-slash variant, provided there will be enough reasoning & interest in that.

> Have a great remaining Sunday :)

Thanks, you too. :)

elandorr commented 2 years ago

> Well, when something is optional, it doesn't mean it is not widely used or is planned to be deprecated, right?

Widely used doesn't mean correct; every big framework still uses the checkbox hacks too. The spec says the slash is entirely superfluous.

> Commonmark clearly states in their tests that <br /> should be output. But maybe I have missed something?

Sorry, that was my bad. It wasn't cmark's renderer but my addition of lxml, which apparently converts to void tags automatically.

https://bugs.launchpad.net/lxml/+bug/1758553 Seems libxml2 is the active component and also respects void tags :P.

> (see this opinion, for example)

Interesting, thank you.
If it were up to me, we'd have worked on getting XHTML production-ready too. There was a lot of talk about accessibility, with buzzwords being thrown around, but never a proper solution. Even today, even this very site is an insult to past efforts. Users who need accessibility features still rely on hacks upon hacks and wade through tons of entropy. But alas, there are no gopher-style semantics anymore, and we now have HTML5 as the only somewhat clear standard. It'd be interesting to find out who profited from the chaos. You'd expect the most trivial of tasks in IT to have been perfected 30 years later. Besides accessibility concerns, there's also the idea of standardized keyboard control for efficiency, which went completely into the void.

Don't quite see why he'd be against omitting defaults. Modern web traffic is already mostly waste and duplication, saving even a few % adds up. XML is very verbose, too. Since corporate abuse forced people into site isolation features and similar, we duplicate the same data over and over for no reason. I can keep at least my work minimal.

> provided there will be enough reasoning & interest in that.

The strongest reason is already present: the slash is useless. I'd ask the other way around: if someone wants to process the output of an HTML renderer, they ought to use an HTML parser, or use a switch to add compatibility that's not meant to be present in the first place. Or render to XML.

> Yes, standards come and go, real-world scenarios & making our life simpler by selecting "the best possible intersection" should be our goal.

But I fully understand that. I've just been obsessing about details on this. Thanks for the nice talk anyway!

Something else I noticed with a lot more 'byte-impact':

cmark does not render align attributes when no colons are present (https://github.github.com/gfm/#tables-extension-); mistletoe does. You get a ton of useless left-aligns. (align attributes are now an error in the validator, by the way; that's why cmark allows CSS instead. I avoid that too and use my own shorter classes.)

pbodnar commented 2 years ago

@elandorr, it looks like you really enjoy discussing stuff, don't you? :)

Frankly, I don't get every thought you present, but I hope it doesn't matter that much. I will respond to just a few points which I find noteworthy:

  1. I hope we agree that the slash variant is useless in some contexts but might be useful in others (which is strongly supported by the majority of highly accepted answers on the previously linked SO page, at minimum).
  2. For me, and probably for you too, there are much bigger and more annoying problems one needs to cope with in the IT world (like the lack of "standardized keyboard control" as you mention).
  3. Regarding rendering superfluous align="left" (from render_table_cell()), you are most probably right, thanks for pointing that out. It would be great to create a separate issue / PR for this. I guess a little impact analysis should be done there, because we should take care of aligning both <td>s and <th>s, which have different default alignments in HTML...
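
As a possible starting point for that impact analysis, here is a minimal sketch of emitting `align` only when the delimiter row asks for it. It is standalone code, not mistletoe's actual `render_table_cell()`; the names `parse_align` and `render_cell` are hypothetical, and the colon rules are assumed to be those of the GFM tables extension linked earlier in the thread.

```python
from typing import Optional


def parse_align(delimiter_cell: str) -> Optional[str]:
    """Map one cell of a GFM delimiter row to an alignment, or None."""
    cell = delimiter_cell.strip()
    left, right = cell.startswith(":"), cell.endswith(":")
    if left and right:
        return "center"
    if right:
        return "right"
    if left:
        return "left"
    return None  # no colons: leave alignment to the browser or stylesheet


def render_cell(inner: str, align: Optional[str], in_header: bool = False) -> str:
    """Render a <th> or <td>, omitting the attribute when align is None."""
    tag = "th" if in_header else "td"
    attr = f' align="{align}"' if align is not None else ""
    return f"<{tag}{attr}>{inner}</{tag}>"


for delim in ("---", ":---", "---:", ":---:"):
    align = parse_align(delim)
    print(render_cell("head", align, in_header=True), render_cell("cell", align))
```

The caveat from point 3 still applies: once the attribute is dropped, header and body cells fall back to their respective default alignments, so the visible result for <th> may differ from today's output.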