HTML spec compliance - Githubissues

jleedev commented 2 years ago

 ~ ¶ prettier --version 
2.7.1
 ~ ¶ echo '<!doctype html><html lang=en><title>hello</title></body>' | prettier --parser html 
[error] stdin: SyntaxError: Unexpected closing tag "body". It may happen when the tag has already been closed by another tag. For more info see https://www.w3.org/TR/html5/syntax.html#closing-elements-that-have-implied-end-tags (1:50)
[error] > 1 | <!doctype html><html lang=en><title>hello</title></body>
[error]     |                                                  ^^^^^^^
[error]   2 |

Unclear why prettier is complaining about an "Unexpected closing tag" and linking to a random page of the HTML Standard. Perhaps use a correct HTML parser such as https://github.com/inikulin/parse5? The https://validator.w3.org/ reports no warnings or errors on this.

sosukesuzuki commented 2 years ago

Prettier uses https://github.com/ikatyang/angular-html-parser for parsing HTML. It throws a syntax error for tags that are not closed. There is no use case where such an unclosed tag would be useful, so I don't see the need to support it.

alexander-akait commented 2 years ago

Yes, it is corrent HTML syntax, but I don't think it is good to use it, also you can fix it by removing </body>, so you will not have an unclosed tag

thorn0 commented 2 years ago

Also, what would the expected behavior be? How is Prettier supposed to format such markup?

jleedev commented 2 years ago

Yes, it is [correct] HTML syntax, but I don't think it is good to use it

If this is the opinion of Prettier, then the correct behavior would be to insert the implied <body> start tag. That is the entire point of why I want to use this software.

Similar to omitting semicolons at the end of statements — it's known to be surprising, e.g.

 ~ ¶ cat a.js 
(() => a);
(() => b)
(() => c)
 ~ ¶ <a.js prettier --parser=babel 
() => a;
(() => b)(() => c);
 ~ ¶

and there's value in the code formatter illustrating to me what the javascript compiler will actually see, whether I have left off a semicolon as a stylistic habit or just a typo.

Optional start and end tags are one of the least confusing parts of the HTML parser.

This one is not even a parse error according to the spec:

 ~ ¶ cat | prettier --parser=html 
<b><p>html
<b><p>parsing
<b><p>algorithm

<b
  ><p>
    html
    <b
      ><p>
        parsing <b><p>algorithm</p></b>
      </p></b
    >
  </p></b
>

but the actual parse is this:

<b><p>html
<b></b></p><p><b>parsing
<b></b></b></p><p><b><b>algorithm
</b></b></p></b>

and I would never write this on purpose, and I would appreciate the computer telling me that I have entered something strange, either by specifically calling out the parse that happened, or simply by showing me how browsers parse this.

The HTML Standard defines the parser in great detail for a reason, and linking to it is an explicit suggestion to the user that you have read it.

alexander-akait commented 2 years ago

For autoclosing we need full featured HTML parser, otherwise it will be tricky, yes, it looks simple for </body>, but HTML is more complicated (especial tables/templates).

jleedev commented 2 years ago

Also, what would the expected behavior be? How is Prettier supposed to format such markup?

> console.log(parse5.serialize(parse5.parse(
  `<!doctype html><html lang=en><title>hello</title></body>`)))
<!DOCTYPE html><html lang="en"><head><title>hello</title></head><body></body></html>

alexander-akait commented 2 years ago

I don't want to say you are not right, it's just that no one will find time for such a global change - change logic in parser (note - vue/svelte/etc as I remmeber throw errors on unclosed tags to simplify HTML logic and improve perf) and implement full featured HTML parser + save compatibility with vue/svelte/etc can be really hard. I recently implemented HTML lexer/parser on rust and fully followed the specification (https://github.com/swc-project/swc/blob/main/crates/swc_html_parser/src/parser/mod.rs), you can see how many logic there.

Another solution is walk through AST tree and recovery broken/unclosed nodes, but it will be bad for perf...

jleedev commented 2 years ago

Could use --parser=html5 to say "use the web compatible parser", if --parser=html is intended to cover a range of backend HTML templating languages that behave differently.

alexander-akait commented 2 years ago

Yeah, but it is a breaking change, because now we don't do it, not sure we can shipped it for v3

thorn0 commented 2 years ago

@jleedev Your points and use case are definitely valid, but if I'm not mistaken, you're the first person on this issue tracker who requests this while the HTML support was added as early as 2018. So this sounds rather like a good idea for a plugin than something to be considered for the core.

Also, what would the expected behavior be? How is Prettier supposed to format such markup?
console.log(parse5.serialize(parse5.parse(
`<!doctype html><html lang=en><title>hello</title></body>`)))
<!DOCTYPE html><html lang="en"><head><title>hello</title></head><body></body></html>

Most of the time a code formatter isn't expected to print tags that were absent in the input (head, tbody, etc). Prettier is a practical tool, it's a "code formatter" in the first place, and only then an "AST printer". Unlike JS, any HTML tag soup can be somehow parsed by a browser. If Prettier worked that way too, that would lead to a messy editing experience. Imagine, you entered the opening tag for a div but not the closing one yet and hit save. What happens next is the rest of the markup is reindented because it's inside this div now. Theoretically, that'd be correct, but practically annoyingly inconvenient. So a trade-off needs to be made, and using the parser from Angular is such a tradeoff. The Angular team faced a similar dilemma themselves, and they decided that what makes sense for browsing doesn't necessarily make sense for authoring. E.g., unclosed tags most of the time are unintentional mistakes, something for the developer to fix, so tools shouldn't try to work them around, an explicit error is much more useful.

<b><p>html
<b><p>parsing
<b><p>algorithm

This is an example of what I'm talking about. It'd be less confusing if Prettier/Angular just refused to parse the code in this case. The current formatting/parsing is useless and looks like a bug to me.

I would appreciate the computer telling me that I have entered something strange, either by specifically calling out the parse that happened, or simply by showing me how browsers parse this

I agree, for educational and debugging purposes, it'd be really cool to be able to explore how browsers see things. But as I described above, in the day-to-day editing experience, this isn't exactly what's expected from a formatter.

Could use --parser=html5 to say "use the web compatible parser"

Yes. And this can be implemented as a plugin.

jleedev commented 2 years ago

Well, while I'm editing javascript code, I can enter just the opening curly brace of a function, and the typescript support in vs code becomes upset with me until I'm done, and I certainly don't expect prettier to handle that with grace either.

I think the difference is that many people input html into prettier which is angular/vue/svelte templates, which are A) small fragments, not entire documents, despite being a standalone file, and B) parsed by a compiler, not a browser. If it were up to me, those files would not be named ".html", but, well.

Similar to how a file named ".js" might be a module script or not depending on context, or other syntax extensions such as typescript and jsx. It makes sense to have to explain to the computer whether my "html" files are fragments for a templating engine, or browser-ready documents.

parse5 apparently can include the source code information in the AST. So while this invocation unnecessarily adds tags I don't care about (specifically 'colgroup'):

> parse5.serialize(parse5.parse('<table><col span=2><tbody><tr><td><td>'))
'<html><head></head><body><table><colgroup><col span="2"></colgroup><tbody><tr><td></td><td></td></tr></tbody></table></body></html>'
> parse5.serialize(parse5.parseFragment('<table><col span=2><tbody><tr><td><td>'))
'<table><colgroup><col span="2"></colgroup><tbody><tr><td></td><td></td></tr></tbody></table>'

when I paste it into https://astexplorer.net/, it includes sourceCodeLocation: null on the tags that didn't exist, so it should in principle be possible to maintain their absence. (Might be a feature request to be able to include just the start tag and not the end tag, but that's way in the weeds.)

thorn0 commented 2 years ago

while I'm editing javascript code, I can enter just the opening curly brace of a function, and the typescript support in vs code becomes upset with me

That's the difference between JS and HTML. Parse5 successfully parses the unclosed div tag. If Prettier used parse5, after pressing save, you'd face the effect I described above.

until I'm done, and I certainly don't expect prettier to handle that with grace either.

But you expect it to handle just the closing </body>, right? I don't see much difference between the two cases.

Are you interested in creating a Prettier plugin that would use parse5 and convert its AST to the Angular format, so that the existing printer could be reused? If so, I might be interested in helping you build it. Meanwhile I'm going to close this issue as out of scope for Prettier core.

jleedev commented 2 years ago

But you expect it to handle just the closing </body>, right? I don't see much difference between the two cases.

Yes, because I think reporting parse errors is a useful thing for a syntax checker to do, and omitting an optional tag isn't a parse error.

This is a parse error:

(Using Python html5lib here since parse5 doesn't seem to be very good at reporting parse errors.)

<!doctype html><div>
<h1>title</h1>
<div class="entry1">
  1
<div class="entry2">
  2
</div>
</div>
      ^
      expected-closing-tag-but-got-eof {}

This is not a parse error:

<!doctype html></body>

There are some fascinating examples in the HTML Standard such as:

<!doctype html><table><b><tr><td>aaa</td></tr>bbb</table>ccc
                         ^                                  ^
                         unexpected-start-tag-implies-table-voodoo {'name': 'b'}
                                                            expected-closing-tag-but-got-eof {}

and

<!doctype html><a><table><a>
                            ^
                            unexpected-start-tag-implies-table-voodoo {'name': 'a'}
                            unexpected-start-tag-implies-end-tag {'startName': 'a', 'endName': 'a'}
                            unexpected-end-tag {'name': 'a'}
                            eof-in-table {}

The angular parser happily accepts this last one, but again completely incorrectly.

Another, which isn't a parse error but prettier mishandles, introducing parse errors where I had none:

¶ printf '<ul><li>bbb<p>ccc<li>ddd</ul>' | prettier --parser=html 
<ul>
  <li>
    bbb
    <p>
      ccc
      <li>ddd</li>
    </p>
  </li>
</ul>
¶ python3 ./hh.py 
## input
<ul><li>bbb<p>ccc<li>ddd</ul>
## with optional tags
<ul><li>bbb<p>ccc</p></li><li>ddd</li></ul>

And I would write this on purpose too, since the end tags for p and li are optional, and it's pretty intuitive what should happen here.

Anyway, I've made my point. I think the tags for html, head, and body being optional is one of the least weird things in the standard, and there are subtler cases that can easily pass by a naive parser that treats HTML as "somewhat similar to XML with implied end tags upon EOF". It seems that a dedicated HTML5 mode/plugin distinct from the existing HTML mode makes sense. I don't have bandwidth to commit to writing it now.

thorn0 commented 2 years ago

Yes, because I think reporting parse errors is a useful thing for a syntax checker to do, and omitting an optional tag isn't a parse error.

I see your point. But it looks like there is no parser written in JS that does what you want (?). And Prettier isn't really a syntax checker.

Also I had no idea Angular is that bad at parsing p tags. And I'm sorry, but it turns out I kind of lied to you about that example with an unclosed div. Prettier parses it. Probably, we really should consider choosing another parser for the core (or report all this stuff to Angular's issue tracker).

fisker commented 2 years ago

We use parse5 before. #5127

fisker commented 2 years ago

Funny.. @thorn0 you suggest that https://github.com/prettier/prettier/issues/5098#issuecomment-423055181

thorn0 commented 2 years ago

Yes, the requirements are conflicting, but still, Angular could be better at parsing standard tags

prettier / prettier

HTML spec compliance #13479