[HTML] [RFC] Adopt the HTML5 parsing algorithm as the standard for highlighting.

Thom1729 commented 5 years ago

History

The HTML language has had a long, torturous history. It was originally defined using SGML, but in practice browser vendors never implemented the spec fully or correctly. Messy "tag soup" abounded. Browsers did their best to handle flagrantly invalid code, often in incompatible ways, even while adding their own nonstandard extensions.

The default HTML syntax is designed to handle this sort of nonsense. It isn't designed from the ground up to parse HTML "correctly", because there was no such thing; rather, it, like browsers, evolved piece by piece to handle more and more cases and quirks. Had the syntax definition actually implemented the language as defined by the W3C specs, it would have failed hopelessly on real code that Sublime users regularly encounter.

The HTML5 spec redefined the language. HTML5 is no longer SGML or XML, but its own self-contained format. When defining the syntax, the HTML5 authors prioritized compatibility. The spec is designed to reflect not some theoretical ideal, but rather the way that popular browsers would interpret real code. Importantly, the authors also published an explicit parsing algorithm that unambiguously specifies how to interpret any stream of bytes as HTML5, no matter how awful. Modern browsers really do use this algorithm.

Proposal

I propose that we adopt the HTML5 parsing algorithm as our reference for the HTML syntax definition.

We've already been using it in practice for a number of PRs. In fact, this proposal is arguably both redundant and several years late. But given the importance of the default HTML syntax definition (and the large potential impact of any changes) I think it's worthwhile to have the discussion explicitly and to hopefully reach an informal consensus that will inform future work on the syntax.

In particular, I'm hoping that there may be general agreement on the following statements:

1. Valid HTML5 code should be parsed as described in the (strict) syntax spec.
2. All HTML code should generally be parsed as described in the HTML5 parsing algorithm.
3. When the HTML5 parsing algorithm disagrees with the default HTML syntax, this should generally be considered a bug in the default syntax.

What this proposal is not

I'm not suggesting any radical rewrite of the default HTML syntax. There's no need for anything like that. Whatever changes should be made can probably be done in small PRs.

I'm not suggesting that we change the scopes we use. That's a separate conversation, and probably a non-starter.

I'm not suggesting that we should never deviate from the parsing algorithm. As with any other language, we may need to in some cases for practical or usability reasons. Some examples off the top of my head of features of the parsing algorithm that we probably wouldn't implement:

- The encoding detection algorithm.
- The list of all 2,231 named character references.
- The stack of open elements, the tree construction algorithm, the adoption agency algorithm, and so on.

Questions, comments, complaints, suggestions?

Thom1729 commented 5 years ago

@deathaxe

deathaxe commented 5 years ago

whatwg.org is what I have used, and would continue to use, as the reference for any changes, since it is the latest available reference with precise rules for how to handle things, even though I find some statements a bit confusing or even contradictory so far.

The syntax should focus on the latest HTML5 developments, but without introducing incompatibilities with older sites. For instance, there may be older pages using doctypes other than the allowed <!DOCTYPE html>, and those should not cause issues.

Do you have specific bugs, enhancements, or features in mind?

Thom1729 commented 5 years ago

The HTML5 parsing algorithm is designed to handle horrifying old code as well as valid new code. (For example, the parsing algorithm contains painfully detailed instructions for parsing full old-style doctypes.) The intent is that a web browser can use this algorithm for every HTML page regardless of version and it will provide an interpretation that is reasonable and conforms to common practice. I think that this is a reasonable description of what a syntax highlighter should do as well.

The HTML5 syntax spec, on the other hand, describes valid HTML5 code and does not explain in detail how to deal with code that is not valid HTML5. This is why I think that the parsing algorithm is most appropriate for our purposes. (It also helps that it's defined not by a formal grammar but by a state machine that would map beautifully onto a sublime syntax.)
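To illustrate the state-machine point, here is a rough Python sketch of a few tokenizer states, loosely modeled on the spec's data, tag open, end tag open, and tag name states. It is a deliberately tiny subset and not the real algorithm: it assumes tags without attributes, ignores comments, doctypes, and character references, and the fallback in the end-tag-open state is a simplification (the spec switches to a bogus comment state there). The function name `tokenize` and the token tuples are made up for this example.

```python
def tokenize(source):
    """Tokenize a tiny HTML subset: character data plus attribute-free tags."""
    tokens = []
    state = "data"
    text = []        # pending character data
    name = []        # tag name currently being built
    is_end = False   # True while building an end tag

    def flush_text():
        if text:
            tokens.append(("text", "".join(text)))
            text.clear()

    i = 0
    while i < len(source):
        ch = source[i]
        if state == "data":
            if ch == "<":
                state = "tag_open"
            else:
                text.append(ch)
            i += 1
        elif state == "tag_open":
            if ch == "/":
                state = "end_tag_open"
                i += 1
            elif ch.isascii() and ch.isalpha():
                is_end = False
                name = []
                state = "tag_name"      # reconsume ch in the new state
            else:
                text.append("<")        # a stray "<" is just text
                state = "data"          # reconsume ch in the data state
        elif state == "end_tag_open":
            if ch.isascii() and ch.isalpha():
                is_end = True
                name = []
                state = "tag_name"      # reconsume
            else:
                # Simplification: the spec enters a bogus comment state here.
                text.append("</")
                state = "data"          # reconsume
        elif state == "tag_name":
            if ch == ">":
                flush_text()
                kind = "end_tag" if is_end else "start_tag"
                tokens.append((kind, "".join(name)))
                state = "data"
            else:
                name.append(ch.lower())  # tag names are lowercased
            i += 1

    # End of input: emit any pending "<" or "</" as text; an unfinished tag is dropped.
    if state == "tag_open":
        text.append("<")
    elif state == "end_tag_open":
        text.append("</")
    flush_text()
    return tokens
```

Each branch of the loop corresponds to one state, and each "reconsume" is a transition that does not advance the input. That shape is roughly what I mean by mapping onto a sublime syntax: states would become contexts and transitions would become push, pop, or set operations.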

I'm not really thinking of specific examples. The points of deviation between the parsing algorithm and the syntax definition are likely to be obscure and apparently unmotivated, where the only reason to prefer one behavior over another is the parsing algorithm. When such deviations have come to our attention, we've generally patched them according to either the HTML language spec or the parsing algorithm. An explicit consensus on the parsing algorithm would facilitate more proactive bug-finding: comparing the algorithm specification to the syntax definition to tease out deviations that would be bugs by definition. If a deviation from the algorithm is ipso facto a bug, this removes potential subjectivity.

deathaxe commented 5 years ago

I think I got your point. Fully agree with you. The parsing algorithm provides the most general and robust set of rules for HTML highlighting to refer to when making changes to HTML.sublime-syntax.

keith-hall commented 5 years ago

Sounds sensible to me 👍

sublimehq / Packages