Add support for parsing `<!DOCTYPE html>`

rwjblue commented 5 years ago

The spec says this about `<!DOCTYPE`:

> DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.

This fixes an issue where the tokenizer would end up in an incorrect state when it encountered a `<!DOCTYPE` declaration (e.g. https://github.com/ember-template-lint/ember-template-lint/issues/719).

- Addresses https://github.com/ember-template-lint/ember-template-lint/issues/719
- Addresses https://github.com/stefanpenner/find-scripts-srcs-in-document/issues/1

The specific breaking changes here are that the delegate now must have the following new methods:

  beginDoctype(): void;
  appendToDoctypeName(char: string): void;
  appendToDoctypePublicIdentifier(char: string): void;
  appendToDoctypeSystemIdentifier(char: string): void;
  endDoctype(): void;
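
For illustration, a minimal sketch of a delegate that implements these five methods and collects each doctype into a token. Only the method signatures come from the list above; the `DoctypeDelegate` interface name, the `DoctypeToken` shape, and the `DoctypeCollector` class are hypothetical, and the calls are driven by hand here rather than by the real tokenizer:

```typescript
// Hypothetical delegate shape: only these five method signatures come from the PR description.
interface DoctypeDelegate {
  beginDoctype(): void;
  appendToDoctypeName(char: string): void;
  appendToDoctypePublicIdentifier(char: string): void;
  appendToDoctypeSystemIdentifier(char: string): void;
  endDoctype(): void;
}

// Hypothetical token shape, used only for this illustration.
interface DoctypeToken {
  name: string;
  publicIdentifier?: string;
  systemIdentifier?: string;
}

class DoctypeCollector implements DoctypeDelegate {
  tokens: DoctypeToken[] = [];
  private current: DoctypeToken | null = null;

  beginDoctype(): void {
    // Start a fresh token; identifiers stay undefined until characters arrive.
    this.current = { name: "" };
  }

  appendToDoctypeName(char: string): void {
    if (this.current) this.current.name += char;
  }

  appendToDoctypePublicIdentifier(char: string): void {
    if (this.current) {
      this.current.publicIdentifier = (this.current.publicIdentifier || "") + char;
    }
  }

  appendToDoctypeSystemIdentifier(char: string): void {
    if (this.current) {
      this.current.systemIdentifier = (this.current.systemIdentifier || "") + char;
    }
  }

  endDoctype(): void {
    // Finish the token and store it.
    if (this.current) {
      this.tokens.push(this.current);
      this.current = null;
    }
  }
}

// Simulate the calls a tokenizer might make for `<!DOCTYPE html>`.
// Assumption: the name arrives one character at a time, as the `char` parameter suggests.
const collector = new DoctypeCollector();
const doctypeName = "html";
collector.beginDoctype();
for (let i = 0; i < doctypeName.length; i++) {
  collector.appendToDoctypeName(doctypeName.charAt(i));
}
collector.endDoctype();
```

After these calls, `collector.tokens` holds a single token whose `name` is `"html"` and whose identifiers are left unset.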

Closes https://github.com/tildeio/simple-html-tokenizer/issues/28.

Turbo87 commented 4 years ago

> The specific breaking changes here are that the delegate now must have the following new methods

can we make this non-breaking by calling those methods only if they exist?
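
A rough sketch of what that suggestion could look like, assuming the new methods were declared optional on the delegate type. The `OptionalDoctypeDelegate` interface and the `emitDoctypeName` helper are hypothetical and not part of the library, and only three of the five methods are shown for brevity:

```typescript
// Hypothetical: the doctype hooks are optional, so existing delegates keep compiling.
interface OptionalDoctypeDelegate {
  beginDoctype?(): void;
  appendToDoctypeName?(char: string): void;
  endDoctype?(): void;
}

// Hypothetical helper standing in for the tokenizer's doctype handling:
// each hook is called only if the delegate actually defines it.
function emitDoctypeName(delegate: OptionalDoctypeDelegate, name: string): void {
  if (delegate.beginDoctype) delegate.beginDoctype();
  for (let i = 0; i < name.length; i++) {
    if (delegate.appendToDoctypeName) delegate.appendToDoctypeName(name.charAt(i));
  }
  if (delegate.endDoctype) delegate.endDoctype();
}

// A delegate written before the change: no doctype methods, and no error.
const legacyDelegate: OptionalDoctypeDelegate = {};
emitDoctypeName(legacyDelegate, "html");

// A delegate that opts in to the new hooks and records what it sees.
const seen: string[] = [];
const updatedDelegate: OptionalDoctypeDelegate = {
  beginDoctype(): void { seen.push("begin"); },
  appendToDoctypeName(char: string): void { seen.push(char); },
  endDoctype(): void { seen.push("end"); },
};
emitDoctypeName(updatedDelegate, "html");
```

The trade-off is an existence check on every call, in exchange for older delegates continuing to work unchanged.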

wycats commented 4 years ago

For historical context: the original theory of this library was that it was basically for body templates, and therefore I didn't implement the states for doctype, script, etc. This was in the interest of keeping the library reasonably small: the states I left out are something like half of all tokenizer states!

I have no problem with working on adding in those states now, especially since the main use-case for this library ends up being preprocessing, which happens in contexts where size doesn't matter so much.

krisselden commented 4 years ago

@rwjblue parse5 is likely a better fit for Embroider's use case /cc @ef4

ef4 commented 4 years ago

Agreed. We need a complete parser and serializer.

rwjblue commented 3 years ago

Apologies for not leaving a comment above when I reopened / merged this. I would like to move this forward (and begin slowly expanding the scope of this library) because I believe that the path forward in SSR is to have the template own the full HTML (instead of having the template's rendered output spliced into an HTML content string). Doing this fixes some things that are quite annoying today (e.g. rendering custom `<head>` content from an Ember / Glimmer.js app).

I will still try to investigate migrating @glimmer/syntax to use parse5 instead of simple-html-tokenizer, though; I'll open another issue on glimmerjs/glimmer-vm for that.

wycats commented 3 years ago

@rwjblue We definitely need to talk about this before you make any further steps in that direction, but I'm not intrinsically opposed to the approach you have in mind.

rwjblue commented 3 years ago

@wycats yep, I was mostly just going to see if it were possible (seems like it should be)

wycats commented 3 years ago

@rwjblue My main concerns would be:

- our extensions to valid HTML (tag names that start with `@` or `:`)
- the separation of the lexer and parser, as well as "partial lex mode", which allow us to "splice in" `{{...}}` tokens in places where they would be illegal (or lex incorrectly)
  - this allows us to support `<a href={{some helper "inner string"}}>`, which is very difficult in traditional HTML parsers
- our desire to flag some amount of invalid HTML (most "real-world" parsers fully embrace the error-correcting mode) that is consistent with our extensions (`@` and `:` tags, `@` attributes, and curlies in many positions that would be invalid in HTML, especially when they contain nested strings)
- our ability to directly control the lexer codebase to give correct source locations in error cases (it's not perfect right now, but our control over the codebase has already been useful and would allow us to continue to fix bugs over time)
- the size of the codebase for hypothetical future in-browser parsing scenarios (HTML5 parsers tend to be big)

tildeio / simple-html-tokenizer

Add support for parsing `<!DOCTYPE html>` #71