tildeio / simple-html-tokenizer

A lightweight JavaScript library for tokenizing non-`<script>` HTML expected to be found in the `<body>` of a document
MIT License
84 stars 33 forks source link

Add support for parsing `<!DOCTYPE html>` #71

Closed rwjblue closed 3 years ago

rwjblue commented 5 years ago

The spec says this about <!DOCTYPE:

DOCTYPEs are required for legacy reasons. When omitted, browsers tend to use a different rendering mode that is incompatible with some specifications. Including the DOCTYPE in a document ensures that the browser makes a best-effort attempt at following the relevant specifications.

This fixes an issue where we would end up in an incorrect state when the <!DOCTYPE declaration was found (e.g. https://github.com/ember-template-lint/ember-template-lint/issues/719).

Addresses https://github.com/ember-template-lint/ember-template-lint/issues/719 Addresses https://github.com/stefanpenner/find-scripts-srcs-in-document/issues/1


The specific breaking changes here are that the delegate now must have the following new methods:

  beginDoctype(): void;
  appendToDoctypeName(char: string): void;
  appendToDoctypePublicIdentifier(char: string): void;
  appendToDoctypeSystemIdentifier(char: string): void;
  endDoctype(): void;

Closes https://github.com/tildeio/simple-html-tokenizer/issues/28.

Turbo87 commented 4 years ago

The specific breaking changes here are that the delegate now must have the following new methods

can we make this non-breaking by calling those methods only if they exist?

wycats commented 4 years ago

for the benefit of historical info: the original theory of this library is that it was basically for body templates, and therefore I didn't implement the states for doctype/script, etc. This was in the interest of keeping the library reasonably small: the states I left out are something like half of all tokenizer states!

I have no problem with working on adding in those states now, especially since the main use-case for this library ends up being preprocessing, which happens in contexts where size doesn't matter so much.

krisselden commented 4 years ago

@rwjblue parse5 likely is a better fit for embroiders use case /cc @ef4

ef4 commented 4 years ago

Agreed. We need a complete parser and serializer.

rwjblue commented 3 years ago

Apologies for not leaving a comment above when I reopened / merged this. I would like to move this forward (and begin expanding the scope of this library slowly) because I believe that the path forward in SSR is to have the template own the full HTML (instead of having the template rendered output spliced into an HTML content string). Doing this fixes some things that are quite annoying today (e.g. rendering custom <head> content from an Ember / Glimmer.js app).

I will try to investigate migrating @glimmer/syntax to leveraging parse5 instead of simple-html-tokenizer though, I'll open another issue on glimmerjs/glimmer-vm for that.

wycats commented 3 years ago

@rwjblue We definitely need to talk about this before you make any further steps in that direction, but I'm not intrinsically opposed to the approach you have in mind.

rwjblue commented 3 years ago

@wycats yep, I was mostly just going to see if it were possible (seems like it should be)

wycats commented 3 years ago

@rwjblue My main concerns would be: