Tokenizer: Implement character references

Implements spec-compliant errors for character references, but otherwise does not process character references. The character references themselves are emitted as part of their containing text/tag/attr token.

Because character references are not converted into their mapped codepoint(s) (e.g. &not; -> ¬), we only need to store a trie of named character references, without a mapping to the relevant codepoints. For this, a DAFSA was generated using a modified version of https://github.com/squeek502/named-character-references

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size.
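To make the "trie with redundant nodes merged" idea concrete, here is a minimal sketch in Python. It is not the Zig code from this PR: the node layout (dicts and tuples), the function names, and the four-entry word list are all made up for illustration, and the real DAFSA uses a packed array of nodes rather than dictionaries.

```python
def build_trie(words):
    """Plain trie: each node is {"end": bool, "next": {char: node}}."""
    root = {"end": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"end": False, "next": {}})
        node["end"] = True
    return root


def count_trie_nodes(node):
    return 1 + sum(count_trie_nodes(child) for child in node["next"].values())


def minimize(node, registry):
    """Return a canonical state id; identical subtrees get the same id."""
    edges = tuple(sorted(
        (ch, minimize(child, registry)) for ch, child in node["next"].items()
    ))
    signature = (node["end"], edges)
    if signature not in registry:
        registry[signature] = len(registry)
    return registry[signature]


def build_dafsa(words):
    """Flatten the minimized automaton into a list indexed by state id."""
    registry = {}
    root_id = minimize(build_trie(words), registry)
    states = [None] * len(registry)
    for (end, edges), state_id in registry.items():
        states[state_id] = (end, dict(edges))
    return states, root_id


def contains(states, root_id, word):
    """Walk the flattened array one character at a time."""
    state = root_id
    for ch in word:
        _, edges = states[state]
        if ch not in edges:
            return False
        state = edges[ch]
    return states[state][0]


WORDS = ["lt;", "gt;", "le;", "ge;"]  # hypothetical subset, not the real list
states, root_id = build_dafsa(WORDS)
print(count_trie_nodes(build_trie(WORDS)), "trie nodes vs", len(states), "DAFSA states")
```

With this word list the plain trie has 11 nodes while the minimized automaton has 4 states, because the shared suffixes ("t;", "e;", and the final ";") are stored only once.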

Some resources:

The DAFSA here needs 3872 nodes encoded as packed struct(u22)s, which, due to alignment, ends up as 15488 bytes (15.1KiB). Using a PackedIntArray can reduce the number of bytes needed, but it reduced performance in my testing (using the named-char-test.html test file from https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca I get +28.1% ± 0.3% when tokenizing it). Note also that using a regular struct instead of a packed struct increases @sizeOf(Node) to 6 bytes, and using packed struct(u32) makes ~no difference compared to packed struct(u22).
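The size figures can be checked with some quick arithmetic. This is only a back-of-the-envelope sketch in Python: the helper names are hypothetical, and the bit-packed number is a lower bound computed from 22 bits per node, not a measurement of what PackedIntArray actually allocates.

```python
NODE_COUNT = 3872    # node count quoted above
BITS_PER_NODE = 22   # packed struct(u22)


def padded_size(node_count, bytes_per_node):
    """Total bytes when every node occupies a whole number of bytes."""
    return node_count * bytes_per_node


def bit_packed_size(node_count, bits_per_node):
    """Total bytes when nodes are stored back to back with no padding."""
    return (node_count * bits_per_node + 7) // 8


aligned = padded_size(NODE_COUNT, 4)                    # u22 stored in 4 bytes
regular = padded_size(NODE_COUNT, 6)                    # regular struct, 6 bytes per node
packed = bit_packed_size(NODE_COUNT, BITS_PER_NODE)     # assumed tight packing

print(aligned, regular, packed)  # 15488 23232 10648
```

So tight bit packing would save at most about 4.7KiB over the aligned layout, which is the trade-off being weighed against the slower lookups.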

Some examples of similar DAFSA PRs I've made in the past if you're curious about how the DAFSA compares to other approaches:

Here's what the errors look like when using Sublime Text:

[Image: char-ref-errors-inline]

And here's proof that the example from here is handled correctly:

[Image: char-ref-errors-spec]

Note: It's worth merging https://github.com/kristoff-it/super-html/pull/10 before testing this branch, since it's pretty easy to run into the root node bug that's fixed in that PR.

Thank you squeek!!!!! This PR is amazing.

I see that there is a difference between a bad character reference in an attribute value vs outside, according to the spec.
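For reference, here is a rough Python sketch of that difference as I read the spec's named character reference state. It is not the tokenizer from this PR: the name set is a tiny hypothetical subset, the function names are invented, and the "no-match" branch skips the spec's ambiguous ampersand handling entirely.

```python
import string

# Hypothetical subset of the named character reference list.
NAMES = {"amp", "amp;", "lt", "lt;", "not", "not;", "notin;"}
ALNUM = string.ascii_letters + string.digits


def longest_match(text):
    """text starts right after '&'; return the longest matching name or None."""
    best = None
    for end in range(1, len(text) + 1):
        if text[:end] in NAMES:
            best = text[:end]
    return best


def classify(text_after_amp, in_attribute):
    name = longest_match(text_after_amp)
    if name is None:
        return "no-match"  # simplified: the spec continues in another state
    if name.endswith(";"):
        return "ok"
    next_char = text_after_amp[len(name):len(name) + 1]
    # In attribute values, a semicolon-less match followed by '=' or an
    # ASCII alphanumeric is left as plain text for historical reasons.
    if in_attribute and next_char != "" and (next_char == "=" or next_char in ALNUM):
        return "left-as-text"
    return "missing-semicolon-after-character-reference"


print(classify("notit;", in_attribute=False))
print(classify("notit;", in_attribute=True))
```

The first call reports missing-semicolon-after-character-reference and the second reports left-as-text, which is the case where being stricter than the spec would turn silently accepted text into an error.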

Since this parser is designed primarily for the use case of supporting human-written HTML, I've diverged from the spec on some occasions when strict adherence would prevent me from detecting a probable human error. As an example, respecting implicitly closed tags would prevent Super from reporting <h1> foo <h1> as an error.

With that in mind, do you think it might make sense to be more strict than the spec wrt bad character references in attributes?

My understanding is that if your intent is to actually write '&notit;' in an attribute, you can always come up with an actually correct encoding, like '&amp;notit;' (I think). That said, people might have different expectations and it would forbid them from using a "shorthand encoding".

What do you think?

kristoff-it / superhtml

Tokenizer: Implement character references #11