rehypejs / rehype-minify

plugins to minify HTML
https://unifiedjs.com
MIT License

The Meta-charset attribute must always equal "utf-8" (case-insensitive) #46

Closed binyamin closed 2 years ago

binyamin commented 2 years ago

Affected packages and versions

rehype-minify-enumerated-attribute@4.1.0

Link to runnable example

No response

Steps to reproduce

Run the following code, modifying it for your environment as necessary:

import { unified } from 'https://esm.sh/unified@10.1.2';
import rehypeParse from 'https://esm.sh/rehype-parse@8.0.4';
import rehypeStringify from 'https://esm.sh/rehype-stringify@9.0.3';
import rehypeMinifyEnumAttributes from 'https://esm.sh/rehype-minify-enumerated-attribute@4.1.0';

const file = await unified()
    .use(rehypeParse)
    .use(rehypeMinifyEnumAttributes)
    .use(rehypeStringify)
    .process('<meta charset="utf-8" />');
console.log(file.toString()); // output is `...<meta charset="utf8">...`

Expected behavior

According to the WHATWG HTML Living Standard, the value of the charset attribute must be an ASCII case-insensitive match for the string "utf-8". The W3C HTML validator accordingly reports an error for "utf8".

Therefore, the expected behavior is that, when the charset attribute exists on a meta element, its value is either coerced to a valid value, or at least not coerced to "utf8" (no hyphen).

Actual behavior

"utf-8" becomes "utf8", probably because that is the shortest string defined in schema.js.

Runtime

Deno

Package manager

No response

OS

Linux

Build and bundle tools

No response

wooorm commented 2 years ago

Hi Binyamin! You’re right that the spec says that. However, that section is more about how people should write HTML (the spec is sometimes a bit unclear about this). Browsers must support several different labels for the same encoding: https://encoding.spec.whatwg.org/#names-and-labels. utf8 is the shortest one in that group. See here for how those groups (and more) are supported:

https://github.com/rehypejs/rehype-minify/blob/1dc9280c341087a40dfaa332792c095f96d41686/packages/html-enumerated-attributes/index.js#L69
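To illustrate the idea of picking the shortest label from a group of equivalents, here is a minimal sketch. It is not the code from the linked file: the helper names `shortestLabel` and `minifyCharset` are hypothetical, and the list contains only the UTF-8 labels from the Encoding Standard.

```javascript
// Labels that the Encoding Standard maps to the UTF-8 encoding.
const utf8Labels = [
  'unicode-1-1-utf-8',
  'unicode11utf8',
  'unicode20utf8',
  'utf-8',
  'utf8',
  'x-unicode20utf8'
];

// Hypothetical helper: return the shortest label in a group of equivalents.
function shortestLabel(labels) {
  return labels.reduce((best, label) =>
    label.length < best.length ? label : best
  );
}

// Hypothetical helper: if the value is a known UTF-8 label (ignoring case
// and surrounding whitespace), swap it for the shortest equivalent label.
// Unknown values are returned untouched.
function minifyCharset(value) {
  const normalized = value.trim().toLowerCase();
  return utf8Labels.includes(normalized) ? shortestLabel(utf8Labels) : value;
}

console.log(minifyCharset('UTF-8')); // utf8
```

Under this approach every accepted spelling of the encoding collapses to the same four-character label, which is why "utf-8" in the input comes out as "utf8".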

In many places, this project uses these “leniencies” in the HTML spec, producing “parse errors” and other things that linters would warn about, while the HTML spec elsewhere defines exactly how browsers must handle those errors and warnings.

In this case,

Closing as intentional, let me know if you have further questions!

binyamin commented 2 years ago

@wooorm That makes sense. However, it seems only fair that "utf-8" should be preserved as is, at least via an option. Doing so adds only one character, and it encourages conformant HTML.
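Until such an option exists, one possible workaround is a small transform that runs after the minifier and restores the conformant spelling. The sketch below is hypothetical and not part of rehype-minify: the name `rehypePreserveUtf8Charset` is made up, and it assumes the hast tree stores the attribute under the property name `charSet`.

```javascript
// Hypothetical plugin (not part of rehype-minify): restores "utf-8" on
// <meta charset> after the minifier has shortened it to "utf8".
function rehypePreserveUtf8Charset() {
  return function (tree) {
    restoreCharset(tree);
  };
}

// Walk the tree recursively and rewrite matching <meta> elements in place.
function restoreCharset(node) {
  if (
    node.type === 'element' &&
    node.tagName === 'meta' &&
    node.properties &&
    typeof node.properties.charSet === 'string' && // assumed property name
    node.properties.charSet.toLowerCase() === 'utf8'
  ) {
    node.properties.charSet = 'utf-8';
  }

  for (const child of node.children || []) {
    restoreCharset(child);
  }
}
```

In the pipeline from the reproduction steps, it would be added with `.use(rehypePreserveUtf8Charset)` after `.use(rehypeMinifyEnumAttributes)` and before `.use(rehypeStringify)`.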

wooorm commented 2 years ago