validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/

Allow specifying charset and/or improve charset detection #79

Open dhouck opened 1 year ago

dhouck commented 1 year ago

Currently the only way to specify the charset is inside the document itself (with a BOM or <meta charset=...>); if the charset is known from some other source but not declared in the document, there is no way to pass it to the parser.

Additionally, charset detection does not always work well, even with Heuristics.ALL; in particular, it fails to recognize UTF-8, at least when the first non-ASCII byte appears late in the document. In a non-normative note, the WHATWG spec recommends that systems be able to recognize UTF-8 even if they arenʼt good at detecting other charsets:

The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective. [PPUTF8] [UTF8DET]

(This is reproducible with multiple test documents; the smallest is below. Another one produced a warning message saying that the UTF-8 character was invalid in Windows-1252, meaning the parser went with the default, which was a particularly bad guess.)

<!DOCTYPE html>
<html lang="en">
    <head>
        <link rel="stylesheet" href="https://fred-wang.github.io/mathml.css/mathml.css">
        <title>Circle equation</title>
        <!-- <meta charset="utf-8" /> -->
    </head>
    <body>
        <p>
            The equation
            <math display=inline>
                <mi>y</mi><mo>=</mo><mo>±</mo>
                <msqrt>
                    <msup><mi>r</mi><mn>2</mn></msup>
                    <mo>-</mo>
                    <msup><mi>x</mi><mn>2</mn></msup>
                </msqrt>
            </math>
            produces a circle with radius <math display=inline><mi>r</mi></math>:
        </p>
        <svg width="10em" height="10em" viewBox="0 0 100 100">
            <desc>A circle</desc>
            <circle cx="50" cy="50" r="40" fill="none" stroke="blue" stroke-width="1" />
        </svg>
    </body>
</html>
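
The whole-file UTF-8 check described in the WHATWG note quoted above can be sketched with nothing but the JDK. This is a minimal illustration, not htmlparser code: the class name Utf8Sniffer, its Result enum, and the sniff method are hypothetical, and it assumes the caller has the entire document in memory as a byte array. It relies on the standard java.nio.charset.CharsetDecoder configured to report malformed input rather than silently replace it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of htmlparser: classifies a complete
// document as pure ASCII, valid UTF-8, or definitely not UTF-8.
final class Utf8Sniffer {

    enum Result { ASCII_ONLY, UTF8, NOT_UTF8 }

    static Result sniff(byte[] bytes) {
        // Bytes above 0x7F are negative in Java's signed byte type.
        boolean sawNonAscii = false;
        for (byte b : bytes) {
            if (b < 0) {
                sawNonAscii = true;
                break;
            }
        }
        if (!sawNonAscii) {
            // Nothing to distinguish UTF-8 from other ASCII-compatible
            // encodings; the caller should fall back to its default.
            return Result.ASCII_ONLY;
        }
        try {
            // Strict decode of the whole buffer: any malformed sequence
            // raises an exception instead of being replaced with U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return Result.UTF8;
        } catch (CharacterCodingException e) {
            return Result.NOT_UTF8;
        }
    }
}
```

Because the check scans the full buffer, a document like the one above, whose only non-ASCII character (±) sits well past the head of the file, would still be classified as UTF8. A caller that knows the result could then decode the bytes itself before parsing; whether and how that result can be handed to the parser is exactly what this issue asks for.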