html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License

AssertionError: 'iso-8859-2' == 'windows-1252' during encoding test #433

Open awilfox opened 4 years ago

awilfox commented 4 years ago

Python 3.6.9 on Linux kernel 5.4.5 with musl libc 1.2.0.

    def test_encoding():
        for filename in get_data_files("encoding"):
            tests = _TestData(filename, b"data", encoding=None)
            for test in tests:
>               runParserEncodingTest(test[b'data'], test[b'encoding'])

html5lib/tests/test_encoding.py:102: 
_ _ _ _
data = b'<!DOCTYPE HTML>\n<script>document.write(\'<meta charset="ISO-8859-\' + \'2">\')</script>', encoding = 'iso-8859-2'

    def runParserEncodingTest(data, encoding):
        p = HTMLParser()
        assert p.documentEncoding is None
        p.parse(data, useChardet=False)
        encoding = encoding.lower().decode("ascii")

>       assert encoding == p.documentEncoding, errorMessage(data, encoding, p.documentEncoding)
E       AssertionError: Input:
E         b'<!DOCTYPE HTML>\n<script>document.write(\'<meta charset="ISO-8859-\' + \'2">\')</script>'
E         Expected:
E         'iso-8859-2'
E         Recieved
E         'windows-1252'
E         
E       assert 'iso-8859-2' == 'windows-1252'
E         - iso-8859-2
E         + windows-1252

html5lib/tests/test_encoding.py:84: AssertionError
==== short test summary info ====
FAILED html5lib/tests/test_encoding.py::test_encoding - AssertionError: Input:
==== 1 failed, 15715 passed, 14980 skipped, 666 xfailed in 137.44s (0:02:17) ====
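
A minimal standalone reproduction of the failing check might look like the sketch below. It uses only the calls visible in the traceback (`HTMLParser()`, `parse(..., useChardet=False)`, `documentEncoding`); the helper name `detected_encoding` and the constant `DATA` are made up for illustration, and html5lib is assumed to be installed.

```python
# Hypothetical helper, not part of the html5lib test suite.
def detected_encoding(data):
    """Parse raw bytes and return the encoding the parser settled on."""
    # Imported lazily so the helper can be defined without html5lib installed.
    import html5lib

    parser = html5lib.HTMLParser()
    parser.parse(data, useChardet=False)
    return parser.documentEncoding


# The input from the failing test case, copied from the traceback above.
DATA = b'<!DOCTYPE HTML>\n<script>document.write(\'<meta charset="ISO-8859-\' + \'2">\')</script>'

if __name__ == "__main__":
    # The test expects 'iso-8859-2'; the report above shows 'windows-1252'.
    print(detected_encoding(DATA))
```

Running this outside pytest should make it easier to check whether the result depends on the platform or only on the input.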

openandclose commented 4 years ago

Not directly related, but is the test data itself right, spec-wise?

A light investigation tells me it is. (The numerous reservations and exceptions in the 'document.write' part of the HTML spec, notably WICG/interventions#17, don't apply here, I think.)

But Firefox and Chrome think differently:

<!DOCTYPE HTML>
<script>document.write('<meta charset="ISO-8859-' + '2">')</script>

Firefox: UTF-8
Chrome: windows-1252

For comparison:

<!DOCTYPE HTML>
<script>document.write('<meta charset="ISO-8859-2">')</script>

Firefox: ISO-8859-2
Chrome: windows-1252

(Firefox 72.0, Chromium 79.0.3945.117, document.characterSet in the web console)
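
To make the difference between the two inputs concrete, here is a rough sketch of what a scan that never executes the script would see in the raw bytes. It is a naive regex written for illustration only: it is neither the HTML spec's prescan algorithm nor html5lib's implementation, and the names `META_CHARSET`, `static_charset_label`, `SPLIT` and `WHOLE` are made up.

```python
import re

# Naive pattern for illustration; real encoding sniffing is far more involved.
META_CHARSET = re.compile(rb'<meta\s+charset="([^"]*)"', re.IGNORECASE)


def static_charset_label(data):
    """Return the literal charset label found in the bytes, or None."""
    match = META_CHARSET.search(data)
    return match.group(1).decode("ascii") if match else None


# First input: the label is assembled by string concatenation at run time.
SPLIT = b'<!DOCTYPE HTML>\n<script>document.write(\'<meta charset="ISO-8859-\' + \'2">\')</script>'
# Comparison input: the label appears whole inside the script text.
WHOLE = b'<!DOCTYPE HTML>\n<script>document.write(\'<meta charset="ISO-8859-2">\')</script>'

if __name__ == "__main__":
    for name, data in (("split", SPLIT), ("whole", WHOLE)):
        label = static_charset_label(data)
        print(name, repr(label), label.lower() == "iso-8859-2")
```

In the first input the literal text between the double quotes is `ISO-8859-' + '2`, so the expected label only exists once the script has run. In the comparison input the text is `ISO-8859-2` as written.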