html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License

Warn about invalid byte sequences #167

Open yurikhan opened 10 years ago

yurikhan commented 10 years ago

HTML 5 Proposed Recommendation §8.2.2 The input byte stream, HTML 5.1 Draft §8.2.2 The input byte stream:

Note: Bytes or sequences of bytes in the original byte stream that did not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]

Test case:

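import io
import unittest
import html5lib

class TestInvalidSequences(unittest.TestCase):
    def test_invalid_sequences(self):
        parser = html5lib.HTMLParser()
        # 0xA0 is not a valid ASCII byte, so the input contains an encoding error.
        doc = parser.parse(io.BytesIO(b'<!DOCTYPE html>\xA0'), encoding='ascii')
        self.assertTrue(parser.errors)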

Expected behavior: parser.errors is not empty

Observed behavior: parser.errors is empty; doc contains a tree which contains the \uFFFD replacement character in place of the invalid byte.

Cause: In HTMLBinaryInputStream.reset, the codec is constructed with the option 'replace'; the HTMLUnicodeInputStream only reports errors for Unicode code points which were successfully decoded but are either non-characters or surrogates.
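
To illustrate the cause independently of html5lib, here is a minimal sketch using only the standard bytes.decode call on Python 3. It shows that the 'replace' error handler substitutes U+FFFD without telling the caller anything, while 'strict' knows exactly where the invalid byte is but stops decoding. The variable names are made up for the example.

```python
data = b'<!DOCTYPE html>\xa0'

# 'replace' swaps the undecodable byte for U+FFFD and raises nothing,
# so the caller never learns that the input was invalid.
text = data.decode('ascii', 'replace')

# 'strict' reports where the invalid byte is, but aborts decoding.
try:
    data.decode('ascii', 'strict')
except UnicodeDecodeError as exc:
    bad_span = (exc.start, exc.end)
```

The position information needed for a warning is therefore available to the codec machinery; it is simply discarded when 'replace' is used.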

gsnedders commented 10 years ago

This should be doable, though perhaps behind a flag given it may well have a notable impact on performance. We essentially want to reimplement PyObject *PyCodec_ReplaceErrors(PyObject *exc) (in Python/codecs.c) but with warnings whenever anything happens.
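
A rough sketch of that idea in pure Python, assuming Python 3 and the standard codecs.register_error hook. The handler name 'replace-and-warn' and the decode_warnings list are hypothetical and not part of html5lib; how the input stream would actually select the handler and expose the warnings is left open.

```python
import codecs

# Hypothetical collector for encoding errors; html5lib has nothing like this.
decode_warnings = []

def replace_and_warn(exc):
    """Mimic the decoding branch of 'replace', recording each invalid span."""
    if not isinstance(exc, UnicodeDecodeError):
        # Encoding and translating are out of scope for this sketch.
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    decode_warnings.append((exc.start, exc.end, bad_bytes))
    # Same result as 'replace' when decoding: one U+FFFD, resume after the span.
    return ('\ufffd', exc.end)

codecs.register_error('replace-and-warn', replace_and_warn)

text = b'<!DOCTYPE html>\xa0'.decode('ascii', 'replace-and-warn')
```

The decoded text is the same as with 'replace', and decode_warnings now holds the offset and the offending bytes. Python invokes the handler only when a decode error occurs, though whether that keeps the overhead acceptable inside the input stream would need measuring.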

Note that, as I said in #166, parser.errors only holds what the spec calls "parse errors", so these should be handled separately IMO.

yurikhan commented 10 years ago

+1 about separating conformance violations from parse errors. The reporting mechanism could be similar, though.

Shall I update the Expected paragraph above?

gsnedders commented 10 years ago

Nah, don't worry about editing it.