Open yurikhan opened 10 years ago
This should be doable, though perhaps behind a flag given it may well have a notable impact on performance. We essentially want to reimplement `PyObject *PyCodec_ReplaceErrors(PyObject *exc)` (in `Python/codecs.c`) but with warnings whenever anything happens.
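For illustration only, here is a minimal sketch of that idea done from Python rather than C, using `codecs.register_error`. The handler name `"replace-warn"` and the function name are hypothetical, not anything html5lib defines, and the sketch only covers the decoding case (the built-in `replace` handler also handles encode and translate errors).

```python
import codecs
import warnings


def replace_with_warning(exc):
    """Mimic the built-in 'replace' handler for decoding, but warn each time."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        warnings.warn(
            "invalid byte sequence %r at position %d for codec %s"
            % (bad, exc.start, exc.encoding),
            UnicodeWarning,
            stacklevel=2,
        )
        # Same result as 'replace' when decoding: one U+FFFD, resume after the bad span.
        return ("\ufffd", exc.end)
    # Encode/translate errors are out of scope for this sketch.
    raise exc


# "replace-warn" is a made-up name used only for this example.
codecs.register_error("replace-warn", replace_with_warning)
```

With this registered, `b"a\xffb".decode("utf-8", "replace-warn")` still yields the replacement character, but each invalid span also produces a `UnicodeWarning` that a caller could collect and report. The extra Python-level call per bad byte is the kind of overhead that makes putting this behind a flag reasonable.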
Note as I said in #166, `parser.errors` is merely what the spec calls "parse errors", hence these should be handled separately IMO.
+1 about separating conformance violations from parse errors. The reporting mechanism could be similar, though.
Shall I update the Expected paragraph above?
Nah, don't worry about editing it.
HTML 5 Proposed Recommendation §8.2.2 The input byte stream, HTML 5.1 Draft §8.2.2 The input byte stream:
Test case:
Expected behavior: `parser.errors` is not empty.

Observed behavior: `parser.errors` is empty; `doc` contains a tree which contains the `\uFFFD` replacement character in place of the invalid byte.

Cause: In `HTMLBinaryInputStream.reset`, the codec is constructed with the option `'replace'`; the `HTMLUnicodeInputStream` only reports errors for Unicode code points which were successfully decoded but are either non-characters or surrogates.
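The effect of the `'replace'` option can be seen with the standard library alone, without html5lib. The snippet below is a sketch with a made-up input, not the test case from this report: it decodes the same bytes once with `'replace'` and once with `'strict'`.

```python
# Hypothetical input: 0xE9 is a UTF-8 lead byte with no continuation bytes after it.
data = b"<p>caf\xe9</p>"

# With 'replace', the invalid byte silently becomes U+FFFD; nothing is reported.
replaced = data.decode("utf-8", "replace")

# With 'strict', the same input raises, and the exception carries the position.
strict_error = None
try:
    data.decode("utf-8", "strict")
except UnicodeDecodeError as exc:
    strict_error = exc
```

After decoding with `'replace'`, U+FFFD is an ordinary, valid code point, so a later check that only looks for non-characters and surrogates has nothing left to flag. The position information in `strict_error.start` is only available at decode time.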