Closed cgaebel closed 9 years ago
U+FFFE is a noncharacter, but that doesn't make the corresponding UTF-8 sequence (EF BF BE) invalid! Quoting section 23.7 of the Unicode Standard 7.0:
Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed. The intent of noncharacters is that they are permanently prohibited from being assigned interchangeable meanings by the Unicode Standard. They are not prohibited from occurring in valid Unicode strings which happen to be interchanged. This distinction, which might be seen as too finely drawn, ensures that noncharacters are correctly preserved when "interchanged" internally, as when used in strings in APIs, in other interprocess protocols, or when stored.
There are also a number of other noncharacters, including U+FDD0..FDEF reserved for Arabic processing, and none of them are prohibited in UTF-8. Rust's char happily accepts them. (Try '\ufffe' :)
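To make that concrete, here is a minimal sketch using only the standard library. It is written in current Rust escape syntax ('\u{FFFE}' rather than the older '\ufffe'), and the helper `is_noncharacter` is a hypothetical name made up for illustration, not something from Rust-encoding or std.

```rust
/// Hypothetical helper: true for the 66 Unicode noncharacters.
fn is_noncharacter(c: char) -> bool {
    let cp = c as u32;
    // U+FDD0..=U+FDEF, plus the last two code points of every plane
    // (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFF).
    (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE
}

fn main() {
    // A noncharacter is still a perfectly good `char`.
    let c = '\u{FFFE}';
    assert!(is_noncharacter(c));

    // It encodes to the three bytes EF BF BE ...
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf);
    assert_eq!(encoded.as_bytes(), &[0xEFu8, 0xBF, 0xBE][..]);

    // ... and those bytes decode back as well-formed UTF-8.
    assert!(std::str::from_utf8(&[0xEF, 0xBF, 0xBE]).is_ok());

    // The U+FDD0..FDEF range is accepted too.
    assert!(char::from_u32(0xFDD0).is_some());

    // Contrast: surrogates are not scalar values, so `char` rejects them.
    assert!(char::from_u32(0xD800).is_none());
}
```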
Ah. I tried looking up invalid UTF-8 and that's what I found. Silly me! Can you give me an example of something which is invalid UTF-8?
@cgaebel Rust-encoding has a full test suite for the invalid UTF-8 sequences.
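For a quick illustration outside that test suite, the sketch below checks a few byte sequences that are ill-formed UTF-8, using only `std::str::from_utf8` from the standard library. The wrapper `is_valid_utf8` is a hypothetical name for this example, and these cases are not taken from the Rust-encoding tests.

```rust
use std::str;

/// Hypothetical wrapper around the standard library validator.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    // A lone continuation byte with no lead byte.
    assert!(!is_valid_utf8(&[0x80]));
    // Overlong two-byte encoding of '/' (U+002F).
    assert!(!is_valid_utf8(&[0xC0, 0xAF]));
    // An encoded surrogate (U+D800), which is not a scalar value.
    assert!(!is_valid_utf8(&[0xED, 0xA0, 0x80]));
    // A three-byte sequence cut off after two bytes.
    assert!(!is_valid_utf8(&[0xE2, 0x82]));
    // 0xFF never appears in UTF-8.
    assert!(!is_valid_utf8(&[0xFF]));
    // A code point beyond U+10FFFF.
    assert!(!is_valid_utf8(&[0xF4, 0x90, 0x80, 0x80]));

    // By contrast, the noncharacter U+FFFE is well-formed.
    assert!(is_valid_utf8(&[0xEF, 0xBF, 0xBE]));

    // The error reports how many leading bytes were valid.
    let err = str::from_utf8(b"ab\xFFcd").unwrap_err();
    assert_eq!(err.valid_up_to(), 2);
}
```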
Ahhh I missed the processed > 0 condition when looking for this "bug". Thanks for pointing me in the right direction!
This test successfully reports an error, but when it does it writes an invalid code sequence into the buffer.
(Side note: GitHub markup is eating the invalid UTF-8 char in left. Rest assured SOMETHING is in there.)