lifthrasiir / rust-encoding

Character encoding support for Rust
MIT License

Incrementally parsed invalid sequences spanning multiple chunks write data #52

cgaebel closed this 9 years ago

cgaebel commented 9 years ago
    #[test]
    fn test_invalid_multibyte_span() {
        use std::mem;
        let mut d = UTF8Encoding.decoder();
        // "ef bf be" is an invalid sequence.
        assert_feed_ok!(d, [], [0xef, 0xbf], "");
        let input: [u8, ..1] = [ 0xbe ];
        let (_, _, buf) = unsafe { d.test_feed(mem::transmute(input.as_slice())) };
        // Make sure no data was written to the buffer.
        assert_eq!(buf, String::new());
        // task 'codec::utf_8::tests::test_invalid_multibyte_span' failed at 'assertion failed: `(left == right) && (right == left)` (left: `￾`, right: ``)', /Users/cgaebel/code/rust-encoding/src/codec/utf_8.rs:529
    }

This test successfully reports an error, but when it does it writes an invalid code sequence into the buffer.

(Side note: GitHub markup is eating the invalid UTF-8 char in left. Rest assured SOMETHING is in there.)

lifthrasiir commented 9 years ago

U+FFFE is a noncharacter, but that doesn't make the corresponding UTF-8 sequence (EF BF BE) invalid! Quoting section 23.7 of the Unicode Standard 7.0:

> Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed. The intent of noncharacters is that they are permanently prohibited from being assigned interchangeable meanings by the Unicode Standard. They are not prohibited from occurring in valid Unicode strings which happen to be interchanged. This distinction, which might be seen as too finely drawn, ensures that noncharacters are correctly preserved when "interchanged" internally, as when used in strings in APIs, in other interprocess protocols, or when stored.

There are a number of other noncharacters too, including U+FDD0..FDEF (which sit inside the Arabic Presentation Forms-A block), and none of them are prohibited in UTF-8. Rust's char happily accepts them. (Try '\ufffe' :)
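A quick way to check this claim is with the standard library alone. The sketch below is not part of rust-encoding; `decode_code_points` is a hypothetical helper, and the escape is written `'\u{FFFE}'`-style as current Rust requires.

```rust
/// Hypothetical helper: decodes `bytes` as UTF-8 and returns the code points,
/// or `None` if the bytes are ill-formed.
fn decode_code_points(bytes: &[u8]) -> Option<Vec<u32>> {
    let s = std::str::from_utf8(bytes).ok()?;
    Some(s.chars().map(|c| c as u32).collect())
}

fn main() {
    // EF BF BE is the well-formed encoding of the noncharacter U+FFFE.
    assert_eq!(decode_code_points(&[0xEF, 0xBF, 0xBE]), Some(vec![0xFFFE]));

    // Every noncharacter in U+FDD0..U+FDEF is a valid `char` and round-trips.
    for cp in 0xFDD0u32..=0xFDEF {
        let c = std::char::from_u32(cp).expect("noncharacters are scalar values");
        let mut buf = [0u8; 4];
        let encoded = c.encode_utf8(&mut buf);
        assert_eq!(decode_code_points(encoded.as_bytes()), Some(vec![cp]));
    }
    println!("noncharacters round-trip through UTF-8");
}
```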

cgaebel commented 9 years ago

Ah. I tried looking up invalid UTF-8 and that's what I found. Silly me! Can you give me an example of something which is invalid UTF-8?

lifthrasiir commented 9 years ago

@cgaebel Rust-encoding has a full test suite for the invalid UTF-8 sequences.
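For a few concrete cases, here is a small sketch that uses only the standard library's validator. These byte sequences are chosen for illustration and are not copied from rust-encoding's test suite; `is_well_formed` is a hypothetical helper name.

```rust
/// Hypothetical helper: true if `bytes` is well-formed UTF-8
/// according to the standard library.
fn is_well_formed(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // A continuation byte with no lead byte before it.
    assert!(!is_well_formed(&[0x80]));
    // Overlong two-byte encoding of U+0000.
    assert!(!is_well_formed(&[0xC0, 0x80]));
    // Encoded surrogate U+D800, which is not a Unicode scalar value.
    assert!(!is_well_formed(&[0xED, 0xA0, 0x80]));
    // Would decode to U+110000, above the U+10FFFF limit.
    assert!(!is_well_formed(&[0xF4, 0x90, 0x80, 0x80]));
    // Three-byte sequence cut off after two bytes.
    assert!(!is_well_formed(&[0xEF, 0xBF]));

    // By contrast, the noncharacter U+FFFE is well-formed.
    assert!(is_well_formed(&[0xEF, 0xBF, 0xBE]));
    println!("all checks passed");
}
```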

cgaebel commented 9 years ago

Ahhh I missed the processed > 0 condition when looking for this "bug". Thanks for pointing me in the right direction!
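The behaviour the original test was probing (a sequence split across two chunks should write nothing until it is known to be complete and valid) can be sketched with the standard library. This is only an illustration under stated assumptions, not rust-encoding's decoder: `ChunkedUtf8` and `feed` are hypothetical names, and on an error this sketch simply discards the buffered bytes.

```rust
use std::str;

/// Hypothetical streaming decoder: holds back an incomplete trailing
/// sequence between chunks instead of writing anything for it.
struct ChunkedUtf8 {
    pending: Vec<u8>,
}

impl ChunkedUtf8 {
    fn new() -> Self {
        ChunkedUtf8 { pending: Vec::new() }
    }

    /// Feeds one chunk and appends any fully decoded text to `out`.
    /// Returns `Err(())` on an ill-formed sequence; the bad bytes are never written.
    fn feed(&mut self, chunk: &[u8], out: &mut String) -> Result<(), ()> {
        self.pending.extend_from_slice(chunk);
        match str::from_utf8(&self.pending) {
            Ok(s) => {
                out.push_str(s);
                self.pending.clear();
                Ok(())
            }
            Err(e) => {
                let valid = e.valid_up_to();
                // Everything before `valid` is well-formed and safe to emit.
                out.push_str(str::from_utf8(&self.pending[..valid]).unwrap());
                if e.error_len().is_some() {
                    // Definitely ill-formed: report it (simplification: drop the rest).
                    self.pending.clear();
                    Err(())
                } else {
                    // Input ended mid-sequence: keep the tail for the next chunk.
                    self.pending.drain(..valid);
                    Ok(())
                }
            }
        }
    }
}

fn main() {
    // Valid sequence EF BF BE split across two chunks.
    let mut d = ChunkedUtf8::new();
    let mut out = String::new();
    assert_eq!(d.feed(&[0xEF, 0xBF], &mut out), Ok(()));
    assert_eq!(out, ""); // nothing written yet
    assert_eq!(d.feed(&[0xBE], &mut out), Ok(()));
    assert_eq!(out, "\u{FFFE}");

    // Ill-formed sequence spanning two chunks: E3 81 followed by a non-continuation byte.
    let mut d = ChunkedUtf8::new();
    let mut out = String::new();
    assert_eq!(d.feed(&[0xE3, 0x81], &mut out), Ok(()));
    assert_eq!(d.feed(&[0x41], &mut out), Err(()));
    assert_eq!(out, ""); // the error wrote no data
}
```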