ikorin24 / U8XmlParser

Extremely fast UTF-8 xml parser library
MIT License

`RawString.StartsWith` and `RawString.EndsWith` treat any unpaired surrogate in the argument string as "�" #10

RamType0 closed 2 years ago

RamType0 commented 2 years ago

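`RawString.StartsWith` and `RawString.EndsWith` treat any unpaired surrogate in the argument string as "�" (U+FFFD). As a result, a `RawString` that contains U+FFFD matches an argument that is only an unpaired surrogate: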

[Fact]
public unsafe void UnpairedSurrogateComparison()
{
    // "\ufffd" == "�". It is the default fallback character for UTF8Encoding.
    const string FallbackCharStr = "\ufffd";
    // "\ud83d" is an unpaired high surrogate.
    const string SurrogateCharStr = "\ud83d";
    var fallbackCharUtf8Bytes = Encoding.UTF8.GetBytes(FallbackCharStr);
    fixed(byte* ptr = fallbackCharUtf8Bytes) {
        var fallbackCharRawStr = new RawString(ptr, fallbackCharUtf8Bytes.Length);
        // The raw string holds U+FFFD, not a surrogate, so neither call should match.
        Assert.False(fallbackCharRawStr.StartsWith(SurrogateCharStr));
        Assert.False(fallbackCharRawStr.EndsWith(SurrogateCharStr));
    }
}

This kind of test fails, because both calls return true.
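For context, here is a minimal sketch of the standard .NET behavior that the test comments refer to. It uses only `Encoding.UTF8` from the base class library and says nothing about how `RawString` is implemented internally. The class and method names (`ReplacementFallbackDemo`, `EncodeLenient`) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical demo class, not part of U8XmlParser.
public static class ReplacementFallbackDemo
{
    // Encoding.UTF8 uses a replacement fallback: invalid input such as an
    // unpaired surrogate is silently encoded as U+FFFD.
    public static byte[] EncodeLenient(string s) => Encoding.UTF8.GetBytes(s);

    public static void Main()
    {
        byte[] fallback = EncodeLenient("\ufffd");   // a genuine U+FFFD character
        byte[] surrogate = EncodeLenient("\ud83d");  // an unpaired high surrogate

        Console.WriteLine(BitConverter.ToString(fallback));   // EF-BF-BD
        Console.WriteLine(BitConverter.ToString(surrogate));  // EF-BF-BD
        Console.WriteLine(fallback.SequenceEqual(surrogate)); // True
    }
}
```

Once both strings are encoded this way, their bytes are identical, so any byte-wise comparison can no longer tell them apart.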

ikorin24 commented 2 years ago

Throw an exception when converting invalid strings to UTF-8, such as those containing unpaired surrogates.

[Fact]
public unsafe void UnpairedSurrogateComparison()
{
    // "\ufffd" == "�". It is the default fallback character for UTF8Encoding.
    const string FallbackCharStr = "\ufffd";
    // "\ud83d" is an unpaired high surrogate.
    const string SurrogateCharStr = "\ud83d";
    var fallbackCharUtf8Bytes = UTF8ExceptionFallbackEncoding.Instance.GetBytes(FallbackCharStr);
    fixed(byte* ptr = fallbackCharUtf8Bytes) {
        var fallbackCharRawStr = new RawString(ptr, fallbackCharUtf8Bytes.Length);
        Assert.Throws<EncoderFallbackException>(() => fallbackCharRawStr.StartsWith(SurrogateCharStr));
        Assert.Throws<EncoderFallbackException>(() => fallbackCharRawStr.EndsWith(SurrogateCharStr));
    }
}
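
As a rough illustration of the exception-throwing approach, the standard `UTF8Encoding` constructor accepts a `throwOnInvalidBytes` flag that replaces the lenient fallback with one that throws. This is only a sketch of that base class library feature. Whether `UTF8ExceptionFallbackEncoding` is built this way is an assumption, and `StrictEncodingDemo` and `TryEncodeStrict` are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of U8XmlParser.
public static class StrictEncodingDemo
{
    // A UTF-8 encoding without a BOM that throws on invalid input instead of
    // substituting U+FFFD. Assumed stand-in for a strict encoder.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns false instead of throwing when the string cannot be encoded.
    public static bool TryEncodeStrict(string s, out byte[] bytes)
    {
        try
        {
            bytes = Strict.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = Array.Empty<byte>();
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(TryEncodeStrict("\ufffd", out _)); // True
        Console.WriteLine(TryEncodeStrict("\ud83d", out _)); // False
    }
}
```

With a strict encoder, a valid U+FFFD still encodes normally, while an unpaired surrogate is reported as an error rather than silently becoming the same bytes.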