mbdavid / LiteDB

LiteDB - A .NET NoSQL Document Store in a single data file
http://www.litedb.org
MIT License

[BUG] LiteDB replaces unpaired surrogates with U+FFFD (REPLACEMENT CHARACTER) #2406

Closed anatawa12 closed 9 months ago

anatawa12 commented 10 months ago

Version Which LiteDB version/OS/.NET framework version are you using. (REQUIRED)

Describe the bug A clear and concise description of what the bug is.

LiteDB silently replaces an unpaired surrogate with U+FFFD.

Code to Reproduce Write a small snippet that isolates your bug so that our team can test it. (REQUIRED)

using System;
using LiteDB;

using (var db = new LiteDatabase("test.litedb"))
{
    var collection = db.GetCollection("test");
    // "\uD800" is a lone high surrogate with no low surrogate after it.
    var document = new BsonDocument()
    {
        { "UnpairedSurrogate", "\uD800" },
    };
    Console.WriteLine("Before insert: " + (document["UnpairedSurrogate"].AsString == "\uD800")); // Prints True: OK
    // Insert the document, then read it back by the id that Insert returned.
    var inserted = collection.FindById(collection.Insert(document));
    Console.WriteLine("After insert: " + (inserted["UnpairedSurrogate"].AsString == "\uD800")); // Prints False: bad, should be True
}

Expected behavior A clear and concise description of what you expected to happen.

One of the following is expected:

  1. Preserve unpaired surrogates across serialization
  2. Throw an exception in implicit operator BsonValue(string value), since unpaired surrogates are not supported by BSON
  3. Throw an exception in ILiteCollection.Insert, since unpaired surrogates are not supported by LiteDB


Additional context Add any other context about the problem here.

An unpaired surrogate is valid in a Windows path name, so I think preserving it is the best option. However, an unpaired surrogate cannot be represented in valid UTF-8, so I think it would also be reasonable not to support it.

To preserve unpaired surrogates in a backwards-compatible way (and a forward-compatible way, as long as the string is valid UTF-8), you could use WTF-8, an extension of UTF-8 used by Rust and Go to preserve unpaired surrogates in Windows paths.
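To illustrate what WTF-8 encoding would look like, here is a minimal sketch of an encoder in C#. The class name Wtf8Sketch and the method Encode are hypothetical names for this example; they are not part of LiteDB or .NET. The sketch only covers encoding, and a matching decoder would be needed to read the bytes back.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a LiteDB or .NET API.
public static class Wtf8Sketch
{
    // Encodes a UTF-16 string as WTF-8: well-formed input yields ordinary UTF-8,
    // while a lone surrogate is written as a 3-byte sequence instead of U+FFFD.
    public static byte[] Encode(string s)
    {
        var bytes = new List<byte>(s.Length * 3);
        for (int i = 0; i < s.Length; i++)
        {
            int cp = s[i];
            // Combine a well-formed surrogate pair into one supplementary code point.
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                cp = char.ConvertToUtf32(s[i], s[i + 1]);
                i++;
            }

            if (cp < 0x80)
            {
                bytes.Add((byte)cp);
            }
            else if (cp < 0x800)
            {
                bytes.Add((byte)(0xC0 | (cp >> 6)));
                bytes.Add((byte)(0x80 | (cp & 0x3F)));
            }
            else if (cp < 0x10000)
            {
                // Lone surrogates (U+D800..U+DFFF) fall through to this branch.
                bytes.Add((byte)(0xE0 | (cp >> 12)));
                bytes.Add((byte)(0x80 | ((cp >> 6) & 0x3F)));
                bytes.Add((byte)(0x80 | (cp & 0x3F)));
            }
            else
            {
                bytes.Add((byte)(0xF0 | (cp >> 18)));
                bytes.Add((byte)(0x80 | ((cp >> 12) & 0x3F)));
                bytes.Add((byte)(0x80 | ((cp >> 6) & 0x3F)));
                bytes.Add((byte)(0x80 | (cp & 0x3F)));
            }
        }
        return bytes.ToArray();
    }
}
```

With this sketch, "\uD800" encodes to the bytes ED A0 80, and any string without unpaired surrogates produces the same bytes as Encoding.UTF8.GetBytes, which is where the backwards compatibility comes from.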

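The reason LiteDB silently replaces an unpaired surrogate with U+FFFD is that it uses Encoding.UTF8, which is equivalent to new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: false). I think LiteDB should use new UTF8Encoding(false, true), which throws an exception on an unpaired surrogate instead of replacing it.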
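The difference between the two encoders can be checked without LiteDB. The following is a small sketch using only System.Text; the class name SurrogateEncodingDemo and its methods are hypothetical names for this example, and it assumes that LiteDB's serializer goes through Encoding.UTF8 as described above.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not a LiteDB API.
public static class SurrogateEncodingDemo
{
    // Round-trips a string through Encoding.UTF8, which uses a replacement fallback.
    public static string RoundTripLenient(string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        return Encoding.UTF8.GetString(bytes);
    }

    // Returns true when a strict encoder (throwOnInvalidBytes: true) accepts the string.
    public static bool IsEncodableStrict(string value)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(value);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // Thrown for input that cannot be encoded, such as a lone surrogate.
            return false;
        }
    }
}
```

Here RoundTripLenient("\uD800") returns "\uFFFD", matching the silent replacement seen after Insert, while IsEncodableStrict("\uD800") returns false because the strict encoder throws. A check like this could back option 2 or 3 from the expected behavior, though where it would belong inside LiteDB is left open.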