mongodb / mongo-rust-driver

The official MongoDB Rust Driver
https://www.mongodb.com/docs/drivers/rust/current/
Apache License 2.0

Add deserialize-utf8-lossy feature to always deserialize using lossy UTF-8 conversion #1187

Status: Closed (tyilo closed 2 weeks ago)

tyilo commented 1 month ago

Useful if you need to read from a collection created by a driver for another programming language.

See #799

tyilo commented 1 month ago

Also consider making this the default to improve interoperability with other drivers.

abr-egn commented 1 month ago

Thanks for the contribution! I don't think we'll want to make this the default, since that could break anyone relying on the existing validation, but it seems quite helpful as an opt-in feature.

If you're willing, could you share more information about the situation that motivates this? The bson spec says that strings are UTF-8, so at least in theory drivers shouldn't be writing values that require this.

tyilo commented 1 month ago

> If you're willing, could you share more information about the situation that motivates this? The bson spec says that strings are UTF-8, so at least in theory drivers shouldn't be writing values that require this.

Yes, I agree that in a perfect world this would be rejected by all drivers and also the MongoDB server itself.

However, at least the Java driver doesn't always produce valid UTF-8 when writing to a collection. See https://jira.mongodb.org/projects/JAVA/issues/JAVA-5575 which has been closed as "Won't Fix".
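
For readers unfamiliar with the distinction, here is a minimal sketch of strict versus lossy UTF-8 conversion using only the Rust standard library. The helper names `decode_strict` and `decode_lossy` are made up for illustration and are not part of the driver or the bson crate; the sketch only shows the general behavior being discussed, not how either crate implements it.

```rust
use std::str::Utf8Error;

/// Strict conversion: fails if the bytes contain any invalid UTF-8 sequence.
/// (Hypothetical helper, for illustration only.)
fn decode_strict(bytes: &[u8]) -> Result<String, Utf8Error> {
    std::str::from_utf8(bytes).map(str::to_owned)
}

/// Lossy conversion: never fails; invalid sequences are replaced
/// with U+FFFD, the Unicode replacement character.
/// (Hypothetical helper, for illustration only.)
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Given the bytes `[b'h', b'i', 0xFF]`, `decode_strict` returns an error because `0xFF` can never appear in valid UTF-8, while `decode_lossy` returns `"hi\u{FFFD}"`. The trade-off is that the lossy result no longer round-trips to the original bytes.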

isabelatkinson commented 3 weeks ago

Hi @tyilo, apologies for the delay here! UTF-8 lossy deserialization is a concept that we'd like to keep contained to the bson library if possible. Would the wrapper type added in https://github.com/mongodb/bson-rust/pull/497 work for you rather than exposing this feature flag?

Here's the basic idea of what using this type would look like:

use serde::Deserialize;

// Deserialize is needed so the driver can decode documents into this type
#[derive(Deserialize)]
struct HasLossyString {
    // this string might contain invalid UTF-8
    s: Utf8LossyDeserialization<String>,
}

let collection: Collection<HasLossyString> = client.database("db").collection("coll");
// documents with invalid UTF-8 strings in field s will not error
let cursor = collection.find(filter).await?;

// both s1 and s2 might contain invalid UTF-8
#[derive(Deserialize)]
struct HasLossyStrings {
    s1: String,
    s2: String,
}

let collection: Collection<Utf8LossyDeserialization<HasLossyStrings>> = client.database("db").collection("coll");
// documents with any invalid UTF-8 string values will not error
let cursor = collection.find(filter).await?;

You could also use this type with Document/RawDocumentBuf.
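
To illustrate the general idea of opting in through a wrapper type rather than a crate-wide feature flag, here is a rough, standard-library-only sketch. The `DecodeStr` trait and the `Lossy` wrapper are hypothetical names invented for this example; this is not how the bson crate implements `Utf8LossyDeserialization`, only a small model of choosing the decoding behavior by type.

```rust
use std::str::Utf8Error;

/// Hypothetical trait: how a field type turns raw string bytes into a value.
trait DecodeStr: Sized {
    fn decode(bytes: &[u8]) -> Result<Self, Utf8Error>;
}

/// A plain `String` keeps strict validation and rejects invalid UTF-8.
impl DecodeStr for String {
    fn decode(bytes: &[u8]) -> Result<Self, Utf8Error> {
        std::str::from_utf8(bytes).map(str::to_owned)
    }
}

/// Hypothetical wrapper: choosing this type opts in to lossy decoding.
#[derive(Debug, PartialEq)]
struct Lossy(String);

impl DecodeStr for Lossy {
    fn decode(bytes: &[u8]) -> Result<Self, Utf8Error> {
        // Invalid sequences become U+FFFD, so this branch never fails.
        Ok(Lossy(String::from_utf8_lossy(bytes).into_owned()))
    }
}
```

With this shape, code that uses `String` behaves exactly as before, and only the places that name `Lossy` accept invalid input, which matches the opt-in, per-type scoping described above.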

tyilo commented 3 weeks ago

@isabelatkinson Seems to work great.

isabelatkinson commented 2 weeks ago

@tyilo great to hear. I just merged in the addition to the BSON library, and it will be included in that crate's next release. Going to close this out - thanks for bringing this issue to our attention!