mongodb / mongo-rust-driver

The official MongoDB Rust Driver
https://www.mongodb.com/docs/drivers/rust/current/
Apache License 2.0

Querying for a document and getting "invalid utf-8 sequence of 1 bytes from index 119" despite not even trying to deserialize anything #799

Closed: LukasDeco closed this issue 1 year ago

LukasDeco commented 1 year ago

Versions/Environment

  1. What version of Rust are you using? - 1.65.0
  2. What operating system are you using? - WSL (Linux on Windows)
  3. What versions of the driver and its dependencies are you using? (Run cargo pkgid mongodb & cargo pkgid bson) - mongodb@2.3.1 & bson@2.4.0
  4. What version of MongoDB are you using? (Check with the MongoDB shell using db.version()) - 5.0.14
  5. What is your MongoDB topology (standalone, replica set, sharded cluster, serverless)? - Replica Set - 3 nodes

Describe the bug


I am attempting to query for a document and am getting an "invalid utf-8 sequence" error. I've done a lot of googling on this issue, but so far nothing related to MongoDB has turned up. :(

The document is quite large, so size might be a factor, but I'm able to query another document from the same collection that is also quite large without any issue, so I'm not sure that's it.

I have removed all the properties from the struct, so I'm not trying to deserialize anything at this point, just fetch the document successfully, and I still get this error. 😢

My next move is to manually delete much of the data out of the document or query for a different document... but obviously none of this is ideal. I'd just like to get someone to point me in the right direction on what the cause of this error might be.

Also important to note: I use MongoDB App Services (formerly Realm, formerly Stitch), and I ran schema validation across these documents and everything passes.

Here's the code, but I don't think it helps much:

// manually setting the ID for the query
let id = ObjectId::from_str("6116dc1633616dc8924e1050").unwrap();
let filter_document = doc! {"userId": id};
let find_one_options = FindOneOptions::builder().build();
let profile = self
    .profiles_repository
    .find_one::<Profile>(filter_document, find_one_options)
    .await;
// inside find_one
async fn find_one<'b, T: DeserializeOwned + Sync + Send + Unpin>(
    &self,
    filter: Document,
    find_one_options: FindOneOptions,
) -> Result<Option<T>, Error> {
    let collection = self
        .client
        .database("Occasionally")
        .collection::<T>("Profiles");

    collection.find_one(filter, find_one_options).await
}
// Profile Struct, commented out all the props :(
#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct Profile {
    // pub other_id: Option<ObjectId>,
    // pub complex_prop: Vec<OtherStruct>,
    // pub user_id: Option<ObjectId>,
}

Any help is greatly appreciated and please let me know if I can provide any other information.

LukasDeco commented 1 year ago

I am wondering if disabling UTF-8 validation could help? https://www.mongodb.com/docs/drivers/node/current/fundamentals/utf8-validation/ But I don't see such an option on the Rust driver.

abr-egn commented 1 year ago

Unfortunately, Rust strings are required to be utf8, so we can't disable validation. We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring; I don't know if that would be helpful for your use case.

You could try loading the document as a RawBson value instead of your record type; that should defer deserialization of strings until you access that specific field.
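
For illustration, a minimal sketch of that approach (assuming the bson 2.x raw-document API, and reusing the database, collection, and ObjectId from the code above) might look like this:

use std::str::FromStr;

use mongodb::{
    bson::{doc, oid::ObjectId, RawDocumentBuf},
    options::FindOneOptions,
    Client,
};

// Sketch only: fetch the raw BSON bytes and decode fields on access.
async fn find_profile_raw(client: &Client) -> mongodb::error::Result<()> {
    // Typing the collection as RawDocumentBuf means the driver hands back the
    // document's raw bytes instead of eagerly deserializing every field.
    let collection = client
        .database("Occasionally")
        .collection::<RawDocumentBuf>("Profiles");

    let id = ObjectId::from_str("6116dc1633616dc8924e1050").unwrap();
    let found = collection
        .find_one(doc! { "userId": id }, FindOneOptions::builder().build())
        .await?;

    if let Some(raw) = found {
        // Only this field's bytes are inspected here; other string fields in
        // the document are left untouched.
        if let Ok(Some(user_id)) = raw.get("userId") {
            println!("userId: {:?}", user_id);
        }
    }
    Ok(())
}

The idea is that only the fields you actually access get decoded, so string fields you never read should not trigger UTF-8 validation.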

It is certainly unexpected that an empty record type still causes the error - I'm looking into that! Is it possible for you to share the document that triggers this behavior?

LukasDeco commented 1 year ago

@abr-egn I've attached the document here. Fair warning: it is very large, ~12,000 lines. I also excluded a field that basically contains content very similar to what is in the "recommendationsByCategory" field, just less specifically organized.

I'm hoping I don't have to use RawBson because it sounds like at some point there would still be an issue? I might try that anyway, though. Thanks so much for your help! gift-profile-utf8-issue.txt

abr-egn commented 1 year ago

Thanks! I've reproduced this issue with a minimal test case; I'm looking to see why we're eagerly deserializing here, and what mitigations are possible.

Note that the BSON spec does say that strings are utf8 (https://bsonspec.org/spec.html), so having string values that aren't is likely to cause issues in other places as well.

abr-egn commented 1 year ago

Unfortunately, it turns out that deserializing using Serde requires iterating over all of the fields of the incoming data, so avoiding the error by dropping fields isn't really feasible.

If you're okay with lossy decoding, you can use that by loading the data into a RawDocumentBuf and parsing your record type out of that with from_slice_utf8_lossy, e.g.

// inside find_one
async fn find_one<'b, T: DeserializeOwned + Sync + Send + Unpin>(
    &self,
    filter: Document,
    find_one_options: FindOneOptions,
) -> Result<Option<T>, Error> {
    let collection = self
        .client
        .database("Occasionally")
        .collection::<RawDocumentBuf>("Profiles");

    let found = collection.find_one(filter, find_one_options).await?;
    match found {
        None => Ok(None),
        Some(raw) => {
            let lossy = bson::from_slice_utf8_lossy(raw.as_bytes())?;
            Ok(Some(lossy))
        }
    }
}

Any strings in the returned value that were invalid utf8 will have the invalid sequences replaced with placeholder characters.
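
For a concrete sense of what those placeholder characters are, here is a small standard-library illustration (not driver code, just std): lossy decoding replaces each invalid sequence with the replacement character U+FFFD.

fn main() {
    // 0xFF can never appear in valid UTF-8, so lossy decoding swaps it for
    // the replacement character U+FFFD ('�').
    let bytes = [0x68, 0x69, 0xFF, 0x21]; // "hi", one invalid byte, "!"
    let lossy = String::from_utf8_lossy(&bytes);
    assert_eq!(lossy, "hi\u{FFFD}!");
    println!("{lossy}"); // prints: hi�!
}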

Can I ask how this data was inserted? If it was via the Rust driver, that points to another bug that we'll need to look into :)

LukasDeco commented 1 year ago

Okay thank you! I'm about to give this a shot...

LukasDeco commented 1 year ago

It works great! Thank you so much.

clarkmcc commented 6 months ago

> We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring.

Any chance support for this option out-of-the-box would be considered in the future?