Closed LukasDeco closed 1 year ago
I am wondering if disabling utf8 validation can help? https://www.mongodb.com/docs/drivers/node/current/fundamentals/utf8-validation/ but I don't see such an option in the Rust driver.
Unfortunately, Rust strings are required to be utf8, so we can't disable validation. We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring; I don't know if that would be helpful for your use case.
You could try loading the document as a RawBson value instead of your record type; that should defer deserialization of strings until you access that specific field.
It is certainly unexpected that an empty record type still causes the error - I'm looking into that! Is it possible for you to share the document that triggers this behavior?
@abr-egn I've attached the document here. Be warned, it is huge: ~12000 lines. I also excluded a field that basically contains content very similar to what is in the "recommendationsByCategory" field but less specifically organized.
I'm hoping I don't have to use RawBson because it sounds like at some point there would still be an issue? I might try that anyway though. Thanks so much for your help!
gift-profile-utf8-issue.txt
Thanks! I've reproduced this issue with a minimal test case; I'm looking to see why we're eagerly deserializing here, and what mitigations are possible.
Note that the BSON spec does say that strings are utf8 (https://bsonspec.org/spec.html), so having string values that aren't is likely to cause issues in other places as well.
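For anyone following along, here is a rough sketch of what that spec requirement means at the byte level. Per bsonspec.org, a string value is a little-endian int32 length (counting the trailing NUL), then the bytes, then a 0x00 terminator. The function and error names below are made up for illustration, and this is not the driver's actual code; it only shows where a strict reader would reject a non-utf8 string.

```rust
/// Hypothetical error type for this sketch (not part of the bson crate).
#[derive(Debug, PartialEq)]
enum StringError {
    Truncated,
    BadLength,
    MissingNul,
    /// Byte offset of the first invalid sequence within the string body.
    InvalidUtf8(usize),
}

/// Reads one BSON string value from the start of `buf`:
/// int32 length (little-endian, includes the NUL) + bytes + 0x00.
fn read_bson_string(buf: &[u8]) -> Result<&str, StringError> {
    if buf.len() < 4 {
        return Err(StringError::Truncated);
    }
    let len = i32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    if len < 1 {
        return Err(StringError::BadLength);
    }
    let len = len as usize;
    let body = buf.get(4..4 + len).ok_or(StringError::Truncated)?;
    // The last byte of the body is the terminator; the rest is the text.
    let (text, nul) = body.split_at(len - 1);
    if nul[0] != 0 {
        return Err(StringError::MissingNul);
    }
    // Strict validation: this is the step that rejects invalid utf8.
    std::str::from_utf8(text).map_err(|e| StringError::InvalidUtf8(e.valid_up_to()))
}
```

A well-formed value such as `[6, 0, 0, 0, b'h', b'e', b'l', b'l', b'o', 0]` decodes to "hello", while a body containing a stray 0xFF byte is rejected with the offset of the bad byte.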
Unfortunately, it turns out that deserializing using Serde requires iterating over all of the fields of the incoming data, so avoiding the error by dropping fields isn't really feasible.
If you're okay with lossy decoding, you can use that by loading the data into a RawDocumentBuf and parsing your record type out of that with from_slice_utf8_lossy, e.g.
// inside find_one
async fn find_one<'b, T: DeserializeOwned + Sync + Send + Unpin>(
    &self,
    filter: Document,
    find_one_options: FindOneOptions,
) -> Result<Option<T>, Error> {
    // Fetch the raw bytes so no string is validated during the query itself.
    let collection = self
        .client
        .database("Occasionally")
        .collection::<RawDocumentBuf>("Profiles");
    let found = collection.find_one(filter, find_one_options).await?;
    match found {
        None => Ok(None),
        Some(raw) => {
            // Deserialize into T, replacing invalid utf8 sequences with placeholders.
            let lossy = bson::from_slice_utf8_lossy(raw.as_bytes())?;
            Ok(Some(lossy))
        }
    }
}
Any strings in the returned value that were invalid utf8 will have the invalid sequences replaced with placeholder characters.
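To make the "placeholder" behavior concrete, here is a small standard-library-only sketch of strict versus lossy decoding. It assumes the driver's lossy mode behaves like Rust's own String::from_utf8_lossy (invalid sequences become U+FFFD); the helper names are hypothetical and this is not the bson crate's implementation.

```rust
use std::str::Utf8Error;

/// Strict decoding: fails at the first invalid byte sequence.
fn decode_strict(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding: every invalid sequence is replaced with U+FFFD,
/// so the call always succeeds but the original bytes are not recoverable.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For the input `b"caf\xC3\xA9 \xFF!"`, strict decoding reports an error at byte offset 6, while lossy decoding keeps the valid text and swaps the stray 0xFF for the replacement character. The trade-off is that lossy output cannot be written back to reproduce the original bytes.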
Can I ask how this data was inserted? If it was via the Rust driver, that points to another bug that we'll need to look into :)
Okay thank you! I'm about to give this a shot...
It works great! Thank you so much.
> We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring.
Any chance support for this option out-of-the-box would be considered in the future?
Versions/Environment
Driver versions (cargo pkgid mongodb & cargo pkgid bson) - mongodb@2.3.1 & bson@2.4.0
MongoDB version (db.version()) - 5.0.14
Describe the bug
I am attempting to query for a document that is giving me an error about "invalid utf-8 sequence". I've done a lot of googling on this issue but no luck so far. Nothing related to mongodb :(
The document is quite large, so that might be a potential issue, but I'm not sure.
I'm able to query for another document from the same collection without issue, and that document is also quite large, so I'm not sure if the size is the issue.
I have removed all the properties from the struct so I'm not trying to deserialize anything at this point, just get the document successfully - and I still get this error. 😢
My next move is to manually delete much of the data out of the document or query for a different document... but obviously none of this is ideal. I'd just like to get someone to point me in the right direction on what the cause of this error might be.
Also important to note: I use MongoDB App Services (formerly Realm, formerly Stitch), and I ran schema validation across these documents and everything passes.
Here's the code, but I don't think it helps much:
Any help is greatly appreciated and please let me know if I can provide any other information.