ssokolow opened this issue 6 years ago
It turns out that it's worse than I thought. I'm now using streaming serialization via `SerializeSeq::serialize_element` and, in debug mode, it still gives me error messages via my handling of the `Result`, while, in `--release` builds, it silently corrupts the output rather than failing to serialize.
```
    },
    {
        "type": "file",
        "path": ,
    {
        "type": "file",
        "path": ,
    {
```
...and I know it's not a problem in my code, because I'm just using this and other such patterns (eg. surfacing `PermissionDenied`), which function perfectly well in `--release` builds:
```rust
if let Err(err) = seq.serialize_element(&entry) {
    error!("Error attempting to serialize entry: {}", err);
}
```
Could you provide a minimal example that reproduces the silent corruption case?
I'm at the end of my day, but I'll try to post one tomorrow.
Bah. I've got a minimal reproduction of the corruption, but the "`--release` silences the error messages that accompany it" aspect wasn't reproduced, so I now have to make time to prune down my actual project to get a minimal reproducer.
Here's the corruption-only reproduction:
```rust
//! Demonstration of serde_json issue #465
//!
//! This outputs corrupted JSON due to the default serializer for
//! std::path::PathBuf not being able to handle all possible values for
//! PathBuf and `SerializeSeq::serialize_element` not ensuring well-formed
//! output in the failure case.
//!
//! **NOTE:** This does **not** demonstrate the "silent corruption" case of
//! this bug which I've observed in my own codebase. I will produce a second
//! test case which demonstrates that.
//!
//! **NOTE:** This must be run on a POSIX platform as it demonstrates the
//! problem based on POSIX allowing arbitrary bytes in paths and I don't have a
//! Windows machine new enough to run the Rust compiler so I can implement a
//! similar demonstration using the un-paired surrogates that Windows's UTF-16
//! support allows for legacy reasons.

// std imports
use std::path::PathBuf;
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;

// serde/serde_json imports
#[macro_use] extern crate serde_derive;
extern crate serde;
extern crate serde_json;
use serde_json::Serializer;
use serde::ser::SerializeSeq;
use serde::ser::Serializer as TSerializer;

const GOOD: &[u8] = b"/good/path";
const BAD: &[u8] = b"/mojibake/fran\xe7ais/path"; // latin1 on-disk path on a utf8 distro
const PATHS: &[&[u8]] = &[GOOD, GOOD, BAD, GOOD, BAD, GOOD];

#[derive(Debug, Serialize, Deserialize)]
struct File { path: PathBuf }

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Streaming serialization which, in real life, would go into a file to
    // limit RAM usage.
    let output: Vec<u8> = Vec::new();

    println!("ATTEMPTING TO SERIALIZE PathBuf CONTAINING INVALID UTF-8...");
    let mut serializer = Serializer::pretty(output);
    {
        let mut seq = (&mut serializer).serialize_seq(None)?;
        for path in PATHS {
            let entry = File { path: PathBuf::from(OsString::from_vec(path.to_vec())) };
            if let Err(err) = seq.serialize_element(&entry) {
                eprintln!("Error attempting to serialize entry: {}", err);
            }
        }
        seq.end()?;
    }

    let result = String::from_utf8(serializer.into_inner())?;
    println!("\nRESULTING JSON:\n{}", result);

    println!("\nATTEMPTING TO PARSE RESULTING JSON...");
    let _: Vec<File> = serde_json::from_str(&result)?;
    Ok(())
}
```
This situation is trivially easy to encounter, simply by moving a drive formatted in a filesystem which relies on out-of-band path encoding information (ie. "paths are chunks of bytes separated by `/` and terminated by `\0`") between two OSes which don't use the same filename encoding.

(In my case, I have a bunch of files that got mojibake'd because they're old enough to have transitioned from a distro that was using latin1 as its filesystem encoding rather than UTF-8, and I just mounted the old filesystem into the new distro.)
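To make the failure mode concrete, here's a minimal std-only sketch (my own illustration, POSIX-specific, reusing the `BAD` bytes from the reproduction above) showing that such a path has no lossless `&str` view:

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;

fn main() {
    // A latin1-encoded "français" path component, as raw bytes read from a
    // POSIX filesystem.
    let raw = b"/mojibake/fran\xe7ais/path".to_vec();
    let path = OsString::from_vec(raw);

    // The bytes are not valid UTF-8, so a lossless &str view is impossible...
    assert!(path.to_str().is_none());

    // ...and the lossy view substitutes U+FFFD, destroying the original byte,
    // so the on-disk path can no longer be recovered from the string.
    assert_eq!(path.to_string_lossy(), "/mojibake/fran\u{fffd}ais/path");
}
```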
What is your expectation for what this program should print if it were behaving correctly?
That depends on what level you look at it, since I see two bugs being exposed:

At the higher level, I'd like invalid UTF-8 in a `PathBuf` to successfully round-trip through `serde_json` with minimal complication of the common case and a readable JSON representation. (See my suggestion about use of either `\0` or some extremely niche BMP Unicode codepoint as an escape character, so as to not necessitate escaping of characters within typical paths.)

If following the "use `\0` as an escape character" approach I suggested, that would look like this:
```
ATTEMPTING TO SERIALIZE PathBuf CONTAINING INVALID UTF-8...

RESULTING JSON:
[
  {
    "path": "/good/path"
  },
  {
    "path": "/good/path"
  },
  {
    "path": "/mojibake/fran\u0000\u00e7ais/path"
  },
  {
    "path": "/good/path"
  },
  {
    "path": "/mojibake/fran\u0000\u00e7ais/path"
  },
  {
    "path": "/good/path"
  }
]

ATTEMPTING TO PARSE RESULTING JSON...
```
I want this for three reasons:

- I don't want `serde_json` to conflate `String` and a type that wraps `OsString` in a language that put so much effort into making conversions between the two explicit.
- `String` makes users choose between `from_utf8`, which returns a `Result`, and `from_utf8_lossy`, which substitutes U+FFFD.
- There's such a potential for verbosity to skyrocket when `derive` ceases to be an option.

At the next level down, I'd like `SerializeSeq::serialize_element` to succeed or fail atomically, so that an error doesn't leave the serialized document in an inconsistent state and code can, thus, recover from a serialization error. (eg. With the example code, which just fires off a log message on `Err`, `File` structs which fail to serialize should be omitted from the resulting JSON, but the JSON should still be well-formed.)
That would look like this:
```
ATTEMPTING TO SERIALIZE PathBuf CONTAINING INVALID UTF-8...
Error attempting to serialize entry: path contains invalid UTF-8 characters
Error attempting to serialize entry: path contains invalid UTF-8 characters

RESULTING JSON:
[
  {
    "path": "/good/path"
  },
  {
    "path": "/good/path"
  },
  {
    "path": "/good/path"
  },
  {
    "path": "/good/path"
  }
]

ATTEMPTING TO PARSE RESULTING JSON...
```
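That second behaviour can be approximated from the caller's side today by validating each entry before handing it to the serializer. A minimal std-only sketch of the idea (my own workaround, not a `serde_json` fix; the actual `serialize_element` calls are omitted so it stands alone):

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;
use std::path::PathBuf;

fn main() {
    let paths: Vec<PathBuf> = [
        &b"/good/path"[..],
        &b"/mojibake/fran\xe7ais/path"[..],
        &b"/good/path"[..],
    ]
    .iter()
    .map(|bytes| PathBuf::from(OsString::from_vec(bytes.to_vec())))
    .collect();

    // Pre-filter: only paths that are valid UTF-8 can survive the default
    // PathBuf serializer, so skip (and log) the rest *before* calling
    // serialize_element. The serialized sequence then stays well-formed.
    let mut serializable = Vec::new();
    for path in &paths {
        match path.to_str() {
            Some(_) => serializable.push(path),
            None => eprintln!("Skipping path with invalid UTF-8: {:?}", path),
        }
    }
    assert_eq!(serializable.len(), 2); // only the two good paths remain
}
```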
Bah again. The "silent" part is a heisenbug that vanished while I was trying to isolate it and now I can't figure out how to get it back.
While my main concern is making the serialization and deserialization process for paths within the same platform both infallible and non-destructive, I got to thinking about cross-platform portability for my idea for escaping invalid UTF-8.

Given that POSIX paths containing a mix of UTF-8 and arbitrary bytes can represent a strict superset of what the Win32 APIs allow in paths, if portable escaped paths are desired, I'd say that the ideal solution for portability is to adopt the `ntfs-3g` behaviour for rendering un-paired UTF-16 surrogates in a "UTF-8 with arbitrary invalid bytes" path-rendering paradigm, to prevent mojibake when moving NTFS-formatted drives and paths back and forth between Linux and Windows hosts.

The decoding side is the tricky part, since convenience and round-trippability are at odds. Successful round-tripping would require keeping some kind of indicator in the deserialized path to distinguish bytes escaped as codepoints from actual codepoints, while the most convenient solution would probably be just to interpret the escaped bytes using whatever Linux codepage serves a role equivalent to the codepage the Windows system has set for legacy applications, since that's likely to be what the escaped bytes were intended to mean if the Linux machine generating the JSON and the Windows machine decoding the JSON are in the same locale.

(That said, it would still only occupy one codepoint on the Windows side as an escape character, leaving paths which lack that codepoint or arbitrary bytes to Just Work™.)
Since I'm not seeing any progress on this, here's the code, including some unit tests, that I cooked up for solving the problem on POSIX platforms in a backwards-compatible way. I've released it under the same MIT/Apache2 dual-license as Serde.

https://gist.github.com/ssokolow/0d9f5c5e4a8a37a962875af205bcc723

As I planned, it works by using `\0` (which cannot appear in OS paths but is valid in JSON strings) as an escape character indicating that the following codepoint is an escaped byte. (ie. that it was serialized by interpreting the byte as a codepoint in the range 0-255.)
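A condensed sketch of that approach (my own reconstruction, not the Gist's exact code, and without handling a literal `\0` in the input): runs of valid UTF-8 pass through untouched, and each invalid byte becomes `\0` followed by the byte value read as a codepoint:

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::{OsStrExt, OsStringExt};

/// Escape arbitrary POSIX path bytes into a valid `String`: runs of valid
/// UTF-8 pass through untouched; each invalid byte `b` becomes '\0' followed
/// by the codepoint with value `b`.
fn escape_path(path: &OsStr) -> String {
    let mut out = String::new();
    let mut bytes = path.as_bytes();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(e) => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                out.push_str(std::str::from_utf8(valid).unwrap());
                out.push('\0');
                out.push(rest[0] as char); // byte value reinterpreted as a codepoint
                bytes = &rest[1..];
            }
        }
    }
}

/// Reverse the escaping: '\0' marks the next codepoint as a raw byte value.
fn unescape_path(escaped: &str) -> OsString {
    let mut out = Vec::new();
    let mut chars = escaped.chars();
    while let Some(c) = chars.next() {
        if c == '\0' {
            if let Some(byte) = chars.next() {
                out.push(byte as u32 as u8);
            }
        } else {
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        }
    }
    OsString::from_vec(out)
}

fn main() {
    let raw = OsString::from_vec(b"/mojibake/fran\xe7ais/path".to_vec());
    let escaped = escape_path(&raw);
    assert_eq!(escaped, "/mojibake/fran\0\u{e7}ais/path");
    assert_eq!(unescape_path(&escaped), raw); // lossless round-trip
}
```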
Since `\0` should be precluded by `Path` and `PathBuf`, this "using `\0` as an escape character" technique should be 100% backwards compatible with existing serde stuff, but I'm willing to have it also escape `\0` as `\0\0` if you consider that necessary.
EDIT: Relying on "the OS will never provide a path containing `\0`" is bugging me. Give me tomorrow and I'll update the Gist so that, if a `\0` does wind up in such a string, it'll round-trip successfully.
Done. The escape/unescape code in that Gist should now round-trip anything that can occur in an `OsStr` or `OsString` and, since `serde_json` currently `panic!`s on encountering invalid UTF-8 in a `Path` or `PathBuf` and the OS won't allow `\0` in them, it should be completely backwards compatible.
- The OS won't allow `\0` in paths, so any Serde `Path` or `PathBuf` output containing it is already corrupted in some fashion, and I think it's reasonable to break deserialization of those in order to fix serialization of paths that do exist in the wild and present lurking footguns for Serde-using codebases.
- `\0` is valid in JSON strings and spec-compliant implementations are required to handle it properly, so using it as an escape character wouldn't break the spec.
While testing this simple schema, I was lulled into a false sense of security by `serde_json` providing a serializer for `PathBuf` by default. Specifically, when I moved on to more thorough testing, serialization failed with this message:
Wallpapering over the fundamental point to having a distinction between `String` and `OsString` (which `PathBuf` wraps) in this "do what I mean and hope it doesn't break" sort of way feels more like what I'd expect of a serializer for a weakly typed language, like Perl or PHP, rather than a "Strongly typed JSON library for Rust", which makes for a big footgun.

If you're unwilling to be opinionated about how to escape/encode arbitrary bytestrings (POSIX) or un-paired UTF-16 surrogates (Windows) into UTF-8 for storage in JSON, I'd much prefer serde_json to take inspiration from `Result` and force me to explicitly choose a resolution strategy before my code will compile.

If you are willing to be opinionated, I've been pondering the best way to round-trip invalid UTF-8 while minimizing the chance of characters from fully valid Unicode paths changing during the escaping/encoding process, and my conclusion has been that the best solution would be to use NULL as an escape character, so invalid bytes could be escaped as `\0xx` and then encoded as `\u0000xx`, with `xx` being the hexadecimal representation of the byte in question:

- Deserializers already have to handle `\u0000` gracefully to protect against exploits via untrusted input.
- `std::ffi::CStr::from_bytes_with_nul` and `std::ffi::CString::new` return `Result` values specifically to guard against accidentally using strings with interior nulls, so it wouldn't be a safety concern there.

Failing that, the next-best alternative is to pick an escape character that deserializers are OK with, but is as unlikely to show up in real paths as possible. (The goal still being to accomplish the closest possible analogue to "All 7-bit ASCII is valid UTF-8".) ...so something on the Basic Multilingual Plane for maximum compatibility with things like the character renderer in my gVim, but as esoteric as possible.
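A sketch of what that `\0` + two-hex-digit encoding could look like (my own illustration; the `escape_invalid_byte` helper is made up, not part of any library):

```rust
// Hypothetical illustration of the proposed "\0 + two hex digits" scheme:
// each byte that can't be decoded as UTF-8 is emitted as the NUL character
// followed by its two-digit lowercase hex value, which a JSON serializer
// would then render as \u0000 plus the two digits.
fn escape_invalid_byte(out: &mut String, byte: u8) {
    out.push('\0');
    out.push_str(&format!("{:02x}", byte));
}

fn main() {
    // "français" with a latin1-encoded 0xE7 in place of UTF-8 "ç":
    let mut escaped = String::from("/mojibake/fran");
    escape_invalid_byte(&mut escaped, 0xe7);
    escaped.push_str("ais/path");
    assert_eq!(escaped, "/mojibake/fran\0e7ais/path");
    // In the emitted JSON string this would appear as:
    //   "/mojibake/fran\u0000e7ais/path"
}
```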