serde-rs / json

Strongly typed JSON library for Rust
Apache License 2.0

PathBuf serialization can fail (non-streaming) or induce corruption (streaming) on encountering non-UTF8-encodable paths #465

Open ssokolow opened 6 years ago

ssokolow commented 6 years ago

While testing this simple schema, I was lulled into a false sense of security by serde_json providing a serializer for PathBuf by default.

#[derive(Debug, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
pub enum Entry {
    File {
        path: PathBuf,
        inode: Option<u64>,
        size: u64,
        modified: Option<SystemTime> // FIXME: Work around serde_json bug #464
    },

    Directory {
        path: PathBuf,
        inode: Option<u64>,
        modified: Option<SystemTime> // FIXME: Work around serde_json bug #464
    },

    Symlink {
        path: PathBuf,
        target: Option<PathBuf>
    },

    Unknown { path: PathBuf }
}

Specifically, when I moved on to more thorough testing, serialization failed with this message:

error: serialization failure
caused by: path contains invalid UTF-8 characters

Papering over the fundamental reason for the distinction between String and OsString (which PathBuf wraps) in this "do what I mean and hope it doesn't break" fashion feels like something I'd expect from a serializer for a weakly typed language like Perl or PHP, not from a "Strongly typed JSON library for Rust", and it makes for a big footgun.

If you're unwilling to be opinionated about how to escape/encode arbitrary bytestrings (POSIX) or un-paired UTF-16 surrogates (Windows) into UTF-8 for storage in JSON, I'd much prefer serde_json to take inspiration from Result and force me to explicitly choose a resolution strategy before my code will compile.
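For comparison, the standard library already forces this kind of explicit choice when going from a path to a string; a sketch of what the two resolution strategies look like today (plain std, nothing serde-specific):

```rust
use std::borrow::Cow;
use std::path::Path;

fn main() {
    let p = Path::new("/good/path");
    // Strict: returns None if the path is not valid UTF-8, so the
    // caller must decide what to do about the failure case.
    let strict: Option<&str> = p.to_str();
    // Lossy: invalid bytes become U+FFFD, an explicit, visible data loss.
    let lossy: Cow<str> = p.to_string_lossy();
    println!("{:?} / {:?}", strict, lossy);
}
```

A serializer could demand the same decision up front instead of failing (or corrupting output) at runtime.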

If you are willing to be opinionated, I've been pondering the best way to round-trip invalid UTF-8 while minimizing the chance of characters in fully valid Unicode paths changing during the escaping/encoding process. My conclusion is that the best solution would be to use NULL as an escape character, so invalid bytes could be escaped as \0xx and then encoded as \u0000xx, with xx being the hexadecimal representation of the byte in question.

Failing that, the next-best alternative is to pick an escape character that deserializers accept but that is as unlikely as possible to show up in real paths. (The goal is still the closest possible analogue to "all 7-bit ASCII is valid UTF-8".) That means something on the Basic Multilingual Plane for maximum compatibility with things like the character renderer in my gVim, but as esoteric as possible.
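A minimal sketch of the NUL-escape idea for the POSIX case (the function name and signature here are mine, not a proposed serde_json API): pass valid UTF-8 through untouched, and for each invalid byte emit \0 followed by the byte reinterpreted as a codepoint.

```rust
use std::str;

/// Hypothetical helper: escape arbitrary POSIX path bytes into a String,
/// using NUL as the escape marker for bytes that aren't valid UTF-8.
fn escape_path_bytes(bytes: &[u8]) -> String {
    let mut out = String::new();
    let mut rest = bytes;
    loop {
        match str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                break;
            }
            Err(e) => {
                let (valid, invalid) = rest.split_at(e.valid_up_to());
                out.push_str(str::from_utf8(valid).unwrap());
                out.push('\0');               // escape marker
                out.push(invalid[0] as char); // byte reinterpreted as U+0080..=U+00FF
                rest = &invalid[1..];
            }
        }
    }
    out
}

fn main() {
    // The latin1 0xE7 ('ç') from the example path becomes "\0ç",
    // which a JSON serializer would then write as \u0000\u00e7.
    let escaped = escape_path_bytes(b"/mojibake/fran\xe7ais/path");
    assert_eq!(escaped, "/mojibake/fran\0\u{e7}ais/path");
    println!("{}", escaped.escape_debug());
}
```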

ssokolow commented 6 years ago

It turns out that it's worse than I thought.

I'm now using streaming serialization via SerializeSeq::serialize_element. In debug builds it still surfaces error messages through my handling of the Result, while in --release builds it silently corrupts the output rather than failing to serialize.

  },
  {
    "type": "file",
    "path": ,
    {
      "type": "file",
      "path": ,
      {

...and I know it's not a problem in my code, because I'm just using the pattern below, and other such patterns (eg. surfacing PermissionDenied) function perfectly well in --release builds.

if let Err(err) = seq.serialize_element(&entry) {
    error!("Error attempting to serialize entry: {}", err);
}
dtolnay commented 6 years ago

Could you provide a minimal example that reproduces the silent corruption case?

ssokolow commented 6 years ago

I'm at the end of my day, but I'll try to post one tomorrow.

ssokolow commented 6 years ago

Bah. I've got a minimal reproduction of the corruption, but the "--release silences the error messages that accompany it" aspect didn't reproduce, so I now have to make time to prune down my actual project to get a minimal reproducer.

Here's the corruption-only reproduction:

//! Demonstration of serde_json issue #465
//!
//! This outputs corrupted JSON due to the default serializer for
//! std::path::PathBuf not being able to handle all possible values for
//! PathBuf and `SerializeSeq::serialize_element` not ensuring well-formed
//! output in the failure case.
//!
//! **NOTE:** This does **not** demonstrate the "silent corruption" case of
//! this bug which I've observed in my own codebase. I will produce a second
//! test case which demonstrates that.
//!
//! **NOTE:** This must be run on a POSIX platform as it demonstrates the
//! problem based on POSIX allowing arbitrary bytes in paths and I don't have a
//! Windows machine new enough to run the Rust compiler so I can implement a
//! similar demonstration using the un-paired surrogates that Windows's UTF-16
//! support allows for legacy reasons.

// std imports
use std::path::PathBuf;
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;

// serde / serde_json imports
#[macro_use] extern crate serde_derive;
extern crate serde;
extern crate serde_json;
use serde_json::Serializer;
use serde::ser::SerializeSeq;
use serde::ser::Serializer as TSerializer;

const GOOD: &[u8] = b"/good/path";
const BAD: &[u8] = b"/mojibake/fran\xe7ais/path";  // latin1 on-disk path on a utf8 distro
const PATHS: &[&[u8]] = &[GOOD, GOOD, BAD, GOOD, BAD, GOOD];

#[derive(Debug, Serialize, Deserialize)]
struct File { path: PathBuf }

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let output: Vec<u8> = Vec::new();

    // Streaming serialization which, in real life, would go into a file to limit RAM usage.
    println!("ATTEMPTING TO SERIALIZE PathBuf CONTAINING INVALID UTF-8...");
    let mut serializer = Serializer::pretty(output);
    {
        let mut seq = (&mut serializer).serialize_seq(None)?;
        for path in PATHS {
            let entry = File { path: PathBuf::from(OsString::from_vec(path.to_vec())) };
            if let Err(err) = seq.serialize_element(&entry) {
                eprintln!("Error attempting to serialize entry: {}", err);
            }
        }
        seq.end()?;
    }

    let result = String::from_utf8(serializer.into_inner())?;
    println!("\nRESULTING JSON:\n{}", result);

    println!("\nATTEMPTING TO PARSE RESULTING JSON...");
    let _: Vec<File> = serde_json::from_str(&result)?;
    Ok(())
}

This situation is trivially easy to encounter: simply move a drive formatted with a filesystem that relies on out-of-band path-encoding information (ie. "paths are chunks of bytes separated by / and terminated by \0") between two OSes which don't use the same filename encoding.

(In my case, I have a bunch of files that got mojibake'd because they're old enough to predate my distro's transition from latin1 to UTF-8 as its filesystem encoding, and I just mounted the old filesystem into the new distro.)

dtolnay commented 6 years ago

What is your expectation for what this program should print if it were behaving correctly?

ssokolow commented 6 years ago

That depends on what level you look at it, since I see two bugs being exposed:

  1. At the higher level, I'd like invalid UTF-8 in a PathBuf to successfully round-trip through serde_json with minimal complication of the common case and a readable JSON representation. (See my suggestion about use of either \0 or some extremely niche BMP Unicode codepoint as an escape character so as to not necessitate escaping of characters within typical paths.)

    If following the "use \0 as an escape character" approach I suggested, that would look like this:

      ATTEMPTING TO SERIALIZE PathBuf CONTAINING INVALID UTF-8...
    
      RESULTING JSON:
      [
        {
          "path": "/good/path"
        },
        {
          "path": "/good/path"
        },
        {
          "path": "/mojibake/fran\u0000\u00e7ais/path"
        },
        {
          "path": "/good/path"
        },
        {
          "path": "/mojibake/fran\u0000\u00e7ais/path"
        },
        {
          "path": "/good/path"
        }
      ]
    
      ATTEMPTING TO PARSE RESULTING JSON...

    I want this for three reasons.

    1. Obviously, because I need it for my own project.
    2. It feels like a Bad Thing™ for serde_json to conflate String and a type that wraps OsString in a language that put so much effort into making conversions between the two explicit.
    3. I believe that a flagship serialization solution for a language like Rust should strive to ensure that all bundled serializer implementations for standard library types are either valid for the entire range of the input type or require downstream programmers to explicitly select a resolution strategy, similar to how String makes users choose between from_utf8, which returns a Result, and from_utf8_lossy, which substitutes U+FFFD. (Especially when there's such a potential for verbosity to skyrocket when derive ceases to be an option.)
  2. At the next level down, I'd like seq.serialize_element to succeed or fail atomically, so that an error doesn't leave the serialized document in an inconsistent state and code can, thus, recover from a serialization error. (eg. With the example code, which just fires off a log message on Err, File structs which fail to serialize should be omitted from the resulting JSON, but the JSON should still be well-formed.)

    That would look like this:

      ATTEMPTING TO SERIALIZE PathBuf CONTAINING INVALID UTF-8...
      Error attempting to serialize entry: path contains invalid UTF-8 characters
      Error attempting to serialize entry: path contains invalid UTF-8 characters
    
      RESULTING JSON:
      [
        {
          "path": "/good/path"
        },
        {
          "path": "/good/path"
        },
        {
          "path": "/good/path"
        },
        {
          "path": "/good/path"
        }
      ]
    
      ATTEMPTING TO PARSE RESULTING JSON...
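Until serialize_element fails atomically, the behaviour in case 2 can be approximated in user code by serializing each element to its own buffer first and only appending buffers that serialized cleanly. A sketch, with serialize_entry as a hypothetical stand-in for a fallible per-element serializer such as serde_json::to_string:

```rust
use std::path::Path;

// Hypothetical stand-in for per-element serialization that can fail,
// e.g. serializing a struct containing a PathBuf with invalid UTF-8.
fn serialize_entry(path: &Path) -> Result<String, String> {
    path.to_str()
        .map(|s| format!("{{\"path\": \"{}\"}}", s))
        .ok_or_else(|| "path contains invalid UTF-8 characters".to_owned())
}

// Append only fully serialized elements, so one bad entry cannot
// leave the document malformed.
fn serialize_all(paths: &[&Path]) -> String {
    let mut elems = Vec::new();
    for p in paths {
        match serialize_entry(p) {
            Ok(json) => elems.push(json),
            Err(err) => eprintln!("Error attempting to serialize entry: {}", err),
        }
    }
    format!("[{}]", elems.join(", "))
}

fn main() {
    let good = Path::new("/good/path");
    // Failed entries are logged and skipped; the output stays well-formed.
    println!("{}", serialize_all(&[good, good]));
}
```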
ssokolow commented 6 years ago

Bah again. The "silent" part is a heisenbug that vanished while I was trying to isolate it and now I can't figure out how to get it back.

ssokolow commented 6 years ago

While my main concern is making the serialization and deserialization process for paths within the same platform both infallible and non-destructive, I got to thinking about cross-platform portability for my idea for escaping invalid UTF-8.

Given that POSIX paths containing a mix of UTF-8 and arbitrary bytes can represent a strict superset of what the Win32 APIs allow in paths, if portable escaped paths are desired, I'd say that the ideal solution for portability is to:

  1. Match the ntfs-3g behaviour for rendering un-paired UTF-16 surrogates in a "UTF-8 with arbitrary invalid bytes" path-rendering paradigm to prevent mojibake when moving NTFS-formatted drives and paths back and forth between Linux and Windows hosts.
  2. Decide on the most useful way to represent invalid UTF-8 from UNIX-style "arbitrary bytes" paths under Windows's "UTF-16 with relaxed surrogate invariants" paradigm.

Number 2 is the tricky part, since convenience and round-trippability are at odds.

Successful round-tripping would require keeping some kind of indicator in the deserialized path to distinguish bytes escaped as codepoints from actual codepoints. The most convenient solution, on the other hand, would probably be to interpret the escaped bytes using whatever encoding on the Linux side corresponds to the codepage the Windows system has set for legacy applications, since that's likely to be what the escaped bytes were intended to mean if the Linux machine generating the JSON and the Windows machine decoding it are in the same locale.

(That said, it would still only occupy one codepoint on the Windows side as an escape character, leaving paths which lack that codepoint or arbitrary bytes to Just Work™.)

ssokolow commented 4 years ago

Since I'm not seeing any progress on this, here's the code, including some unit tests, that I cooked up for solving the problem on POSIX platforms in a backwards-compatible way. I've released it under the same MIT/Apache2 dual-license as Serde.

https://gist.github.com/ssokolow/0d9f5c5e4a8a37a962875af205bcc723

As I planned, it works by using \0 (which cannot appear in OS paths but is valid in JSON strings) as an escape character indicating that the following codepoint is an escaped byte. (ie. that it was serialized by interpreting the byte as a codepoint in the range 0-255.)

Since \0 should be precluded by Path and PathBuf, this "using \0 as an escape character" technique should be 100% backwards compatible with existing serde stuff, but I'm willing to have it also escape \0 as \0\0 if you consider that necessary.

EDIT: Relying on "the OS will never provide a path containing \0" is bugging me. Give me tomorrow and I'll update the Gist so that, if a \0 does wind up in such a string, it'll round-trip successfully.

ssokolow commented 4 years ago

Done. The escape/unescape code in that Gist should now round-trip anything that can occur in an OsStr or OsString and, since serde_json currently panic!s on encountering invalid UTF-8 in a Path or PathBuf and the OS won't allow \0 in them, it should be completely backwards compatible.
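For reference, the overall scheme can be sketched roughly like this (this is my own sketch, not the Gist's actual code; names and details are mine). Invalid bytes become \0 plus the byte-as-codepoint, and a literal NUL, should one ever appear, self-escapes as \0\0, which keeps the two cases unambiguous because an invalid UTF-8 byte can never be 0x00:

```rust
use std::str;

/// Escape arbitrary OS path bytes into a String using NUL as the marker.
fn escape_os_bytes(bytes: &[u8]) -> String {
    let mut out = String::new();
    let mut rest = bytes;
    loop {
        match str::from_utf8(rest) {
            Ok(valid) => {
                push_escaping_nul(&mut out, valid);
                break;
            }
            Err(e) => {
                let (valid, invalid) = rest.split_at(e.valid_up_to());
                push_escaping_nul(&mut out, str::from_utf8(valid).unwrap());
                out.push('\0');
                out.push(invalid[0] as char); // always 0x80..=0xFF, never 0x00
                rest = &invalid[1..];
            }
        }
    }
    out
}

/// Copy a valid-UTF-8 run, self-escaping any literal NUL as "\0\0".
fn push_escaping_nul(out: &mut String, s: &str) {
    for c in s.chars() {
        if c == '\0' {
            out.push('\0');
        }
        out.push(c);
    }
}

fn unescape_os_bytes(s: &str) -> Vec<u8> {
    let mut out = Vec::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c == '\0' {
            match chars.next() {
                Some('\0') => out.push(0),    // "\0\0" -> literal NUL byte
                Some(b) => out.push(b as u8), // "\0" + codepoint -> raw byte
                None => {}                    // trailing NUL: malformed, ignore
            }
        } else {
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        }
    }
    out
}

fn main() {
    for case in [&b"/mojibake/fran\xe7ais/path"[..], &b"nul\x00and\xffbyte"[..]] {
        assert_eq!(unescape_os_bytes(&escape_os_bytes(case)), case);
    }
    println!("round-trip ok");
}
```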