proptest-rs / proptest

Hypothesis-like property testing for Rust
Apache License 2.0
1.69k stars 158 forks source link

`bytes_regex` should permit generation of byte sequences that are invalid UTF-8 #336

Closed zackw closed 1 year ago

zackw commented 1 year ago

This strategy function

fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
    prop::string::bytes_regex(
        r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)"
    ).unwrap()
}

is intended to generate, among other things, invalid UTF-8 byte sequences, because it's for testing a parser that works directly from data on disk that cannot be trusted. What it actually does is crash on the unwrap() with

thread 'attrs::test::parse_xattr_ts_invalid' panicked at 'called `Result::unwrap()` on an `Err` value:
    RegexSyntax(Translate(Error {
        kind: InvalidUtf8,
        pattern: "(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\\.[0-9]*[^0-9].*)",
        span: Span(Position(o: 7, l: 1, c: 8), Position(o: 13, l: 1, c: 14))
    }))'

Looking at the code, I believe the change needed is for bytes_regex to have its own version of regex_to_hir that calls ParserBuilder::allow_invalid_utf8(true).

This change should have no visible effect on any existing code that uses bytes_regex, since one must also opt into generation of invalid UTF-8 within the regex itself (that's one of the things the (?s-u: does) and any existing regex that uses that flag must be using it in a way that actually can't generate invalid UTF-8, or else they'd hit the same crash I'm hitting.

matthew-russo commented 1 year ago

Hey sorry for the delay in getting to this. Gave this a look -- bytes_regex doesn't allow this but there is a public method bytes_regex_parsed that just takes in &regex_syntax::hir::Hir, allowing you to generate that yourself with a customized regex_syntax::ParserBuilder which should address your needs. Given the workaround is straightforward and works generally for other regex parsing options, I'm inclined to leave things as is and not introduce a new, specific helper but I'm open to hearing arguments for that.

I'm going to close this but feel free to reopen if there are outstanding questions/discussions to be had

The proptest method: https://docs.rs/proptest/latest/proptest/string/fn.bytes_regex_parsed.html

A working example with your pattern:

╰─⠠⠵ cat Cargo.toml
[package]
name = "proptest-playground"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
proptest = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-derive = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-state-machine = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
regex-syntax = { version = "0.7" }
╰─⠠⠵ cat src/main.rs
fn main() {
    println!("Hello, world!");
}

#[cfg(test)]
mod test {
    use proptest::prelude::*;
    use regex_syntax::hir::Hir;
    use regex_syntax::{ParserBuilder, Error};

    fn regex_to_hir(pattern: &str) -> Result<Hir, Error> {
        let mut parser = ParserBuilder::new().utf8(false).build();
        Ok(parser.parse(pattern)?)
    }

    fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
        let pattern = r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)";
        let hir = regex_to_hir(pattern).unwrap();
        prop::string::bytes_regex_parsed(&hir).unwrap()
    }

    proptest!(
        #![proptest_config(ProptestConfig::with_cases(3))]

        #[test]
        fn simple(bytes in invalid_ts()) {
            println!("{:?}", bytes);
        }
    );
}
╰─⠠⠵ cargo test -- --nocapture
   Compiling proptest-playground v0.1.0 (/Users/matthewrusso/projects/user/proptest-playground)
    Finished test [unoptimized + debuginfo] target(s) in 0.23s
     Running unittests src/main.rs (target/debug/deps/proptest_playground-e32ba8f312713645)

running 1 test
[57, 57, 56, 50, 53, 49, 57, 54, 48, 48, 49, 50, 54, 48, 48, 49, 55, 56, 52, 51, 56, 57, 51, 52, 55, 50, 52, 47, 188, 55, 82, 114, 39, 118, 18, 8, 205, 202, 3, 115, 54, 181, 186, 253, 21, 165, 164, 197, 71, 172, 43, 208, 128, 183]
[54, 54, 56, 57, 55, 53, 55, 48, 52, 51, 48, 52, 51, 56, 50, 51, 55, 56, 51, 53, 49, 56, 51, 56, 46, 53, 55, 35, 35, 11, 196, 222, 134, 231, 202, 233, 40, 95, 229, 91, 72, 4, 119, 112, 41, 25, 251, 237, 184, 36, 157, 51, 128, 215]
[57, 53, 57, 51, 55, 51, 49, 57, 52, 56, 48, 56, 56, 56, 49, 49, 55, 49, 50, 49, 56, 51, 55, 57, 49, 54, 47, 248, 53, 205, 23, 39, 137, 171, 14, 133, 3, 102, 253, 209, 228, 29, 98]
[52, 53, 5, 201, 75, 108, 145, 227, 247, 55, 123]
test test::simple ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
zackw commented 1 year ago

I'm going to ask you to reconsider closing this, and instead to merge #337, for two reasons:

  1. bytes_regex already exists. Allowing people to use (?-u) in bytes_regex regexes makes it significantly more useful. I can't presently think of any situation where I would want to use bytes_regex without (?-u), although that may be biased by the kind of tests I'm currently writing (e.g. https://git.sr.ht/~zackw/file-integrity-rs/tree/c8ec4a33d3b4d75648e7568587eeb7a0f4b6c529/item/file_integrity/src/test/attrs.rs#L106).
  2. The authors of the regex_syntax crate describe it as "an implementation detail of the regex crate" and "may experience breaking changes at [an unpredictable] cadence". (See https://github.com/rust-lang/regex/tree/master/regex-syntax#motivation.) proptest is already committed to keeping up with those changes, but users of proptest should not need to grow a dependency on an (allegedly) unstable API for something this simple.
matthew-russo commented 1 year ago

Thanks for the follow up -- I'll give the PR a look some time this week.

And just for clarity, the intention wasn't to bury any requests, just if there was a satisfactory solution without adding new code, I'll generally opt for that. It's clear that the existing solution is not satisfactory

matthew-russo commented 1 year ago

Closed with merge of: https://github.com/proptest-rs/proptest/pull/337