Closed zackw closed 1 year ago
Hey sorry for the delay in getting to this. Gave this a look -- bytes_regex
doesn't allow this but there is a public method bytes_regex_parsed
that just takes in ®ex_syntax::hir::Hir
, allowing you to generate that yourself with a customized regex_syntax::ParserBuilder
which should address your needs. Given the workaround is straightforward and works generally for other regex parsing options, I'm inclined to leave things as is and not introduce a new, specific helper but I'm open to hearing arguments for that.
I'm going to close this but feel free to reopen if there are outstanding questions/discussions to be had
The proptest method: https://docs.rs/proptest/latest/proptest/string/fn.bytes_regex_parsed.html
A working example with your pattern:
╰─⠠⠵ cat Cargo.toml
[package]
name = "proptest-playground"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
proptest = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-derive = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-state-machine = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
regex-syntax = { version = "0.7" }
╰─⠠⠵ cat src/main.rs
fn main() {
println!("Hello, world!");
}
#[cfg(test)]
mod test {
use proptest::prelude::*;
use regex_syntax::hir::Hir;
use regex_syntax::{ParserBuilder, Error};
fn regex_to_hir(pattern: &str) -> Result<Hir, Error> {
let mut parser = ParserBuilder::new().utf8(false).build();
Ok(parser.parse(pattern)?)
}
fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
let pattern = r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)";
let hir = regex_to_hir(pattern).unwrap();
prop::string::bytes_regex_parsed(&hir).unwrap()
}
proptest!(
#![proptest_config(ProptestConfig::with_cases(3))]
#[test]
fn simple(bytes in invalid_ts()) {
println!("{:?}", bytes);
}
);
}
╰─⠠⠵ cargo test -- --nocapture
Compiling proptest-playground v0.1.0 (/Users/matthewrusso/projects/user/proptest-playground)
Finished test [unoptimized + debuginfo] target(s) in 0.23s
Running unittests src/main.rs (target/debug/deps/proptest_playground-e32ba8f312713645)
running 1 test
[57, 57, 56, 50, 53, 49, 57, 54, 48, 48, 49, 50, 54, 48, 48, 49, 55, 56, 52, 51, 56, 57, 51, 52, 55, 50, 52, 47, 188, 55, 82, 114, 39, 118, 18, 8, 205, 202, 3, 115, 54, 181, 186, 253, 21, 165, 164, 197, 71, 172, 43, 208, 128, 183]
[54, 54, 56, 57, 55, 53, 55, 48, 52, 51, 48, 52, 51, 56, 50, 51, 55, 56, 51, 53, 49, 56, 51, 56, 46, 53, 55, 35, 35, 11, 196, 222, 134, 231, 202, 233, 40, 95, 229, 91, 72, 4, 119, 112, 41, 25, 251, 237, 184, 36, 157, 51, 128, 215]
[57, 53, 57, 51, 55, 51, 49, 57, 52, 56, 48, 56, 56, 56, 49, 49, 55, 49, 50, 49, 56, 51, 55, 57, 49, 54, 47, 248, 53, 205, 23, 39, 137, 171, 14, 133, 3, 102, 253, 209, 228, 29, 98]
[52, 53, 5, 201, 75, 108, 145, 227, 247, 55, 123]
test test::simple ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
I'm going to ask you to reconsider closing this, and instead to merge #337, for two reasons:
bytes_regex
already exists. Allowing people to use (?-u)
in bytes_regex
regexes makes it significantly more useful. I can't presently think of any situation where I would want to use bytes_regex
without (?-u)
, although that may be biased by the kind of tests I'm currently writing (e.g. https://git.sr.ht/~zackw/file-integrity-rs/tree/c8ec4a33d3b4d75648e7568587eeb7a0f4b6c529/item/file_integrity/src/test/attrs.rs#L106).regex_syntax
crate describe it as "an implementation detail of the regex
crate" and "may experience breaking changes at [an unpredictable] cadence". (See https://github.com/rust-lang/regex/tree/master/regex-syntax#motivation.) proptest
is already committed to keeping up with those changes, but users of proptest
should not need to grow a dependency on an (allegedly) unstable API for something this simple.Thanks for the follow up -- I'll give the PR a look some time this week.
And just for clarity, the intention wasn't to bury any requests, just if there was a satisfactory solution without adding new code, I'll generally opt for that. It's clear that the existing solution is not satisfactory
Closed with merge of: https://github.com/proptest-rs/proptest/pull/337
This strategy function
is intended to generate, among other things, invalid UTF-8 byte sequences, because it's for testing a parser that works directly from data on disk that cannot be trusted. What it actually does is crash on the
unwrap()
withLooking at the code, I believe the change needed is for
bytes_regex
to have its own version ofregex_to_hir
that callsParserBuilder::allow_invalid_utf8(true)
.This change should have no visible effect on any existing code that uses
bytes_regex
, since one must also opt into generation of invalid UTF-8 within the regex itself (that's one of the things the(?s-u:
does) and any existing regex that uses that flag must be using it in a way that actually can't generate invalid UTF-8, or else they'd hit the same crash I'm hitting.