Permit use of (?-u) in byte-regex strategies (#336)

zackw commented 1 year ago

It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex

[0-9]+(\.[0-9]*)?

Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex

(?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)

However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 as well as valid UTF-8 but not within the accepted language. Then you want this slight variation:

(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)

But this regex will be rejected by bytes_regex, because by default regex_syntax::Parser errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for bytes_regex only.

There should be no change to the behavior of any existing test suite, because opting to allow use of (?-u) does not change the semantics of any regex that doesn’t contain (?-u), and any existing regex that does contain (?-u) must be incapable of generating invalid UTF-8 for other reasons, or regex_syntax::Parser would be rejecting it. (For example, (?-u:[a-z]) cannot generate invalid UTF-8.)

This patch also adds a bunch of tests for bytes_regex, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — do_test and do_test_bytes are almost identical, as are generate_values_matching_regex and generate_byte_values_matching_regex. I am not good enough at generic metaprogramming in Rust to factor out the duplication.

cameron1024 commented 1 year ago

Thanks for the PR 😅 I'll have a look at this more closely when I have time, but upon first inspection it looks sensible.

If I haven't reviewed by the end of this weekend feel free to ping me 😅

matthew-russo commented 1 year ago

Agh thanks for reminding me I need to look at this -- will do it this evening after work

zackw commented 1 year ago

Rebased on latest trunk, sorry for the delay.

zackw commented 1 year ago

Added another commit to fix up the code for an API change. I cannot run the test suite locally right now due to what looks like a bug in message-io...

error[E0716]: temporary value dropped while borrowed
   --> .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/message-io-0.17.0/src/adapters/udp.rs:241:22
    |
241 |                 &mut [io::IoSliceMut::new(&mut input_buffer)],
    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ creates a temporary value which is freed while still in use
...
244 |             );
    |              - temporary value is freed at the end of this statement
245 |
246 |             match result {
    |                   ------ borrow later used here
    |
    = note: consider using a `let` binding to create a longer lived value

... but this is sufficient to make proptest build again as a dependency of a larger project that uses the feature this PR adds, and that project's testsuite passes.

The additional commit should be squashed before merging.

zackw commented 1 year ago

Fatfingered the close button, sorry for the noise.

proptest-rs / proptest

Permit use of (?-u) in byte-regex strategies (#336) #337