Examples in re2rust manual have undefined behavior

8573 commented 9 months ago

Many of the examples in the re2rust manual call the slice method get_unchecked on a slice that may be empty, which incurs undefined behavior (UB).

For example, if I take the first example's generated code, copy it to the Rust Playground, change the example input bytestring to the empty bytestring, and run Miri (Tools > Miri) on it, the UB is quickly detected.

The simplest solution would be to change the various x.get_unchecked(y) to x[y], which is the same but with run-time safety checks, which may be optimized out. (Do the benchmarks cover re2rust?)

A more idiomatic solution would be to use an iterator, such as that returned by the iter method, rather than integer indices. let mut cursor = 0 could change to let mut cursor = s.iter().peekable(), *s.get_unchecked(cursor) to cursor.peek(), and cursor += 1 to cursor.next(), but there would need to be a way to handle the None value that peek() returns at end of input. I have no prior familiarity with re2c/re2rust and I don't know how well this would fit into its expectations.

skvadrik commented 9 months ago

Thanks for reporting this. I didn't know about the Miri tool, it is useful!

Please note that the first example, as well as many other examples, uses sentinel method to check for the end of input, which requires that the string is non-empty: it must have at least one character, the sentinel, or the method won't work at all. Sentinel method is just one of the ways to check for the end of input (the fastest and simplest one, and best suited for C-style null-terminated strings). But if I try Miri on an example that uses bounds checking method (taken from here), there is no error.

So, the true way to solve this would be to express the contract that the input string contains the sentinel character. I don't know whether it's possible in Rust.

If speed is not an issue, one can use --no-unsafe option and simple indexing s[cursor]. Or it can be used in debug builds, leaving *s.get_unchecked(cursor) for release builds. All the examples are tested in both modes.

You could use iter as well like this (YYPEEK is an expression and you can define it in any suitable way):

fn lex(s: &[u8]) -> bool {
    let mut cursor = s.iter().peekable();
    /*!re2c
        re2c:define:YYCTYPE = u8;
        re2c:define:YYPEEK = "match cursor.peek() { Some(c) => **c, None => panic!(\"oh no!\") }";
        re2c:define:YYSKIP = "cursor.next();";
        re2c:yyfill:enable = 0;

        number = [1-9][0-9]*;

        number { return true; }
        *      { return false; }
    */
}

fn main() {
    assert!(lex(b"1234\0"));
}

Compile it with --no-unsafe option: re2rust --no-unsafe example.re -o example.rs.

skvadrik commented 9 months ago

Or else instead of panicking you could return the sentinel symbol:

fn lex(s: &[u8]) -> bool {
    let mut cursor = s.iter().peekable();
    /*!re2c
        re2c:define:YYCTYPE = u8;
        re2c:define:YYPEEK = "match cursor.peek() { Some(c) => **c, None => 0 }";
        re2c:define:YYSKIP = "cursor.next();";
        re2c:yyfill:enable = 0;

        number = [1-9][0-9]*;

        number { return true; }
        *      { return false; }
    */
}

fn main() {
    assert!(lex(b"1234\0"));
}

But it slows down the lexer to have bounds checks on every symbol. If performance is essential, either use sentinel method with a guarantee that the input is sentinel-terminated, or use bounds checks with padding.

8573 commented 9 months ago

So, the true way to solve this would be to express the contract that the input string contains the sentinel character. I don't know whether it's possible in Rust.

When the sentinel is '\0', this contract could be expressed by having the lex function take &CStr rather than &[u8].

Otherwise, the lex function could either

start with if !s.contains(sentinel) { return Err(...); } or assert!(s.contains(sentinel)) (contains could be replaced with a function optimized for the needle (the sentinel) being near the end of the haystack (s), like memrchr), or

be an unsafe fn, start with debug_assert!(s.contains(sentinel)), and document the contract:

/// Lexes a byte slice
///
/// # Safety
///
/// The input byte slice must contain the byte [...].
/// This function has undefined behavior otherwise.
unsafe fn lex(...) { ... }

8573 commented 9 months ago

... or use bounds checks with padding.

I see that this is the default method. I suppose the sentinel method is used in the opening example in the manual because it gives simpler code?

Given this, I think a more Rust-idiomatic approach to the opening example would be to use the simple sentinel method there, but with simple and safe x[y] indexing and a note that the bounds-checks-with-padding method is the default and, in practice, for performance, one could use that instead. It would follow Rust expectations to use a simple but slower approach in an opening example for teaching the new user, and then to show how to do the same thing faster, and to prevent UB in every case.

It often is possible to find a solution that avoids unsafety and that the Rust compiler can optimize to be as fast as or faster than an unsafe solution (see, e.g., "How to avoid bounds checks in Rust (without unsafe!)"), but I don't feel experienced enough with this, so I am requesting advice from more experienced Rust people.

Lokathor commented 9 months ago

Just, aside, in the bounds check with padding code, I see the line

buf.extend(vec![0; YYMAXFILL]);

and this is just totally wasteful if you care about performance. It heap allocates a totally separate vec, zeroes the vec, then copies the zeroes to the buffer.

It would be hugely better to use an array (which doesn't by itself go on the heap), or alternately core::iter::repeat combined with take.

skvadrik commented 9 months ago

@8573 Forgot to reply to this part of your question:

Do the benchmarks cover re2rust?

No, not yet.

skvadrik commented 9 months ago

https://github.com/skvadrik/re2c/commit/08d414b8b284a64c3f2de149ae1b5425591625aa should fix this. I opted for simple indexing in the intro example, and assertions in the other ones, as it is a trivial constant-time operation to check that the input slice is sentinel-terminated.

skvadrik commented 9 months ago

Just, aside, in the bounds check with padding code, I see the line
buf.extend(vec![0; YYMAXFILL]);
and this is just totally wasteful if you care about performance. It heap allocates a totally separate vec, zeroes the vec, then copies the zeroes to the buffer.

It would be hugely better to use an array (which doesn't by itself go on the heap), or alternately core::iter::repeat combined with take.

Thanks, I fixed this in https://github.com/skvadrik/re2c/commit/08d414b8b284a64c3f2de149ae1b5425591625aa.

skvadrik commented 9 months ago

Closing, please reopen if you think there are still issues.

skvadrik / re2c

Examples in re2rust manual have undefined behavior #471