Closed 8573 closed 9 months ago
Thanks for reporting this. I didn't know about the Miri tool, it is useful!
Please note that the first example, as well as many other examples, uses the sentinel method to check for the end of input, which requires that the string is non-empty: it must contain at least one character, the sentinel, or the method won't work at all. The sentinel method is just one of the ways to check for the end of input (the fastest and simplest one, and best suited for C-style null-terminated strings). But if I try Miri on an example that uses the bounds checking method (taken from here), there is no error.
So, the true way to solve this would be to express the contract that the input string contains the sentinel character. I don't know whether it's possible in Rust.
If speed is not an issue, one can use the --no-unsafe option and simple indexing s[cursor]. Or it can be used in debug builds, leaving *s.get_unchecked(cursor) for release builds. All the examples are tested in both modes.
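A minimal sketch of the debug/release split described above. The `peek` helper is hypothetical (it is not part of re2rust), and how it would be wired into a YYPEEK definition is left open here:

```rust
/// Reads the byte at `cursor` (hypothetical helper, not part of re2rust).
///
/// Debug builds use checked indexing, so a bad index panics.
/// Release builds skip the bounds check.
///
/// # Safety
///
/// `cursor` must be less than `s.len()`.
#[inline]
unsafe fn peek(s: &[u8], cursor: usize) -> u8 {
    if cfg!(debug_assertions) {
        // Checked indexing: panics instead of reading out of bounds.
        s[cursor]
    } else {
        // SAFETY: the caller guarantees `cursor < s.len()`.
        unsafe { *s.get_unchecked(cursor) }
    }
}
```

With this shape, a violated contract shows up as a panic while testing a debug build, and the release build keeps the unchecked read.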
You could use iter as well like this (YYPEEK is an expression and you can define it in any suitable way):
fn lex(s: &[u8]) -> bool {
    let mut cursor = s.iter().peekable();
    /*!re2c
        re2c:define:YYCTYPE = u8;
        re2c:define:YYPEEK = "match cursor.peek() { Some(c) => **c, None => panic!(\"oh no!\") }";
        re2c:define:YYSKIP = "cursor.next();";
        re2c:yyfill:enable = 0;

        number = [1-9][0-9]*;
        number { return true; }
        * { return false; }
    */
}

fn main() {
    assert!(lex(b"1234\0"));
}
Compile it with the --no-unsafe option: re2rust --no-unsafe example.re -o example.rs
Or else instead of panicking you could return the sentinel symbol:
fn lex(s: &[u8]) -> bool {
    let mut cursor = s.iter().peekable();
    /*!re2c
        re2c:define:YYCTYPE = u8;
        re2c:define:YYPEEK = "match cursor.peek() { Some(c) => **c, None => 0 }";
        re2c:define:YYSKIP = "cursor.next();";
        re2c:yyfill:enable = 0;

        number = [1-9][0-9]*;
        number { return true; }
        * { return false; }
    */
}

fn main() {
    assert!(lex(b"1234\0"));
}
But having bounds checks on every symbol slows down the lexer. If performance is essential, either use the sentinel method with a guarantee that the input is sentinel-terminated, or use bounds checks with padding.
So, the true way to solve this would be to express the contract that the input string contains the sentinel character. I don't know whether it's possible in Rust.
When the sentinel is '\0', this contract could be expressed by having the lex function take &CStr rather than &[u8].
Otherwise, the lex function could either check the input up front, with if !s.contains(sentinel) { return Err(...); } or assert!(s.contains(sentinel)) (contains could be replaced with a function optimized for the needle (the sentinel) being near the end of the haystack (s), like memrchr), or be an unsafe fn, start with debug_assert!(s.contains(sentinel)), and document the contract:
/// Lexes a byte slice
///
/// # Safety
///
/// The input byte slice must contain the byte [...].
/// This function has undefined behavior otherwise.
unsafe fn lex(...) { ... }
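To make these options concrete, here is a rough sketch of all three around one unsafe core. Everything in it is an assumption for illustration: the names `lex_unchecked`, `lex_checked`, `lex_cstr`, and `MissingSentinel` are hypothetical, and the lexer body is a hand-written stand-in for the generated code that matches [1-9][0-9]* at the start of the input.

```rust
use std::ffi::CStr;

const SENTINEL: u8 = 0;

/// Hypothetical error type for input that lacks the sentinel.
#[derive(Debug, PartialEq)]
struct MissingSentinel;

/// Hand-written stand-in for a generated lexer.
///
/// # Safety
///
/// `s` must contain the byte `SENTINEL`.
unsafe fn lex_unchecked(s: &[u8]) -> bool {
    debug_assert!(s.contains(&SENTINEL));
    let mut cursor = 0;
    loop {
        // SAFETY: the sentinel is not a digit, so the loop returns at or
        // before the sentinel's index and `cursor` stays in bounds.
        let c = unsafe { *s.get_unchecked(cursor) };
        let matches = if cursor == 0 {
            (b'1'..=b'9').contains(&c)
        } else {
            c.is_ascii_digit()
        };
        if !matches {
            return cursor > 0;
        }
        cursor += 1;
    }
}

/// Safe wrapper: verifies the contract and reports a violation as an error.
fn lex_checked(s: &[u8]) -> Result<bool, MissingSentinel> {
    // Scan from the back, where the sentinel usually sits.
    if !s.iter().rev().any(|&b| b == SENTINEL) {
        return Err(MissingSentinel);
    }
    // SAFETY: the sentinel was just found in `s`.
    Ok(unsafe { lex_unchecked(s) })
}

/// Safe wrapper: the `&CStr` type itself carries the contract.
fn lex_cstr(s: &CStr) -> bool {
    // SAFETY: `to_bytes_with_nul` always ends with a 0 byte.
    unsafe { lex_unchecked(s.to_bytes_with_nul()) }
}
```

With these definitions, lex_checked(b"1234") returns an error instead of reading past the slice, and callers holding a &CStr pay no run-time check at all.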
... or use bounds checks with padding.
I see that this is the default method. I suppose the sentinel method is used in the opening example in the manual because it gives simpler code?
Given this, I think a more Rust-idiomatic approach to the opening example would be to use the simple sentinel method there, but with simple and safe x[y] indexing and a note that the bounds-checks-with-padding method is the default and, in practice, for performance, one could use that instead. It would follow Rust expectations to use a simple but slower approach in an opening example for teaching the new user, and then to show how to do the same thing faster, and to prevent UB in every case.
It often is possible to find a solution that avoids unsafety and that the Rust compiler can optimize to be as fast as or faster than an unsafe solution (see, e.g., "How to avoid bounds checks in Rust (without unsafe!)"), but I don't feel experienced enough with this, so I am requesting advice from more experienced Rust people.
Just as an aside, in the bounds check with padding code, I see the line
buf.extend(vec![0; YYMAXFILL]);
and this is just totally wasteful if you care about performance. It heap allocates a totally separate vec, zeroes the vec, then copies the zeroes to the buffer.
It would be hugely better to use an array (which doesn't by itself go on the heap), or alternately core::iter::repeat combined with take.
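For illustration, the two suggestions might look like the sketch below. The YYMAXFILL value is a placeholder (the real constant is generated by re2c), the function names are hypothetical, and the Vec::resize variant is an extra alternative that was not mentioned above.

```rust
use std::iter;

// Placeholder value; the real YYMAXFILL is generated by re2c.
const YYMAXFILL: usize = 4;

/// Pads from a stack-allocated array: no temporary heap allocation.
fn pad_with_array(buf: &mut Vec<u8>) {
    buf.extend_from_slice(&[0u8; YYMAXFILL]);
}

/// Pads from an iterator that yields YYMAXFILL zero bytes.
fn pad_with_repeat(buf: &mut Vec<u8>) {
    buf.extend(iter::repeat(0u8).take(YYMAXFILL));
}

/// Extra alternative (not from this thread): grow the Vec in place.
fn pad_with_resize(buf: &mut Vec<u8>) {
    buf.resize(buf.len() + YYMAXFILL, 0);
}
```

All three leave the buffer in the same state as buf.extend(vec![0; YYMAXFILL]) but skip the temporary Vec.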
@8573 Forgot to reply to this part of your question:
Do the benchmarks cover re2rust?
No, not yet.
https://github.com/skvadrik/re2c/commit/08d414b8b284a64c3f2de149ae1b5425591625aa should fix this. I opted for simple indexing in the intro example, and assertions in the other ones, as it is a trivial constant-time operation to check that the input slice is sentinel-terminated.
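Roughly, the two styles look like this. This is a hand-written sketch, not the code from the commit: `lex_indexed` and `lex_asserted` are hypothetical names, and both bodies stand in for generated code that matches [1-9][0-9]* at the start of the input.

```rust
const SENTINEL: u8 = 0;

/// Intro-example style: plain indexing, so bad input panics instead of causing UB.
fn lex_indexed(s: &[u8]) -> bool {
    let mut cursor = 0;
    loop {
        let c = s[cursor]; // bounds-checked on every read
        let matches = if cursor == 0 {
            (b'1'..=b'9').contains(&c)
        } else {
            c.is_ascii_digit()
        };
        if !matches {
            return cursor > 0;
        }
        cursor += 1;
    }
}

/// Style for the other examples: one constant-time check, then unchecked reads.
fn lex_asserted(s: &[u8]) -> bool {
    // Looks only at the last byte, so the cost does not depend on input length.
    assert_eq!(s.last(), Some(&SENTINEL), "input must be sentinel-terminated");
    let mut cursor = 0;
    loop {
        // SAFETY: the last byte is the sentinel, which matches neither
        // pattern below, so the loop returns before `cursor` reaches `s.len()`.
        let c = unsafe { *s.get_unchecked(cursor) };
        let matches = if cursor == 0 {
            (b'1'..=b'9').contains(&c)
        } else {
            c.is_ascii_digit()
        };
        if !matches {
            return cursor > 0;
        }
        cursor += 1;
    }
}
```

In both versions an empty or unterminated input ends in a panic rather than an out-of-bounds read.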
Just as an aside, in the bounds check with padding code, I see the line
buf.extend(vec![0; YYMAXFILL]);
and this is just totally wasteful if you care about performance. It heap allocates a totally separate vec, zeroes the vec, then copies the zeroes to the buffer.
It would be hugely better to use an array (which doesn't by itself go on the heap), or alternately core::iter::repeat combined with take.
Thanks, I fixed this in https://github.com/skvadrik/re2c/commit/08d414b8b284a64c3f2de149ae1b5425591625aa.
Closing, please reopen if you think there are still issues.
Many of the examples in the re2rust manual call the slice method get_unchecked on a slice that may be empty, which incurs undefined behavior (UB).
For example, if I take the first example's generated code, copy it to the Rust Playground, change the example input bytestring to the empty bytestring, and run Miri (Tools > Miri) on it, the UB is quickly detected.
The simplest solution would be to change the various x.get_unchecked(y) to x[y], which is the same but with run-time safety checks, which may be optimized out. (Do the benchmarks cover re2rust?)
A more idiomatic solution would be to use an iterator, such as that returned by the iter method, rather than integer indices. The statement let mut cursor = 0 could change to let mut cursor = s.iter().peekable(), *s.get_unchecked(cursor) to cursor.peek(), and cursor += 1 to cursor.next(), but there would need to be a way to handle the None value that peek() returns at end of input. I have no prior familiarity with re2c/re2rust and I don't know how well this would fit into its expectations.