maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

Undefined behaviour using `.` matcher on unicode #375

bend-n closed this issue 4 months ago

bend-n commented 4 months ago
use logos::Logos;

fn dec(tags: &str) {
    #[derive(logos::Logos, PartialEq, Debug)]
    #[logos(skip r"[\s\n]+")]
    enum Tokens<'s> {
        #[regex(r".", priority = 6)] // splits the 3-byte sequence ee a1 93, yielding only ee
        String(&'s str),
    }
    let lexer = Tokens::lexer(&tags).map(Result::unwrap);
    for Tokens::String(x) in lexer {
        println!("{:?}", x.as_bytes());
        for c in x.chars() {} // << bad
    }
}

#[test]
fn dect() {
    dec("");
}

Running this test under Miri gives:

running 1 test
test dect ... error: Undefined Behavior: entering unreachable code
  --> /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/validations.rs:49:23
   |
49 |     let y = unsafe { *bytes.next().unwrap_unchecked() };
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ entering unreachable code
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
   = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
   = note: BACKTRACE:
   = note: inside `core::str::validations::next_code_point::<'_, std::slice::Iter<'_, u8>>` at /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/validations.rs:49:23: 49:54
   = note: inside `<std::str::Chars<'_> as std::iter::Iterator>::next` at /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/iter.rs:45:18: 45:49
note: inside `dec`
  --> src/main.rs:13:18
   |
13 |         for c in x.chars() {} // << bad
   |                  ^^^^^^^^^
note: inside `dect`
  --> src/main.rs:19:5
   |
19 |     dec("");
   |     ^^^^^^^^
note: inside closure
  --> src/main.rs:18:10
   |
17 | #[test]
   | ------- in this procedural macro expansion
18 | fn dect() {
   |          ^
   = note: this error originates in the attribute macro `test` (in Nightly builds, run with -Z macro-backtrace for more info)

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to 1 previous error; 1 warning emitted

error: test failed, to rerun pass `--bin logo`

This occurs because logos yields a `&str` containing only the byte 0xee, which on its own is clearly invalid UTF-8. In other words, logos commits library UB by slicing the input in the middle of a code point rather than on a character boundary.
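To make the invalidity concrete, here is a minimal sketch using only the standard library (no logos involved). It assumes the input character was U+E853, which is what the bytes ee a1 93 decode to; `utf8_len` is a hypothetical helper written for this illustration, not part of any library.

```rust
use std::str;

/// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
/// with `first`, or `None` if `first` cannot start a sequence.
fn utf8_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7f => Some(1),
        0xc2..=0xdf => Some(2),
        0xe0..=0xef => Some(3),
        0xf0..=0xf4 => Some(4),
        _ => None, // continuation bytes and invalid lead bytes
    }
}

fn main() {
    let bytes = [0xee_u8, 0xa1, 0x93];

    // The full three-byte sequence is valid UTF-8 and decodes to one char.
    let s = str::from_utf8(&bytes).unwrap();
    assert_eq!(s.chars().next(), Some('\u{e853}'));
    assert_eq!(s.chars().count(), 1);

    // The lead byte 0xee announces a three-byte sequence...
    assert_eq!(utf8_len(bytes[0]), Some(3));

    // ...so a slice holding only that byte is rejected by the checked API,
    // and byte offset 1 is not a character boundary.
    assert!(str::from_utf8(&bytes[..1]).is_err());
    assert!(!s.is_char_boundary(1));
}
```

A `&str` that holds only `[0xee]` can therefore only be produced through unchecked code, and `str::chars` is allowed to assume it never sees one, which matches the `unwrap_unchecked` frame in the Miri backtrace.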

RustyYato commented 4 months ago

I was debugging a similar issue recently, and I found the problem.

https://github.com/maciejhirsz/logos/blob/ba69cc3d811eb9b51da056da51c9425f910ad3c5/logos-codegen/src/graph/regex.rs#L163-L182

These two checks assume that a maximum-size range should always take the ASCII fast path, but this is wildly incorrect for non-ASCII text. In a local clone, changing the condition to check only that end < 128 fixes the issue.
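The difference between the two conditions can be sketched as follows. This is not the code from `regex.rs`: `ClassRange` and both predicate functions are hypothetical names, and the "buggy" condition is only a guess at its shape based on the description above.

```rust
/// Hypothetical stand-in for one range of a character class
/// (not the type logos-codegen actually uses).
#[derive(Clone, Copy, Debug)]
struct ClassRange {
    start: char,
    end: char,
}

impl ClassRange {
    fn contains(self, c: char) -> bool {
        self.start <= c && c <= self.end
    }
}

/// Assumed shape of the faulty check: a range reaching the maximum code
/// point is also sent down the single-byte (ASCII) path.
fn takes_ascii_path_buggy(r: ClassRange) -> bool {
    r.end < '\u{80}' || r.end == char::MAX
}

/// The check described as the fix: only ranges that end below 128 are ASCII.
fn takes_ascii_path_fixed(r: ClassRange) -> bool {
    (r.end as u32) < 128
}

fn main() {
    // `.` matches everything except '\n', so its upper range ends at char::MAX.
    let dot_upper = ClassRange { start: '\u{0b}', end: char::MAX };
    let ascii_letters = ClassRange { start: 'a', end: 'z' };

    // The range really does contain multi-byte characters such as U+E853.
    assert!(dot_upper.contains('\u{e853}'));

    // Under the assumed buggy check it is matched one byte at a time,
    // which would split ee a1 93 after the first byte.
    assert!(takes_ascii_path_buggy(dot_upper));
    assert!(!takes_ascii_path_fixed(dot_upper));

    // Pure ASCII ranges take the fast path under both checks.
    assert!(takes_ascii_path_buggy(ascii_letters));
    assert!(takes_ascii_path_fixed(ascii_letters));
}
```

With the stricter predicate, a range that ends at `char::MAX` would fall through to whatever multi-byte handling the code generator has, so a match consumes whole code points rather than single bytes.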