rust-bakery / nom

Rust parser combinator framework
MIT License

nom::bytes::complete::escaped_transform woes? #1679

Open kitchen opened 11 months ago

kitchen commented 11 months ago

I'm trying to use nom::bytes::complete::escaped_transform and running into some trouble.

Specifically, the function wants an escape char, but I am trying to give it an escape byte, one that doesn't seem to play nicely with an as char cast (namely 0xDB).

It seems that in Rust a char is a Unicode scalar value rather than a single byte. If I'm understanding things correctly, 0xDB is above decimal 127, so as a char (U+00DB) it needs two bytes in UTF-8, and the function appears to treat the escape as two bytes wide: 0xDB plus whatever byte follows it. I wrote a little test case to check for that, and sure enough, the byte right after 0xDB gets skipped.
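As a quick sanity check of the char-versus-byte point, here is a minimal standard-library sketch. The helper name byte_as_char_utf8 is made up for illustration and is not part of nom. It only shows what an as char cast does to a byte and how wide the resulting char is in UTF-8; whether nom relies on that width internally is an assumption.

```rust
/// Hypothetical helper (not part of nom): casts a byte to `char` and reports
/// the char, its UTF-8 width in bytes, and its UTF-8 encoding.
fn byte_as_char_utf8(byte: u8) -> (char, usize, Vec<u8>) {
    // `u8 as char` maps the byte to the code point with the same value
    // (U+0000..=U+00FF); it does not keep the byte as-is in any encoding.
    let c = byte as char;
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf).as_bytes().to_vec();
    (c, c.len_utf8(), encoded)
}
```

For 0xDB this yields U+00DB with a UTF-8 width of 2 and the encoding [0xC3, 0x9B], so the char is not stored as 0xDB 0x00; an ASCII byte such as 0x61 stays one byte wide.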

Anywho, this possibly raises a bigger issue: this function maybe should be in nom::character::complete instead of bytes since it's clearly character oriented? And then a byte-oriented version placed in nom::bytes::complete? Also I wonder how hard it would be to have the escape char argument be another parser, so you could use tag or something else in place (not that I need that, but it might be useful to make it more generic?)
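To make the idea of a byte-oriented version concrete, here is a hand-written sketch that uses only the standard library. It is not nom's API and not a proposed implementation: the function name unescape_bytes and the choice to return Option are assumptions made for illustration, and the constants simply repeat the byte values from the test case.

```rust
const FEND: u8 = 0xC0;
const FESC: u8 = 0xDB;
const TFEND: u8 = 0xDC;
const TFESC: u8 = 0xDD;

/// Hypothetical byte-oriented unescape (not part of nom).
/// Returns `None` when an escape byte is followed by an unknown byte
/// or by the end of the input.
fn unescape_bytes(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len());
    let mut iter = input.iter().copied();
    while let Some(b) = iter.next() {
        if b != FESC {
            // Ordinary byte: copy it through unchanged.
            out.push(b);
            continue;
        }
        // Escape byte: exactly one following byte selects the replacement.
        match iter.next()? {
            TFEND => out.push(FEND),
            TFESC => out.push(FESC),
            _ => return None,
        }
    }
    Some(out)
}
```

Because the escape is compared and skipped as a single byte, the input from try_fesc would unescape to the expected bytes here. A generic version inside nom would presumably take the escape as a parser or a byte rather than hard-coding it.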

Thanks!

Prerequisites

❯ rustc --version
rustc 1.71.0 (8ede3aae2 2023-07-12)

❯ grep nom Cargo.toml
nom = "7.1.3"

Test case

use nom::branch::alt;
use nom::bytes::complete::{escaped_transform, is_not, tag};
use nom::combinator::value;
use nom::IResult;

const FEND: u8 = 0xC0;
const FESC: u8 = 0xDB;
const TFEND: u8 = 0xDC;
const TFESC: u8 = 0xDD;

pub fn unescape(input: &[u8]) -> IResult<&[u8], Vec<u8>> {
    escaped_transform(
        is_not([FESC]),
        FESC as char,
        alt((
            value(&[FEND][..], tag(&[TFEND])),
            value(&[FESC][..], tag(&[TFESC])),
        )),
    )(input)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn try_fesc() {
        let res = unescape(&[0x61, 0x62, FESC, TFEND, 0x63, 0x64, 0x65]);
        assert_eq!(res, Ok((&[][..], vec![0x61, 0x62, FEND, 0x63, 0x64, 0x65])))
    }

    #[test]
    fn try_fesczerozero() {
        // the escape seems to be treated as two bytes wide (U+00DB in UTF-8), so 0x00 is skipped
        // this test case is *not* desired behavior, but I put it here
        // for insight into the implementation details
        let res = unescape(&[0x61, FESC, 0x00, TFEND, 0x63, 0x64]);
        assert_eq!(res, Ok((&[][..], vec![0x61, FEND, 0x63, 0x64])));
    }

    #[test]
    fn try_noesc() {
        let res = unescape(&[0x61, 0x62, 0x63]);
        assert_eq!(res, Ok((&[][..], vec![0x61, 0x62, 0x63])));
    }
}

Output of test run:

❯ cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.00s
     Running unittests src/lib.rs (target/debug/deps/nomplayground-ec796cae7e096d2e)

running 3 tests
test tests::try_noesc ... ok
test tests::try_fesczerozero ... ok
test tests::try_fesc ... FAILED

failures:

---- tests::try_fesc stdout ----
thread 'tests::try_fesc' panicked at 'assertion failed: `(left == right)`
  left: `Err(Error(Error { input: [99, 100, 101], code: Tag }))`,
 right: `Ok(([], [97, 98, 192, 99, 100, 101]))`', src/lib.rs:29:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    tests::try_fesc

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

error: test failed, to rerun pass `--lib`
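One possible reading of the failure, sketched with the standard library only: if the escape is skipped by the UTF-8 width of the char instead of by one byte, the transform parser starts one byte too late. That nom 7.1.3 does this is an assumption inferred from the error output above, and the helper name transform_start is hypothetical.

```rust
/// Hypothetical model (an assumption, not nom's code): the index at which
/// the transform parser would start if the escape were skipped by the
/// UTF-8 width of the `char` rather than by a single byte.
fn transform_start(escape_index: usize, escape: char) -> usize {
    escape_index + escape.len_utf8()
}
```

In the try_fesc input the escape byte sits at index 2. Under this model the transform would start at index 4, on the bytes [0x63, 0x64, 0x65], which is the same remaining input that the Tag error reports as [99, 100, 101].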

Geal commented 11 months ago

Right, it looks like it's missing something when looking at UTF-8 input.