rust-bakery / nom

Rust parser combinator framework
MIT License
9.18k stars 792 forks source link

`take_while_m_n` is invalid for multi-byte UTF-8 characters #1630

Closed epage closed 1 year ago

epage commented 1 year ago

take_m_n does:

    match input.position(|c| !cond(c)) {
      Some(idx) => {
        if idx >= m {
          if idx <= n {
            let res: IResult<_, _, Error> = if let Ok(index) = input.slice_index(idx) {
              Ok(input.take_split(index))
            } else {
              Err(Err::Error(Error::from_error_kind(
                input,
                ErrorKind::TakeWhileMN,
              )))
            };

while Input::position is defined as

  /// Returns the byte position of the first element satisfying the predicate
  fn position<P>(&self, predicate: P) -> Option<usize>
  where
    P: Fn(Self::Item) -> bool;

So it is looking up the byte position of the first non-matching character but then treats it as the Item count, comparing it with m and n and then using slice_index to convert it to a byte position. This should cause it to hit m and n sooner than it should and split at the wrong position.

I see that #913 was previously reported but the 7483885 and #1097 just added the "Item count to byte position" conversion but didn't address that it was using a byte position.

The reason the test from #913 is working is the else-clause for when there are more valid elements than n (4 is greater than 1), it caps it by n (1), making the Item count valid that it passes to slice_index.

To show this failing, we need to change m, n, and the input slightly

  #[test]
  fn take_while_m_n_utf8() {
    named!(parser<&str, &str>, take_while_m_n!(1, 2, |c| c == 'A' || c == '😃'));
    assert_eq!(parser("A!"), Ok(("!", "A")));
    assert_eq!(parser("😃!"), Ok(("!", "😃")));
  }

This fails, with the left-hand side reporting Ok(("", "😃!"). The only reason it doesn't crash is that slice_index, when exhausted, returns self.len().

I believe this problem predates #1612