rust-bakery / nom

Rust parser combinator framework
MIT License
9.47k stars 804 forks source link

Looking for pointers to documentation on how to resolve the deprecated ws! macro warnings #1059

Open shnewto opened 5 years ago

shnewto commented 5 years ago

👋 Hello!

Raising an issue to mostly ask for guidance. I've got a crate that's using nom 3.2 functionality for the most part and I'd like to update it to properly use nom 5 (it's just technically compatible with nom 5 atm). I'm particularly motivated by a lot of warnings about the deprecation of the ws! macro. But I'm having trouble locating documentation specific to converting existing / outdated macros to nom 5 functions and what I can use to replace the ws! macro. Any pointers to docs on that sort of thing would be really appreciated.

For context, the crate is called bnf. Here's an example of the kind of thing I'm looking to update:

named!(pub prod_lhs< &[u8], Term >,
    do_parse!(
            nt: delimited!(char!('<'), take_until!(">"), ws!(char!('>'))) >>
            _ret: ws!(tag!("::=")) >>
            (Term::Nonterminal(String::from_utf8_lossy(nt).into_owned()))
    )
);

Thanks!

djc commented 5 years ago

I would also be very interested in this for Askama!

shnewto commented 5 years ago

For anyone interested, I worked my way through upgrading bnf, the changes are here. Here's what that logic above ended up looking like:

pub fn prod_lhs<'a>(input: &'a str) -> IResult<&'a str, Term, VerboseError<&'a str>> {
    let (input, nt) = delimited(
        complete::char('<'),
        take_until(">"),
        terminated(complete::char('>'), complete::multispace0),
    )(input)?;

    let (input, _) = preceded(
        complete::multispace0,
        terminated(tag("::="), complete::multispace0),
    )(input)?;

    Ok((input, Term::Nonterminal(nt.to_string())))
}

As far as that ws! macro, my solution ended up pretty verbose. What was previously ws!(tag!("::=")) ended up as:

preceded(
        complete::multispace0,
        terminated(tag("::="), complete::multispace0)

(Watch out for the difference between multispace0 and multispace1, zero or more whitespace characters vs 1 or more whitespace characters.)

One thing I bumped into was that it looks like there's no function equivalent to the eof! macro so I ended up implementing my own as eoi (end of input).

pub fn eoi<I: Copy + InputLength, E: ParseError<I>>(input: I) -> IResult<I, I, E> {
    if input.input_len() == 0 {
        Ok((input, input))
    } else {
        Err(nom::Err::Error(E::from_error_kind(input, ErrorKind::Eof)))
    }
}

Certainly interested in any feedback on what I've presented here if anything seems off.

Timmmm commented 5 years ago

The JSON and S-expression examples both explicitly add whitespace parsers everywhere. :-/

djc commented 5 years ago

I ended up writing this thing, I'm not sure why it's not included in nom:

fn ws<F, I, O, E>(inner: F) -> impl Fn(I) -> IResult<I, O, E>
where
    F: Fn(I) -> IResult<I, O, E>,
    I: InputTake + Clone + PartialEq + for<'a> Compare<&'a [u8; 1]>,
    E: ParseError<I>,
{
    move |i: I| {
        let i = alt::<_, _, (), _>((tag(b" "), tag(b"\t")))(i.clone())
            .map(|(i, _)| i)
            .unwrap_or(i);
        let (i, res) = inner(i)?;
        let i = alt::<_, _, (), _>((tag(b" "), tag(b"\t")))(i.clone())
            .map(|(i, _)| i)
            .unwrap_or(i);
        Ok((i, res))
    }
}

It can probably be simplified/improved somehow...

Timmmm commented 4 years ago

Don't you need a repeat in there somewhere? Probably you can make it simpler like this:

delimited(multispace0, inner, multispace0)

Throw some punctuation soup over that.

Also does it lead to problems allowing whitespace before and after every parser? I mean if you have foo bar it now has an ambiguous parse:

Maybe that doesn't matter. For now I've ended up using tuples everywhere....

map(
   tuple(
      tag("struct"),
      ws,
      identifier,
      ws,
      tag("{")
      ws,
      ...
    ),
    |_, _, name, _, _, ....| {
      Ast::Struct {
         name,
         ...
      }
    )

It's not particularly pretty but I don't think it is too bad either. I also have a rule that all items allow trailing whitespace, but not preceding, and then I have a single preceding ws for the entire file. That voids duplicate ws parsers.

My plan to handle comments is to put that logic all in ws, so it consumes all whitespace and comments.

Timmmm commented 4 years ago

Here is my whitespace parser that supports //-style comments.

fn ws<'a>(input: &'a str) -> IResult<&'a str, &'a str, VerboseError<&'a str>> {
    let mut in_comment = false;

    let mut chars = input.char_indices();

    while let Some((i, c)) = chars.next() {
        if in_comment {
            if c == '\n' {
                in_comment = false;
            }
        } else {
            match c {
                ' ' | '\t' | '\r' | '\n' => {},
                '/' => {
                    // Might be a comment - check the next character.
                    match chars.next() {
                        Some((_, '/')) => {
                            in_comment = true;
                        },
                        _ => {
                            return Ok((&input[i..], &input[..i]))
                        }
                    }
                },
                _ => {
                    return Ok((&input[i..], &input[..i]))
                }
            }
        }
    }

    Ok((&input[0..0], &input))
}

fn ws1<'a>(i: &'a str) -> IResult<&'a str, &'a str, VerboseError<&'a str>> {
    match ws(i) {
        Ok((_, whitespace)) if whitespace.is_empty() => {
            Err(Err::Error(VerboseError::from_char(i, ' ')))
        },
        err => err,
    }
}

And here's how I use it to parse a struct (in a custom language):

fn parse_struct<'a>(i: &'a str) -> IResult<&'a str, Item<'a>, VerboseError<&'a str>> {
    map(
        context(
            "parsing struct",
            preceded(
                tag("struct"),
                cut(
                    tuple((
                        ws1,
                        identifier,
                        ws,
                        tag("{"),
                        many0(
                            preceded(
                                ws,
                                field,
                            ),
                        ),
                        ws,
                        tag("}"),
                    )),
                ),
            ),
        ),
        |(_, name, _, _, fields, _, _)| {
            Item::Struct {
                name,
                fields,
            }
        },
    )(i)
}

By the way, &input[0..0] seems a bit silly to me but I'm not enough of a Rust pro to know if there is something better (and technically it should be &input[input.len()..] or something like that).

Timmmm commented 4 years ago

Also note that it doesn't differentiate between whitespace and comments. If you actually want to use the comments (e.g. for syntax highlighting, or code generation) then you'll need something like this:

preceded(
  multispace0,
  many0(
    terminated(
      comment,
      multispace0,
    )
  )
)

Where comment is

preceded(
  tag("//"),
  take_while(|c| c != '\n'),
)

That may be simpler anyway. I haven't tested it though.

proehlen commented 4 years ago

I'm a rust noob but the below seems to be working for me. I'm sharing it since this thread amongst other places helped me to get to it and it might be useful for others. The terminated(many0(), whitespace) function call discards trailing whitespace for the whole input (ie spaces at the end of the line). The preceeded(whitespace, alt()) call discards leading whitespace before each individual token.

  let tokenizer =
    all_consuming(
      complete(
        terminated(
          many0(
            preceded(
              whitespace,
              alt((
                gstring,
                integer,
                float,
                operator,
                keyword,
              ))
            )
          ),
          whitespace,
        )
      )
    );

In my application, the input is all u8 bytes and the only whitespace is the space character so this is my whitespace function:

fn whitespace(input: &[u8]) -> IResult<&[u8], &[u8]> { 
  take_while(|b| b == ' ' as u8)(input) 
} 
djc commented 4 years ago

FWIW, in Askama I ended up with this thing:

fn ws<F, I, O, E>(inner: F) -> impl Fn(I) -> IResult<I, O, E>
where
    F: Fn(I) -> IResult<I, O, E>,
    I: InputTake + Clone + PartialEq + for<'a> Compare<&'a [u8; 1]>,
    E: ParseError<I>,
{
    move |i: I| {
        let ws = many0(alt::<_, _, (), _>((
            tag(b" "),
            tag(b"\t"),
            tag(b"\r"),
            tag(b"\n"),
        )));
        let i = ws(i.clone()).map(|(i, _)| i).unwrap_or(i);
        let (i, res) = inner(i)?;
        let i = ws(i.clone()).map(|(i, _)| i).unwrap_or(i);
        Ok((i, res))
    }
}

(It could perhaps alternatively take a slice for the whitespace characters.)