rust-bakery / nom

Rust parser combinator framework
MIT License

Parsing just enough spaces to get to next part of the input #1185

Open Gaelan opened 4 years ago

Gaelan commented 4 years ago

I'm trying to parse something along the lines of this:

John Doe*         jdoe at gmail dot com
Bob D. Example    bob at example dot com
Long Named-Person longname123 at hotmail dot com

Note that some people have asterisks after their names (and I want to know if they do), and that there may be only a single space after someone's name, so we can't just look for multiple spaces. I think the column width might be constant, but I'd rather not rely on it if I don't have to. I don't need to deal with names that look like email addresses.

With a regex, I could parse it with something like /(.*)+\*? +(.*) at (.*) dot (.*)/. However, I'm not sure how best to do this in Nom. It's easy enough to build the equivalent of (.*) at (.*) dot (.*) (let's call it email), but I think I need a combinator that advances until email matches, then goes back and parses the first part of the line (name, optional asterisk, then any number of spaces). Did I miss that combinator in the docs? Is there some other approach that I'm missing?

djeedai commented 3 years ago

I think the problem is that your format is loosely defined.

First, it seems your regex only works with Golang flavor; this online analyzer fails on all other variants, including JavaScript and PCRE flavors. I have a hard time understanding what you intend to mean by (.*)+ for example, since * is generally greedy.

Second, it seems that the format is in fact a fixed-width column one. If you don't want to rely on that, then what is your condition to delimit the name from the email? Are you choosing to treat every word before at, except the last one, as part of the name, expecting that the email part before at is always a single word (which is reasonable, since spaces are not allowed in emails)? If that's the case, I am no expert, but have you tried terminated!( <the name part> , <the email part> )? It sounds to me like this could work. If I understand correctly how terminated! works, it will keep matching the first combinator while the second one fails. So in your case you'd have to ensure that the second one, <email part>, takes a single space-separated word before at, and therefore fails while more words remain (because of the name parts), until there's only one left (the email part). Something like (warning, this is just pseudo-code):

terminated!(
  separated_list!(
    sp,
    alphanumeric1
  ),
  do_parse!(
    email: alphanumeric1 >>
    sp >>
    tag!("at") >>
    sp >>
    domain: alphanumeric1 >>
    sp >>
    tag!("dot") >>
    sp >>
    ext: alphanumeric1 >>
    (email, domain, ext)
  )
)

djeedai commented 3 years ago

Note that this won't work because separated_list!() is greedy, but you get the idea.
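
For comparison, here is a minimal sketch of the delimiting rule discussed above (the last word before at is the email user, everything before it is the name), written with only the Rust standard library rather than nom. The `Entry` struct, its field names, and `parse_line` are made up for illustration and are not part of nom or of the original question. It assumes the literal separators " at " and " dot " and that the user part contains no spaces.

```rust
/// One parsed row. Field names are illustrative, not from the thread.
#[derive(Debug, PartialEq)]
struct Entry<'a> {
    name: &'a str,
    starred: bool,
    user: &'a str,
    domain: &'a str,
    ext: &'a str,
}

/// Parse `<name>[*]<spaces><user> at <domain> dot <ext>`.
/// Returns `None` if the line does not have that shape.
fn parse_line(line: &str) -> Option<Entry<'_>> {
    // Work from the right: split off "<domain> dot <ext>" at the last " at ".
    let (head, rest) = line.trim_end().rsplit_once(" at ")?;
    let (domain, ext) = rest.split_once(" dot ")?;

    // The last space-separated word of the head is the email user part.
    let (name_part, user) = head.trim_end().rsplit_once(' ')?;

    // Whatever padding separated the columns is trailing whitespace here.
    let name_part = name_part.trim_end();

    // A trailing asterisk is a flag, not part of the name.
    let (name, starred) = match name_part.strip_suffix('*') {
        Some(stripped) => (stripped, true),
        None => (name_part, false),
    };

    if name.is_empty() || user.is_empty() {
        return None;
    }
    Some(Entry { name, starred, user, domain: domain.trim(), ext: ext.trim() })
}
```

Scanning from the right sidesteps the greediness problem, because the name is simply whatever remains once the email tail has been removed. It does not rely on a fixed column width, and a single space between name and email is enough.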

montmorill commented 1 month ago

How can I get a non-greedy version of a parser such as many0? It has been 4 years since this issue was opened, and I wonder whether there is a solution in today's nom 7.
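
One way to picture a non-greedy repetition is a loop that tries the terminator before every step and stops as soon as it matches. The sketch below hand-rolls that idea with the standard library only. `PResult`, `repeat_until`, `any_char`, and `at_sep` are hypothetical names chosen for this example and are not nom APIs. nom 7 does ship a combinator of the same shape, `nom::multi::many_till(f, g)`, which may be the closest library answer, though whether it fits a given grammar is an assumption to check.

```rust
/// Result shape assumed for this sketch: remaining input plus a value.
type PResult<'a, T> = Option<(&'a str, T)>;

/// Apply `item` repeatedly, but try `end` first at every position and
/// stop as soon as it succeeds. Returns the collected items and the
/// output of `end`. Fails if `item` fails before `end` ever matches.
fn repeat_until<'a, A, B>(
    mut item: impl FnMut(&'a str) -> PResult<'a, A>,
    mut end: impl FnMut(&'a str) -> PResult<'a, B>,
    mut input: &'a str,
) -> PResult<'a, (Vec<A>, B)> {
    let mut items = Vec::new();
    loop {
        // Terminator wins over the item parser: this is the non-greedy part.
        if let Some((rest, e)) = end(input) {
            return Some((rest, (items, e)));
        }
        let (rest, a) = item(input)?;
        // Guard against an item parser that consumes nothing (infinite loop).
        if rest.len() == input.len() {
            return None;
        }
        items.push(a);
        input = rest;
    }
}

/// Consume exactly one character.
fn any_char(s: &str) -> PResult<'_, char> {
    let c = s.chars().next()?;
    Some((&s[c.len_utf8()..], c))
}

/// Hypothetical terminator for this example: the literal " at ".
fn at_sep(s: &str) -> PResult<'_, ()> {
    s.strip_prefix(" at ").map(|rest| (rest, ()))
}
```

Calling `repeat_until(any_char, at_sep, "jdoe at gmail")` collects the characters of `jdoe` and leaves `gmail` as the remaining input. The trade-off is that the terminator is re-tried at every position, so an expensive terminator makes the loop quadratic in the worst case.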