rust-bakery / nom

Rust parser combinator framework
MIT License
9.36k stars 802 forks source link

nom::branch::permutation does not support optional sub parsers #1153

Open JanLek opened 4 years ago

JanLek commented 4 years ago

nom::branch::permutation does not support optional sub parsers

Prerequisites

Description

I am writing a parser that uses several sub parsers, but the order in which they appear can change, so I thought I should use nom::branch::permutation. However, some of the sub-parsers are optional, and as far as I can tell, nom::branch::permutation does not have a way of marking some of the sub-parsers as optional.

The permutation macro does support this, but I am not sure if I can use it without also re-writing the sub-parsers with macros. Is there a way this functionality can be made available for nom::branch::permutation? Or maybe this is already possible and I've just missed something?

Test case

Apologies, the following test case is rather contrived. The permutation macro supports optional sub parsers, so this is valid:

use nom::{named, permutation, tag};

named!(test_parser_macro<&str, (Option<&str>, Option<&str>)>,
  permutation!(tag!("foo")?, tag!("bar")?)
);

let result = test_parser_macro("barfoo").expect("No errors in this test");

assert_eq!(("", (Some("foo"), Some("bar"))), result);

As far as I can tell, optional parsers are not supported with the permutation function. I did try it out, just in case, but indeed it does not work:

use nom::{
  branch::permutation,
  bytes::complete::tag,
  combinator::opt,
  IResult,
};

fn test_parser_function<'a>(input: &'a str) -> IResult<&str, (Option<&'a str>, Option<&'a str>)> {
  permutation((opt(tag("foo")), opt(tag("bar"))))(input)
}

let result = test_parser_function("barfoo").expect("No errors in this test");

assert_eq!(("", (Some("foo"), Some("bar"))), result); // Returns ("foo", (None, Some("bar")))
erihsu commented 3 years ago

Similar question.

RustyJoeM commented 3 years ago

same problem, it seems that parser succeeds/fails depending on order of input data of the permutation() children nodes...

e.g. inpt text:

    organization "ACME comp";
    contact "joe joe";
    description "blah blah";
    reference "that RFC over there";

is parsed correctly, but the "same" input with reordered items:

    organization "ACME comp";
    contact "joe joe";
    reference "that RFC over there";
    description "blah blah";

fails with error for parser:

permutation((
    opt(organization_stmt),
    opt(contact_stmt),
    opt(description_stmt),
    opt(reference_stmt),
))(input)

with error:

x = Err(
    Error(
        Error {
            input: "description \"blah blah\";\n}\n",
            code: Tag,
        },
    ),
)
Stargateur commented 3 years ago

permutation doesn't make sense for optional parser, it's only make sense for mandatory present and exclusive but not ordered thing. (so quite limited use case...)

I advice you to use many0 a alt a map and an enum for example:

many0(alt((
    map(organization_stmt, MyEnum::Organization),
    etc...,
));

One can also use a while loop instead or many_fold0 and use an hashmap to build the data structure that represent one possible item.

RustyJoeM commented 3 years ago

Ok thanks for confirmation.

Random order of optional single instance statements seems to be rather frequent pattern in telco/communications protocols and/or data format grammars, thus "making sense" in user point of view is not that irrational imho.

I realize though why it's not currently supported and what this means on the implementaion level -> it would need extra backtracking and/or some sort of heuristics/dynamic programming patterns to match various possible cases.

When opt statements are 0 to 1 instances, not "manys", workaround requires additonal count check or to assure that at most one instance per each enum variant exists.

I will need to cope with it and come up with some wrapper parser then to address the issue.

adamgreig commented 3 years ago

In case it's helpful, my workaround for this ended up being explicitly enumerating the possible permutations with help of a macro, since I only had to support four optional and in-any-order items:

https://github.com/adamgreig/svf/blob/10bde3d2602e0f05ffa3b2c2af749546145253e5/src/parser.rs#L401-L432

Stargateur commented 3 years ago

In case it's helpful, my workaround for this ended up being explicitly enumerating the possible permutations with help of a macro, since I only had to support four optional and in-any-order items:

https://github.com/adamgreig/svf/blob/10bde3d2602e0f05ffa3b2c2af749546145253e5/src/parser.rs#L401-L432

very inefficient but it's work

I realize though why it's not currently supported and what this means on the implementaion level -> it would need extra backtracking and/or some sort of heuristics/dynamic programming patterns to match various possible cases.

That simply impossible, how permutation would know when it's should retry, maybe a specialize permutation that "count" option and try to have the much "some" as possible, but again very not efficient

When opt statements are 0 to 1 instances, not "manys", workaround requires additonal count check or to assure that at most one instance per each enum variant exists. I will need to cope with it and come up with some wrapper parser then to address the issue.

That will be hard to make it simple, if I recall, I end up using some unsafe doing a enum manually using hashmap with enum as a key (some sort of discriminant), and a union as value to avoid any overhead, but I advice simply use enum and unwrap if you are not confident for unsafe. Specially for a first try.

Using a hashmap and check if any duplicate exist is *~O(1) so much better than permutation + alt solution that is O(n²) (or O(!n) not sure). But it's allocate, maybe if you know in advance the number of match expected you could use an array, but find the duplicate would be a little longer O(n), not important for small array, problem if you starting to make a giant array.

I guess solution like https://docs.rs/tinymap/0.2.4/tinymap/struct.TinyMap.html + enum should be simple enough.

RustyJoeM commented 3 years ago

In case it's helpful, my workaround for this ended up being explicitly enumerating the possible permutations with help of a macro, since I only had to support four optional and in-any-order items:

https://github.com/adamgreig/svf/blob/10bde3d2602e0f05ffa3b2c2af749546145253e5/src/parser.rs#L401-L432

thanks! yes, i also considered this approach for low count variations - it's nice and easy to implement (of course inneffective due to exhaustive attempts). Unfortunately my ABNF has 43 different "opt-perms" that have between 2 to about 12 mandatory/optional sub-statements in any order... 😱

That simply impossible, how permutation would know when it's should retry, maybe a specialize permutation that "count" option and try to have the much "some" as possible, but again very not efficient

due to nature of problem, i may need to live with lower efficiency and extra allocations etc. will see how the cumbersome peek/map approach goes. It looks like all/most of the cases of my ABNF do have unique leading keyword to allow peeking/parsing/map storing in loop rather directly.

Thanks to all for references and insights!

woile commented 1 year ago

I'm having a similar issue. I'm trying to parse apache's avro avdl namespace + alias, it can appear likes this:

@namespace("org.apache.avro.someOtherNamespace")
@aliases(["org.old.OldRecord", "org.ancient.AncientRecord"])

or

@aliases(["org.old.OldRecord", "org.ancient.AncientRecord"])
@namespace("org.apache.avro.someOtherNamespace")

but only 1 can appear as well, so the order doesn't matter, but only once, and they cannot be repeated. I think if I were to use many0 it would imply that they can appear more than once.

I'm gonna try some of the proposed solutions here and update.

woile commented 1 year ago

I started using https://crates.io/crates/nom_permutation from @Stargateur which seems to work fine for my use case.

Thanks