Closed xadhoom closed 2 years ago
Another even minimal example:
Given this grammar:
defmodule SimpleParse do
use AbnfParsec,
debug: true,
abnf: """
number = DIGIT / ( LEADDIGIT 1*DIGIT )
fixednumber = 3DIGIT
LEADDIGIT = %x31-39 ; 1-9
"""
end
Output is:
iex(3)> SimpleParser.fixednumber("123")
{:ok, [fixednumber: '123'], "", %{}, {1, 0}, 3}
iex(4)> SimpleParser.number("123")
{:ok, [number: '1'], "23", %{}, {1, 0}, 1}
Is that correct? I expect no rest in the number/1
call, or not?
After investigating the generated NimbeParsec definitions, seems a greediness issue.
The choice
operator matches on the first item, without being greedy and exits.
So, this works:
defparsec(
:number,
tag(
choice([
parsec(:leaddigit) |> times(parsec(:core_digit), min: 1),
parsec(:core_digit)
]),
:number
)
)
defparsec(:core_digit, tag(ascii_char([48..57]), :core_digit))
defparsec(:leaddigit, tag(ascii_char([49..57]), :leaddigit))
Note the number
definition, which have been generated by the following grammar:
number = ( LEADDIGIT 1*DIGIT ) / DIGIT
LEADDIGIT = %x31-39 ; 1-9
So, what's the correct approach? ABNF grammars tend to have the less specific match first, but this means that the library should be as greedy as possible. Or is better to "fix" the RFC grammar to put longer matches first?
(deleted my last comment, seems that lookahead/2
is not helpful, I was looking to the wrong thing sorry)
There doesn't seem to be an easy way to solve this, with code.
It's quite unfortunate that the ABNF spec is ambiguous about the order of the alternatives.
https://stackoverflow.com/a/54951921/1125494
Some RFCs clarify this explicitly, with for example IMAPv4's RFC3501 including specification of PEG-like behaviour in RFC 3501 section 9:
In the case of alternative or optional rules in which a later rule overlaps an earlier rule, the rule which is listed earlier MUST take priority. For example, "\Seen" when parsed as a flag is the \Seen flag name and not a flag-extension, even though "\Seen" can be parsed as a flag-extension. Some, but not all, instances of this rule are noted below.
I don't know how common such disambiguation (hah) is, though. Many other RFCs I've looked at (I've been implementing an ABNF parser library in recent days) just leave it unspecified. Many RFC ABNF grammars are unambiguous (e.g. RFC8259 (JSON)); however, many are ambiguous (e.g. RFC5322 (Internet Messages)) and require fixups to work with an ambiguity-preserving parser :-(
There is no good way to determine which path matches the most content, unless we brute force iterate over all choices.
NimbleParsec
doesn't really backtrack, it only does it when for example within a choice
. I tried adding a parsec function that does this parsec(:oid) |> eos()
and it simply fails.
Making the grammar more specific is the best thing I can think of. Or skip this rule generation and write your own, which is functionally equivalent.
Thanks for your feedback and explanation.
For my use case, I'll alter a bit the grammar since is trivial to do and I agree that's the best way.
Feel free to close this issue so :)
Cheers.
I was trying to parse a OID string, as defined in RFC3061, but have what seems an incomplete parsing.
Given the module
Parsing single digit OIDs seems ok:
But parsing OIDs with multidigit members seems incomplete:
Basically it stops matching at the first repetition in the
number
definition.I'm missing something maybe? I was trying to build up a parser for ldap string search filter definitions, but stopped here, and the OID example is the minimal one I was able to extract.