princemaple / abnf_parsec

ABNF in, parser out
https://hexdocs.pm/abnf_parsec/AbnfParsec.html
MIT License
48 stars 2 forks source link

Incomplete parsing of repetitions #30

Closed xadhoom closed 2 years ago

xadhoom commented 2 years ago

I was trying to parse a OID string, as defined in RFC3061, but have what seems an incomplete parsing.

Given the module

defmodule OidParser do
  @moduledoc """
  OID parser based on rfc3061 abnf
  """

  use AbnfParsec,
    parse: :oid,
    abnf: """
    oid             = number *( DOT number )
    number          = DIGIT / ( LEADDIGIT 1*DIGIT )
    LEADDIGIT       = %x31-39 ; 1-9
    DIGIT           = %x30 / LEADDIGIT ; 0-9
    DOT             = %x2E ; period
    """
end

Parsing single digit OIDs seems ok:

iex(2)> OidParser.oid("1.9.0.3.4")                   
{:ok,
 [
   oid: [
     number: '1',
     dot: '.',
     number: '9',
     dot: '.',
     number: '0',
     dot: '.',
     number: '3',
     dot: '.',
     number: '4'
   ]
 ], "", %{}, {1, 0}, 9}

But parsing OIDs with multidigit members seems incomplete:

iex(1)> OidParser.oid("1.9.0.23.2342.19200300.100.4")
{:ok,
 [
   oid: [
     number: '1',
     dot: '.',
     number: '9',
     dot: '.',
     number: '0',
     dot: '.',
     number: '2'
   ]
 ], "3.2342.19200300.100.4", %{}, {1, 0}, 7}

Basically it stops matching at the first repetition in the number definition.

I'm missing something maybe? I was trying to build up a parser for ldap string search filter definitions, but stopped here, and the OID example is the minimal one I was able to extract.

xadhoom commented 2 years ago

Another even minimal example:

Given this grammar:

defmodule SimpleParse do
  use AbnfParsec,
    debug: true,
    abnf: """
    number          = DIGIT / ( LEADDIGIT 1*DIGIT )
    fixednumber     = 3DIGIT
    LEADDIGIT       = %x31-39 ; 1-9
    """
end

Output is:

iex(3)> SimpleParser.fixednumber("123")
{:ok, [fixednumber: '123'], "", %{}, {1, 0}, 3}
iex(4)> SimpleParser.number("123")     
{:ok, [number: '1'], "23", %{}, {1, 0}, 1}

Is that correct? I expect no rest in the number/1 call, or not?

xadhoom commented 2 years ago

After investigating the generated NimbeParsec definitions, seems a greediness issue. The choice operator matches on the first item, without being greedy and exits.

So, this works:

  defparsec(
    :number,
    tag(
      choice([
        parsec(:leaddigit) |> times(parsec(:core_digit), min: 1),
        parsec(:core_digit)
      ]),
      :number
    )
  )

  defparsec(:core_digit, tag(ascii_char([48..57]), :core_digit))
  defparsec(:leaddigit, tag(ascii_char([49..57]), :leaddigit))

Note the number definition, which have been generated by the following grammar:

    number          = ( LEADDIGIT 1*DIGIT ) / DIGIT 
    LEADDIGIT       = %x31-39 ; 1-9

So, what's the correct approach? ABNF grammars tend to have the less specific match first, but this means that the library should be as greedy as possible. Or is better to "fix" the RFC grammar to put longer matches first?

xadhoom commented 2 years ago

(deleted my last comment, seems that lookahead/2 is not helpful, I was looking to the wrong thing sorry)

princemaple commented 2 years ago

There doesn't seem to be an easy way to solve this, with code.

It's quite unfortunate that the ABNF spec is ambiguous about the order of the alternatives.

https://stackoverflow.com/a/54951921/1125494

Some RFCs clarify this explicitly, with for example IMAPv4's RFC3501 including specification of PEG-like behaviour in RFC 3501 section 9:

In the case of alternative or optional rules in which a later rule overlaps an earlier rule, the rule which is listed earlier MUST take priority. For example, "\Seen" when parsed as a flag is the \Seen flag name and not a flag-extension, even though "\Seen" can be parsed as a flag-extension. Some, but not all, instances of this rule are noted below.

I don't know how common such disambiguation (hah) is, though. Many other RFCs I've looked at (I've been implementing an ABNF parser library in recent days) just leave it unspecified. Many RFC ABNF grammars are unambiguous (e.g. RFC8259 (JSON)); however, many are ambiguous (e.g. RFC5322 (Internet Messages)) and require fixups to work with an ambiguity-preserving parser :-(

There is no good way to determine which path matches the most content, unless we brute force iterate over all choices.

NimbleParsec doesn't really backtrack, it only does it when for example within a choice. I tried adding a parsec function that does this parsec(:oid) |> eos() and it simply fails.

Making the grammar more specific is the best thing I can think of. Or skip this rule generation and write your own, which is functionally equivalent.

xadhoom commented 2 years ago

Thanks for your feedback and explanation.

For my use case, I'll alter a bit the grammar since is trivial to do and I agree that's the best way.

Feel free to close this issue so :)

princemaple commented 2 years ago

Cheers.