kschiess / parslet

A small PEG based parser library. See the Hacking page in the Wiki as well.
kschiess.github.com/parslet
MIT License

Tricks for brace (un)matching? #212

Closed. dmolesUC closed this issue 3 years ago.

dmolesUC commented 3 years ago

I'm trying to write a parser for the MARCSpec grammar, and I'm running into trouble with this bit:

VCHAR             =  %x21-7E
                    ; visible (printing) characters          <-- note that this includes "}"
…
comparisonString  = "\" *VCHAR
…
subTerm           = fieldSpec / subfieldSpec / indicatorSpec / comparisonString / abbreviation
subTermSet        = [ [subTerm] operator ] subTerm
subSpec           = "{" subTermSet *( "|" subTermSet ) "}"   <-- using "}" as a delimiter

Relevant parts of my code:

rule(:vchar) { match['\u0021-\u007e'] }
…
rule(:comparison_string) { str('\\') >> vchar.repeat.as(:value) }
…
rule(:sub_term) { field_spec | subfield_spec | indicator_spec | comparison_string | abbreviation }
rule(:sub_term_set) { (sub_term.maybe >> operator).maybe >> sub_term }
rule(:sub_spec) { str('{') >> (sub_term_set >> (str('|') >> sub_term_set).repeat) >> str('}') }

I can parse a subTermSet fine, but not a subSpec: \A parses, but {\A} does not. I think what's happening is that my parser for comparisonString, not unreasonably, treats the trailing } as part of its own value, so nothing is left for the closing } of subSpec to match. If I simplify my subSpec rule down to:

# subSpec           = "{" comparisonString "}" 
rule(:sub_spec) { str('{') >> comparison_string >> str('}') }

I still get this failure:

expected SUB_SPEC to be able to parse "{\\A}"
Failed to match sequence ('{' COMPARISON_STRING '}') at line 1 char 5.
`- Premature end of input at line 1 char 5.

Is there any way to get around this, or am I running up against some inherent PEG parser limitation?
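
The failure is consistent with how PEG repetition behaves in general: it is greedy and never gives characters back. A rough stdlib-only analogy (not Parslet itself) is a possessive regex quantifier. The constant names below are made up for illustration:

```ruby
# Possessive repetition (*+) never backtracks, much like repetition in a PEG.
VCHAR = '[\x21-\x7e]' # visible ASCII, which includes "}"

# "{" "\" *VCHAR "}" with possessive repetition: *VCHAR swallows the "}".
PEG_LIKE = /\A\{\\#{VCHAR}*+\}\z/

# The same pattern with ordinary backtracking repetition, for contrast.
BACKTRACKING = /\A\{\\#{VCHAR}*\}\z/

input = '{\A}' # single quotes: the four characters { \ A }

puts "possessive: #{PEG_LIKE.match?(input)}, " \
     "backtracking: #{BACKTRACKING.match?(input)}"
# prints "possessive: false, backtracking: true"
```

So this is less a Parslet quirk than a property of the grammar as written: once *VCHAR may contain }, a non-backtracking parser has no way to tell the closing delimiter from part of the value.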

dmolesUC commented 3 years ago

My workaround for now is to disallow } in comparison_string values, but it would be nice not to have to do that. I looked at the balanced parentheses example, but that's not quite what I'm looking for.
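
A minimal sketch of that workaround, again as a stdlib regex analogy rather than Parslet code: a negative lookahead stops the repetition at the first }. In Parslet, the corresponding negative lookahead is `absent?`, so the rule would presumably look like `(str('}').absent? >> vchar).repeat`. The names SUB_SPEC and comparison_value are hypothetical:

```ruby
VCHAR = '[\x21-\x7e]' # visible ASCII, which includes "}"

# Consume a VCHAR only when it is not "}" (negative lookahead), so the
# closing brace is left over for the enclosing subSpec pattern.
SUB_SPEC = /\A\{\\(?<value>(?:(?!\})#{VCHAR})*+)\}\z/

# Returns the comparison string value, or nil if the input does not match.
def comparison_value(input)
  m = SUB_SPEC.match(input)
  m && m[:value]
end

p comparison_value('{\A}')   # prints "A"
p comparison_value('{\A}B}') # prints nil: the value cannot contain "}"
```

The trade-off is the one noted above: a value can no longer contain a literal }.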

dmolesUC commented 3 years ago

Coming back to this, I realized that for this particular use case there is an escape mechanism that the simple grammar doesn't capture: } and several other characters can appear in the body of comparisonString only if escaped by \. New solution:

# ASCII visible characters, except those that need to be escaped
rule(:vchar_cs) { match['\u0021-\u007e&&[^!$=?{|}~]'] }

# ASCII visible characters that need to be escaped
rule(:vchar_cs_esc) { match['!$=?{|}~'] }

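# The escaped form is tried first: vchar_cs also matches "\", so listing it
# first would consume the backslash and leave the escaped character unmatched.
rule(:comparison_string) do
  (vchar_cs | vchar_cs_esc) >>
      ((str('\\').ignore >> vchar_cs_esc) | vchar_cs).repeat
end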

rule(:_comparison_string) { str('\\').ignore >> comparison_string.as(:value) }
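
For a quick check of the escaping idea without Parslet, here is a stdlib-only regex approximation. It assumes the same two character sets as above, and it tries the escaped alternative before the plain one, because the plain set also contains the backslash. ESC, PLAIN, SUB_SPEC and comparison_value are illustrative names, not part of any library:

```ruby
ESC   = '[!$=?{|}~]'               # characters that must be escaped
PLAIN = '[\x21-\x7e&&[^!$=?{|}~]]' # other visible ASCII, including "\"

# "{" "\" value "}": the first value character may be any visible character;
# after that, special characters are accepted only when preceded by "\".
SUB_SPEC = /\A\{\\(?<value>(?:#{PLAIN}|#{ESC})(?:\\#{ESC}|#{PLAIN})*+)\}\z/

# Returns the unescaped value, or nil if the input does not match.
def comparison_value(input)
  m = SUB_SPEC.match(input)
  return nil unless m

  # Drop the backslash in front of each escaped special character.
  m[:value].gsub(/\\(#{ESC})/, '\1')
end

p comparison_value('{\A}')     # prints "A"
p comparison_value('{\A\}B}')  # prints "A}B"
p comparison_value('{\A}B}')   # prints nil: the unescaped "}" ends the value
```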