sirthias / parboiled2

A macro-based PEG parser generator for Scala 2.10+
Other
717 stars 86 forks source link

Support for manual `cut` markers #102

Closed sirthias closed 9 years ago

sirthias commented 10 years ago

This ticket is a weakened version of https://github.com/sirthias/parboiled/issues/19 Automatic insertion of cut markers based on static rule tree analysis is not currently possible in pb2, but we can support manual cut markers.

lihaoyi commented 10 years ago

Manual cut markers would be fine. I want this more for error reporting and semantics rather than for performance: manually inserting cuts in a few places to prevent backtracking means you can get very-very-very localized error messages ("expected :, didn't see it"), rather than backtracking out to the widest possible | clause and saying "none of these parsed".

It's also entirely possible I want to insert cut markers in places which change the semantics of the parser, which I think is perfectly fine: the behavior is well defined and relatively easy to understand.

sirthias commented 10 years ago

Interesting. Could you show a small simple example demonstrating in what way you'd expect a manual cut marker to improve the error message? parboiled is already pretty good in producing meaningful and very specific error messages, currently I don't see how a manual cut marker would improve things. But I haven't really thought this through completely, so I might very well be missing something...

lihaoyi commented 10 years ago

Let's say I have this rule

def R1 = rule{ "i am a cow" ~ " " ~ "moo!!!" ~ push("R1")}
def R2 = rule{ "i am a " ~ oneOrMore(CharPredicate.Alpha | " " | "!") ~ "gg" ~ push("R2")}
def R = rule{ R1 | R2 }

Now what should

new Parser("""i am a cow moo!!zzz""").R.run()

Give? currently it gives

Position(20,1,21)

Alpha, ' ', '!' or 'g'

4 rules mismatched at error location:
  R / R2 / oneOrMore / | / Alpha
  R / R2 / oneOrMore / | / " " / ' '
  R / R2 / oneOrMore / | / "!" / '!'
  R / R2 / "gg" / 'g'```

With a cut in the same location after "cow" to force the parser to commit:

def R1 = rule{ "i am a cow" ~ " " ~ cut ~ "moo!!!" ~ push("R1")}

It could give

Position(16,1,17)

'!'

1 rule mismatched at error location:
  R / R1 / "moo!!!" / '!'

Now this is a toy example, but it's the kind of thing where I think cut is useful. You have an ad-hoc grammar, and want to add ad-hoc commitment to not-backtrack at various parts of it, dictated by "business rules" more than by elegance-of-grammar. Maybe I want to know that after parsing "cow" I have to moo!!!, but for any other string I can do whatever I want. Not-unreasonable!

cut lets you very easily pick spots from which there's no backtracking, meaning that you can localize the error messages much more. Currently, AFAICT the behavior is, when it fails, to pick the longest successful parse and use that for the error message, which is kinda neat, but not necessarily the behavior I want all the time.

Hope this is useful!

sirthias commented 10 years ago

Interesting. I can definitely see the usefulness of a cut marker as another tool to steer the parser and maybe simplify the grammar definition by enabling looser rules in subsequent alternatives.

Thinking about it some more it might even make sense to introduce two different kinds of cut:

  1. A "soft cut", which only prevents trying subsequent alternatives in the next-higher choice operator but does allow them in even further up the rule stack.
  2. A "hard cut", which simply bails out completely.
lihaoyi commented 9 years ago

Having this feature would make working on https://github.com/lihaoyi/hands-on-scala-js/blob/master/scalaParser/src/main/scala/scalaParser/ScalaSyntax.scala way less painful. Suppose I'm trying to parse this file and there is a parse error somewhere inside:

diff --git a/scalaParser/src/test/resources/test.scala b/scalaParser/src/test/resources/test.scala
index d560736..b095c32 100644
--- a/scalaParser/src/test/resources/test.scala
+++ b/scalaParser/src/test/resources/test.scala
@@ -687,7 +687,7 @@ trait GenJSExports extends SubComponent { self: GenJSCode =>
         (toTypeKind(tpe): @unchecked) match {
           case VoidKind    => HijackedTypeTest(Defs.BoxedUnitClass,    0)
           case BooleanKind => HijackedTypeTest(Defs.BoxedBooleanClass, 1)
-          case ByteKind    => HijackedTypeTest(Defs.BoxedByteClass,    2)
+          case ByteKind    > HijackedTypeTest(Defs.BoxedByteClass,    2)
           case ShortKind   => HijackedTypeTest(Defs.BoxedShortClass,   3)
           case IntKind     => HijackedTypeTest(Defs.BoxedIntegerClass, 4)
           case FloatKind   => HijackedTypeTest(Defs.BoxedFloatClass,   5)

parboiled2 with its silly error-reporting-with-lots-of-retries strategy reports 1285 traces which takes 2.5 minutes, when a successful parse takes less than 1 second. I will not paste the traces but anyway I can't even C&P them properly because they run off the top of my console.

I'd say any reduction in the error-message-generation-time from 2.5 minutes can be considered an improvement. These files aren't even that big (26kb/762 lines). In fact, I now work only after removing the EOI from the end of the grammar, so that parboiled2 just succeeds with a partial parse and I can check the length to see if it's incomplete.

This also means I get zero error reporting, but at least the parser quickly tells me when it fails so I can start bisecting the file, which usually takes less time than 2.5 minutes! Anyway, the 1285 different traces aren't useful at all, and also run off the top of my console window so I can't Cmd-F to find things anymore, which is annoying. The fact that turning off error-reporting entirely makes for a better experience says a lot =P

Anyway, that's about "why the error reporting isn't very good". I'm just ranting, because I've spent a lot of time recently with parboiled2, and it's not often I see a feature that's trying so hard to be helpful but ends up being so harmful. Next, on to cut.


I don't really care whether the cut is a global cut or the nearest-ordered-choice cut. In MacroPEG I implemented global cut, and both would suffice for my use case, so whichever is easier to implement such that I can start using it sooner would win if it were up to me. Either implementation of cut would solve my problem entirely and make the error-reporting go from "worse than nonexistent" (see above) to "extremely excellent" (my experience with it) in a very short amount of time.

In the example above, having a cut operator (I don't care whether it's ~ cut ~ or ~!, but the latter would probably be easier since we'll want this a lot) would turn the error message into something like

Expected "=>" at line 24163, got ">"

That's it. No 1285 different very-long traces to dig through, no 2.5 minutes of waiting for the parser to just fail. No clever error-detection/parse-retries/analysis techniques. Just by adding cuts in the relevant places throughout the parser (in this case before the "=>" token). Furthermore, the behavior here is extremely easy to explain, whereas the current error reporting is so complex I don't understand after using it for the past two weeks, and asking about it on the mailing list. "There was a cut, and it didn't match after the cut, so it failed and didn't try anything else". 1 sentence that captures everything there is to know about how error reporting works, and how we can get such beautifully precise error messages.

Even if we wanted to keep the current error reporting behavior (I certainly don't; it's confusing/slow/unexpected) having cut would drastically reduce the search space of what went wrong, such that instead of 1285 traces, I'd get maybe 1 (if the cut was before the "=>") to a small handful (if the cut was after the "case") probably less than 5. Definitely still an improvement over 1285 different traces. Leaving out the current behavior would leave the failure at only a single point (the thing that failed, in this case "=>") which IMHO is good enough. I don't want my parser to start guessing which other traces may have passed (current behavior), I just want it to tell me exactly where it failed and what the thing it was trying to parse was!

Hopefully this sums up why cut would be useful ^_^

sirthias commented 9 years ago

Thanks for making your case so clearly! :) I completely agree with your assessment that the current error reporting is not good enough for a parser of the size you are working with. In fact, I think I have a unifying solution that combines the beauty of the existing error reporting (yes, it's there, you simply refuse to see it. Maybe because being global is not helpful for large parsers such as yours) with the directedness of cut markers:

We simply apply to current error reporting strategy only underneath the last cut marker before the error location. This means that with the help of cut markers you limit the error analysis space to what you want. Could be a single trace (as in your example) or could be 5. If you get 1285 you should probably add a cut marker somewhere. And we should limit trace gathering to a meaningful number, maybe 10 or 20 by default. Having things run for 2.5 minutes is clearly a problem.

lihaoyi commented 9 years ago

I don't think anyone has ever specified exactly what the existing error reporting does, not in emails nor in the docs, so I think I can be forgiven for not seeings its beauty ^_^ I just see it flooding my console with rubbish and taking 2.5 minutes to run

sirthias commented 9 years ago

Ok, I have just pushed build 2.0.2-SNAPSHOT for 2.10 and 2.11 to sonatype snapshots. It contains the fix for this ticket as well as #111. Let me know how this works for you.

/cc @paulp

lihaoyi commented 9 years ago

I haven't tried it out, but my first impression would be to make it ~! instead of ~!~. This is going to be in hundreds of places throughout my parser and so keeping it short makes sense.

It also has the benefit of being the same as what scala-parser-combinators uses, which is a + from a learnability standpoint

sirthias commented 9 years ago

I don't like foo ~! bar because the visual distance to foo ~ !bar is too small for my taste. And the one character more shouldn't be too bad.

lihaoyi commented 9 years ago

sounds good to me. I have some more feature requests I'll send to the mailing list