pegex-parser / pegex-pm

Pegex Parser for Perl

Why is the start rule feature undocumented? #77

Open bpj opened 4 years ago

bpj commented 4 years ago

Why is the start rule feature undocumented? I have a very good use case for it. Should I refrain from using it?

mohawk2 commented 4 years ago

Sorry for responding slowly. It's documented that the first rule is the "start" rule. Can you spell out a bit more what problem you're facing?

bpj commented 4 years ago

Thanks for answering.

Looking at the code I found that Pegex::Grammar has an undocumented attribute start_rules which takes an arrayref of alternative start rules, and the parse method of Pegex::Parser takes the name of one of these rules as an undocumented third (second not counting the invocant) argument and will use the named rule as start rule if that argument is present.

This feature is useful for me because the DSL I'm working on takes a path through a data structure as part of its main input but also allows bash-like indirection where a path can be fetched from a value in the data structure itself, parsed and resolved while the AST is being evaluated. Since the syntax for these dynamically obtained paths is the same as for paths in the main input, and hence a subset of the main grammar can be used to parse them, it makes sense to use the undocumented start rule feature for this. It also makes development and maintenance a lot easier since I can keep the whole grammar in a single file and a single module rather than having the subset for parsing paths in a separate file and concatenating it with the rest in order to parse the main grammar.

I have used this feature for several days now and it seems to be fully functional. The only problem is that it is not part of the documented API and so I'm worried that it might go away and that it may not be safe to rely on it.

On Tue 24 March 2020 at 08:45, mohawk2 notifications@github.com wrote:

Sorry for responding slowly. It's documented that the first rule is the "start" rule. Can you spell out a bit more what problem you're facing?


mohawk2 commented 4 years ago

Why not make a PR to document it? @ingydotnet any thoughts?

@bpj As an alternative thought, it sounds like you're taking quite a code-driven approach to your parsing. Have you considered a more data-driven approach whereby you produce the entire parse-tree (which means you'd only need to call parse on the top-level document), then give it semantic meaning in a following phase?

bpj commented 4 years ago

@mohawk2 I think you don't understand. It has everything to do with my "approach" to the data.

  1. First I parse the input text which contains certain directives, some of which contain "paths" to some location in a data structure which as yet is unavailable.

  2. The AST is traversed/evaluated to produce the output. Only now is the data structure which the "paths" point into provided as an argument to the evaluation method.

  3. Some "paths" point to values which themselves contain strings specifying "paths". These secondary/"indirect" paths must now be parsed using a subset of the main grammar and the values they point to are then retrieved from the same data structure, which was not available during the main parsing phase.

There simply is no way both paths can be parsed during the same phase, because the secondary/indirect path does not yet "exist" during the main parsing phase.

As a concrete example suppose the main text contains a path !/foo/1/bar, where the ! indicates that the value pointed to by this path contains the path to the actual value. Now in phase 2 the evaluation method is passed a data structure looking like this (represented as YAML for brevity):

foo:
  - some value
  - bar: '/biz/buz/quux'
    # presumably more data here
biz:
  buz:
    quux: The actual data
    # presumably more data here
# presumably more data here

When the evaluator sees the piece of the AST representing the primary path /foo/1/bar it fetches the value pointed to by that path, which is the "secondary" path /biz/buz/quux. The evaluator now calls on a parser instance to parse that path using the subset of the grammar which parses paths, and then the evaluator fetches the value "The actual data" pointed to by that path.
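The two phases described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the DSL's real implementation: parse_path is a hypothetical stand-in for "parse with the path rule as start rule" and handles only plain keys, and the node layout (indirect, keys) is invented for the example.

```python
def parse_path(text):
    """Stand-in for a parse that starts at the path rule: '/a/1/b' -> ['a', '1', 'b']."""
    if not text.startswith("/"):
        raise ValueError(f"not a path: {text!r}")
    return text[1:].split("/")


def parse_main(text):
    """Phase 1: parse the main input. No data structure is available yet."""
    indirect = text.startswith("!")
    return {"indirect": indirect, "keys": parse_path(text[1:] if indirect else text)}


def fetch(data, keys):
    """Walk the data structure: lists are indexed by integer, mappings by key."""
    for key in keys:
        data = data[int(key)] if isinstance(data, list) else data[key]
    return data


def evaluate(node, data):
    """Phase 2: the data is supplied only now, so indirect paths are parsed here."""
    value = fetch(data, node["keys"])
    if node["indirect"]:
        # The fetched value is itself a path string and needs a second parse.
        value = fetch(data, parse_path(value))
    return value


data = {
    "foo": ["some value", {"bar": "/biz/buz/quux"}],
    "biz": {"buz": {"quux": "The actual data"}},
}
ast = parse_main("!/foo/1/bar")   # phase 1
result = evaluate(ast, data)      # phase 2
print(result)
```

The second call to parse_path inside evaluate is the step that needs a path-only entry point into the grammar, because its input does not exist until the data arrives.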

Since the syntax for specifying paths is the same in both cases it is only natural to parse the "secondary" path using the same grammar, but using the path rule as start rule instead of the full_input rule. Note that the path syntax is a bit more complex than just / (: SLASH WORD+ )+ /, since there can be keys which don't match / WORD+ /: even "simple" keys are Unicode aware, so that the actual regex for matching a "word" is more along the lines of

/ (: (= \pL ) \X (: \p{Dash}? (= [\pL\pN] ) \X | _ )* ) /

Please see the perlre and perluniprops documentation if you don't know what these escapes mean. Basically: a "word" starts with a letter in the Unicode sense, followed by zero or more underscores/letters/numerics in the Unicode sense, possibly with following combining diacritical marks and possibly separated by dashes in the Unicode sense. While this is a regex, it is complicated enough that I don't want to have to maintain it in more than one place!

There are also keys which are "quoted" using angle brackets and may include whitespace, character escapes, a syntax for references to characters by codepoint and some other things, notably slashes. /people/<Kurt G$#<0xf6>del>/<email/url> is a possible example (I won't go into the matter of matching hash keys with Unicode normalization!). So you can't just skip over the path using some regex and parse it later; you have to parse the path in the main input, and each key in it, to see where the path ends. Again, I would like to not need to keep the same piece of grammar in more than one place.

Besides, the "pointy-quoted string" syntax is also used in other places, and there is also (although not permitted in paths) a variant which uses the Unicode pointy brackets ‹…› and a "double-quoted" variant with <<...>> or «…» which allows data interpolation. So there are already four grammar rules, each with its recursive subrule for nested balanced delimiters, which are very similar, and which I want to keep all in one place. (FWIW there is a point in not using ordinary ASCII quotes and backslash escaping: you are supposed to be able to use this syntax inside YAML or JSON quoted strings without ending up in Escaping Level Hell! Thus the ASCII quotes and backslash are not used in my DSL syntax.)
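To illustrate why quoted keys rule out a naive split on slashes, here is a small Python sketch. It assumes only that angle brackets nest and balance, as the example path suggests; character escapes, codepoint references and the Unicode bracket variants are not modelled, and split_path is a hypothetical helper, not part of any grammar shown in this thread.

```python
def split_path(path):
    """Split a path into keys, treating '/' inside balanced <...> as part of the key."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    keys, current, depth = [], [], 0
    for ch in path[1:]:
        if ch == "<":
            depth += 1
        elif ch == ">":
            if depth == 0:
                raise ValueError("unbalanced '>'")
            depth -= 1
        if ch == "/" and depth == 0:
            # A slash outside any brackets ends the current key.
            keys.append("".join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced '<'")
    keys.append("".join(current))
    return keys


example = "/people/<Kurt G$#<0xf6>del>/<email/url>"
naive = example.split("/")[1:]   # breaks the last key in two
careful = split_path(example)    # keeps quoted keys intact
print(naive)
print(careful)
```

Even this simplified version has to track nesting depth, which a single regular expression cannot do in general; the full syntax with escapes would need real grammar rules.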

I hope this explains better what I mean. Note that English isn't my native language, which unfortunately may mean that I don't know the right words to use for some concepts.

I'll be happy to take a stab at documenting the alternative start rule syntax if there is an interest.

mohawk2 commented 4 years ago

I don't understand why you wouldn't have a rule called something like pathspec. You could then, in the original parse run (not requiring a second call), have a rule something like:

dollar-ref: TICK BANG pathspec TICK

That way the original AST would contain the pathspec already parsed.

Are you sure you're not solving the wrong problem here? :-)

bpj commented 4 years ago

Of course the grammar for the whole language would contain the rule, and an AST from a parse of a whole text would contain the paths contained in that text, but some paths are fetched from elsewhere after the whole text/program has already been parsed. Now how would I parse a string containing a path fetched from elsewhere, which is not embedded in any other text, without specifying the rule for parsing a path as the start rule instead of the top rule used when parsing a whole program?

The problem is that I need to parse some strings using a subset of the grammar. I can't see how I can do that without either

I can't see what would be wrong with the second approach. I could of course set things up so that the grammar always parses either a whole text or a bare path, but that seems wrong, since sometimes I want a whole text and sometimes I want a bare path, but never either/or.

mohawk2 commented 4 years ago

My gut says that if you provide a suitable subset of your program, I can provide an answer. Please prove me wrong so we can justify this API change :-)

ingydotnet commented 4 years ago

I think you both misunderstand what start_rules is for.

It is a set of rules passed to the Pegex compiler. The compiler takes a textual Pegex grammar and turns it into a grammar object. That's phase 1.

Then it does a combinate phase. It takes the starting rule and follows all the rule references and does certain combining effects. Any rule that is not reached in this process is removed from the grammar object. Note: they don't need to be removed but currently that's what happens.

So start_rules is a list of alternate starting rules whose trees contain rules that you want to survive the combinate phase, that otherwise would not.
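The pruning effect described here can be pictured as a reachability walk over rule references. The Python sketch below is a toy model under that assumption, not Pegex's actual combinate code; the grammar is reduced to a mapping from rule name to referenced rule names, and all rule names except full_input and path are made up.

```python
def reachable_rules(grammar, start_rules):
    """Keep only the rules that can be reached from at least one start rule."""
    keep, todo = set(), list(start_rules)
    while todo:
        name = todo.pop()
        if name in keep:
            continue
        keep.add(name)
        todo.extend(grammar[name])  # follow every rule reference
    return {name: refs for name, refs in grammar.items() if name in keep}


grammar = {
    "full_input": ["directive", "text"],
    "directive": ["path"],
    "path": ["key"],
    "key": [],
    "text": [],
    "scratch": ["key"],  # hypothetical rule never referenced from full_input
}

default_only = reachable_rules(grammar, ["full_input"])
with_extra = reachable_rules(grammar, ["full_input", "scratch"])
print(sorted(default_only))
print(sorted(with_extra))
```

In this model path survives with the default start rule alone because full_input reaches it, while scratch survives only when it is listed as an additional start rule.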

Now there is a related concept in Pegex::Parser of a starting rule. Look in Pegex/Parser.pm and you'll see:

sub parse {
    my ($self, $input, $start) = @_;

You can do a parse with the grammar using an alternate starting rule. This sounds like what you are trying to do. You only need the start_rules attribute if the compiler is throwing out the non-default rules you need during its compile/combinate phase.

It doesn't sound like you need start_rules at all because the rule in question is already part of your main start rule, so it will be available also for your alternate start rule.
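As a language-neutral picture of one grammar with two entry points, here is a toy Python parser whose parse method takes an optional start rule, mirroring the ($self, $input, $start) signature quoted above. Everything else is an assumption made for the sketch: the class, the rule_ method naming, the AST shapes and the simplified full-input syntax (plain text mixed with /path and !/path) are hypothetical and are not Pegex.

```python
import re


class ToyParser:
    """Toy stand-in for a grammar with named rules."""

    default_start = "full_input"
    _key = re.compile(r"/(\w+)")
    _text = re.compile(r"[^/!]+")

    def parse(self, text, start=None):
        # Use the named rule if given, otherwise the default start rule.
        rule = getattr(self, "rule_" + (start or self.default_start))
        pos, tree = rule(text, 0)
        if pos != len(text):
            raise ValueError(f"unparsed input at offset {pos}")
        return tree

    def rule_path(self, text, pos):
        keys = []
        while (m := self._key.match(text, pos)):
            keys.append(m.group(1))
            pos = m.end()
        if not keys:
            raise ValueError(f"expected a path at offset {pos}")
        return pos, {"path": keys}

    def rule_full_input(self, text, pos):
        nodes = []
        while pos < len(text):
            if text.startswith("!/", pos):
                # Indirect path: same path rule, wrapped in a different node.
                pos, node = self.rule_path(text, pos + 1)
                node = {"indirect": node["path"]}
            elif text[pos] == "/":
                pos, node = self.rule_path(text, pos)
            else:
                m = self._text.match(text, pos)
                if not m:
                    raise ValueError(f"unexpected character at offset {pos}")
                pos, node = m.end(), {"text": m.group()}
            nodes.append(node)
        return pos, nodes


parser = ToyParser()
program_ast = parser.parse("see !/foo/1/bar now")        # default start rule
path_ast = parser.parse("/biz/buz/quux", start="path")   # alternate start rule
print(program_ast)
print(path_ast)
```

The path rule is defined once and is used both inside the full-input rule and as a start rule of its own, which is the single-source property the thread is about.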

I hope I understood things right, and that this is helpful.

bpj commented 4 years ago

You can do a parse with the grammar using an alternate starting rule. This sounds like what you are trying to do.

Yes, that's what I'm doing, successfully. It's only that since the alternate start rule feature is undocumented I was concerned that it may not be fully functional — although after 2+ weeks of using it that concern is gone — or that it might go away, so I'm mostly looking for assurance that it won't go away before I depend on it. As I said I'd prefer not to have to keep the subset of rules used for parsing path specifications in a separate file/string since (a) keeping track of what should go where is an extra hassle, and (b) keeping everything in one place makes inlining the compiled grammar much less problematic.

You only need the start_rules attribute if the compiler is throwing out the non-default rules you need during its compile/combinate phase.

I understand that. I've been using the start rule feature for testing some subsets of the grammar during development, and sometimes I had to use the start_rules attribute, but it's correct that I don't need it for path specs, the subset that will be used as alternate starting point during production.

ingydotnet commented 4 years ago

The optional start rule feature will not go away. It should be documented. Feel free to make a pull request if you'd like to do that.

bpj commented 4 years ago

Thanks! I'll look into making a pull request.

bpj commented 4 years ago

If it's OK I'll leave this issue open as a reminder until I've made that documentation PR.