kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.59k stars 232 forks

disclude `undefined` from data array, provide notreegen #55

Open matthew-dean opened 9 years ago

matthew-dean commented 9 years ago

Related to post processing and eliminating nodes, is there a value I can return from a matched token that will eliminate it from output automatically?

For example, it would be nice if I could just drop optional whitespace, such that--

selector _ combinator _ element

-- would return an array of 3 elements instead of 5. I thought I could write:

_ -> null

-- but then that just returned an actual null value. I can see why I might want to post-process based on null, so I'm not sure what to suggest, maybe a special nearley token? Like: %null% or something?
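[Editor's note: for reference, the workaround available today is to have `_` postprocess to `null` and strip the nulls in the parent rule's postprocessor. This is a sketch, not a nearley feature; `dropNulls` is a hypothetical helper name.]

```javascript
// Hypothetical helper: remove null entries from the data array `d`
// that nearley passes to a postprocessor.
function dropNulls(d) {
  return d.filter(function (x) { return x !== null; });
}

// For `selector _ combinator _ element`, if `_` postprocesses to null,
// the raw data array looks like this:
var d = ["a", null, ">", null, "b"];
dropNulls(d); // returns ["a", ">", "b"] — 3 elements instead of 5
```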

kach commented 9 years ago

I've been thinking about this for a while.

PegJS solves this problem the "ugly" way: you define a "tag" for each token you want to save, as in:

something -> name:string _ "=" _ value:number {%
    return [name, value]
%}
# cow=45 becomes ['cow', 45]

I dislike this because—as a matter of design principle—I strictly want postprocessors to be functions, not bits of JS code that are spliced in somewhere. On the other hand, I like having names for semantic clarity, so perhaps a hybrid approach like this might work?

something -> string(name) _ "=" _ number(value) {%
    function(d) {
        return [d.name, d.value]; // can also do [d[0], d[4]]
    }
%}
# cow=45 becomes ['cow', 45]

Basically, this would overload the array with name-value pairs as defined in the grammar. It also uses parentheses instead of : because it looks more natural and doesn't cause confusion with the a:+ syntax.
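[Editor's note: to make the overloading concrete — since JavaScript arrays are objects, named entries could be attached as extra properties on the ordinary data array. A minimal sketch; `nameEntries` is a hypothetical helper, not proposed nearley API.]

```javascript
// Attach named aliases onto a data array. `names` maps a property
// name to an index in d, e.g. {name: 0, value: 4}.
function nameEntries(d, names) {
  for (var key in names) {
    d[key] = d[names[key]];
  }
  return d;
}

// For `string(name) _ "=" _ number(value)` matching "cow=45":
var d = nameEntries(["cow", null, "=", null, 45], { name: 0, value: 4 });
d.name;  // "cow" — named access
d[4];    // 45    — positional access still works
```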

Thoughts?

matthew-dean commented 9 years ago

Well, first, I'm not sure that's exactly what I'm asking. Because I can do d[0] + d[4] now, which isn't bad at all. Unless you're saying that if you tagged tokens in the expression, then they would be the only ones returned as data? If so, that would work, because then I wouldn't have to write functions when I wanted to omit spaces.

As far as the syntax, I like the cleanliness of the grammar right now, and being able to move functions off to the side. And I think the colon assignment is actually more consistent with what you have for :? and :*. (Although, incidentally, many grammars can be copy / pasted verbatim, except they have ? and * respectively without colons, kinda curious about that.)

I think your parentheses are problematic because they visually conflict with capture groups. Is "name" an entity reference in a capture group, or is it a tag? So, bearing those things in mind, what about something leaner, like:

something -> name:w _ "=" _ value:d {%
    function(d) {
        return d;
    }
%}
# cow=45 becomes ['cow', 45]

That is, strings are tagged :w for alphanumeric word (string), or :d for digit (matches their regex counterparts, as do ?, *, +, so the logic holds up, for the most part?). You could still mix tag names with types.

something -> name:w(animal) _ "=" _ value:d(herdsize) {%
    function(d) {
        return [d.animal, d.herdsize];
    }
%}

Maybe a little messy. But I really like the idea that assigning a label auto-converts the data into an object rather than an array. Cool idea, and great for self-documenting the grammar.

Also, I would LOVE to also be able to do something like this:

something -> ('#' ident):w

-- as a shorthand for return d[0] + d[1] (assigning the capture group to a single tag to return it as a string instead of an array -- does that hold up logically?)

matthew-dean commented 9 years ago

Alternatively, you could use :a and :0 for alpha string and numeric, as JS types don't line up exactly with w and d regex chars.

Side note - the home page says you might like some help optimizing the output JS. We could talk about that some time.

YafahEdelman commented 9 years ago

The last idea of capturing sequences is definitely something we can do. I've experimented with similar things in programming language design. I think we should approach this like one would approach typing in some of the more functional programming languages; union types and the like all mimic BNF-style structures. I've always loved the parallel between grammars and type systems, and we could definitely borrow syntax from type systems for nearley.

YafahEdelman commented 9 years ago

Maybe we could encase regexes (such as "[a-z]") in slashes in order to free up square brackets for other use? We could even use Nearley to convert the regexes into plain Nearley.

kach commented 9 years ago

Ah, I think I get what you're saying now. Do you want a mechanism for making parse-tree-generation optional for certain nonterminals? That optimization should be pretty easy to implement in the parser, and the only question is how to specify it syntactically. Some ideas:

@notreegen thing
thing -> a b c d e
OR
thing -> a b c d e @notreegen
OR
thing -!> a b c d e

notree would prevent tree generation (recursively) for that rule (or nonterminal?), and return undefined as the value.

Additionally, it might help to simply not include undefined values in the d array? So rules that postprocess to undefined (and, therefore, notree rules) simply don't show up in the array.

The above, of course, would be very annoying unless we also had name-tagged rules because specifying array indices will be tricky if certain things will be missing!


In any case—I hadn't considered () clashing with capture groups. How about using literal dots? So you'd have:

thing -> string.animal _ "=" _ number.herdsize {% function(d) {return [d.animal, d.herdsize]; } %}

It forms a nice visual mnemonic with the object (as you can see, I'm capitalizing on having a LESS guy here—designing elegant syntax is hard!).

Finally: I'm not sure I understand how :w and :d (or :a and :0) would work, to be honest. Are they custom-defined names? Built-ins? What do they do?

matthew-dean commented 9 years ago

Well, I think because Nearley only supports the single-character-match subset of regex, it makes sense to use the square brackets. Slashes, I think, would lead people to expect full regex support. You could also use braces as a third option. If you wanted to do it like traditional typing, the cast often goes in front of the var. So here's another variant.

something -> {w}name:{animal} _ "=" _ {d}value:{herdsize} {%
    function(d) {
        return [d.animal, d.herdsize];
    }
%}

Dunno, that gets messy again a bit. Maybe the cleanest is:

something -> {w}name:animal _ "=" _ {d}value:herdsize {%
    function(d) {
        return [d.animal, d.herdsize];
    }
%}

Orrrrr you could just use the token optionally as the object key

something -> {w}name:{} _ "=" _ {d}value:{} {%
    function(d) {
        return [d.name, d.value];
    }
%}

Of course, by this time, it's becoming so cluttered as to not offer any syntactic advantage over a postprocessing function. I like the name:w form the best for leanness. And perhaps if you wanted to label output, it could be a bit more separated and labeled as an object literal.

something -> (name:w _ "=" _ value:d){animal, herdsize} {%
    function(d) {
        return [d.animal, d.herdsize];
    }
%}
# cow=45 becomes ['cow', 45] as d = { animal: 'cow', herdsize: 45 }
rwindelz commented 9 years ago

Non-terminals define types and their constructors implicitly. The default value returned by a Nearley parser is just the array containing the parse values from each token in the definition. This causes us to lose a bit of information. One formalism to follow might be something like:

a -> b "=" c {% function (d) { return { "name":"a", "lvar":d[0], "expr":d[4] }; } %}

That's a substantial bit of boilerplate for each rule, which we might replace with a small DSL. Consider:

a -> b "=" c {{%% "lvar":0, "expr":4 %%}}

which generates the earlier postprocessor. NB: the postprocessor automatically includes the 'type name' "name":"a".
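[Editor's note: a sketch of how such a DSL could expand into an ordinary higher-order postprocessor. `makeTagged` is a hypothetical name, not part of nearley.]

```javascript
// Build a postprocessor from a type name and a {property: index} spec.
function makeTagged(typeName, spec) {
  return function (d) {
    var out = { name: typeName };
    for (var key in spec) {
      out[key] = d[spec[key]];
    }
    return out;
  };
}

// Expansion of: a -> b "=" c {{%% "lvar":0, "expr":4 %%}}
var postprocess = makeTagged("a", { lvar: 0, expr: 4 });
// postprocess(["x", null, "=", null, "42"])
//   returns { name: "a", lvar: "x", expr: "42" }
```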

There are some parser generators that allow you to mark which non terminals should not return a parse value.

matthew-dean commented 9 years ago

@Hardmath123

So rules that postprocess to undefined (and, therefore, notree rules) simply don't show up in the array.

Yes, that would be ideal. I saw the undefineds when I forgot to return a value from the postprocessing function. But returning no value should do exactly that: return no value! An even simpler "undefined" or "notree" tag would help too, although again, I like the way you've simplified tokens. So I probably like the ! option best.

However, since I read -> as a single operator, this maybe makes more sense?

thing !-> a b c d e

Or

!thing -> a b c d e
# by default, never add to token output

Maybe the last one is best, to tag the token itself, because you could use that tag in referencing too.

two -> "blah"   # normally output
otherthing -> one !thing !two
# matches but omits thing and two from data output

Finally: I'm not sure I understand how :w and :d (or :a and :0) would work, to be honest. Are they custom-defined names? Built-ins? What do they do?

Ah, sorry that wasn't clear. These were just symbols to cast types for those tokens, casting to string and number, respectively. However, the first isn't really needed since it's a string by default. I thought that's what you were doing when you had them labeled as.... OHHHH my bad, I altered your original example and started using "string" and "number" as my token names. Oops, I meant to write "name" and "value". I will change it.

matthew-dean commented 9 years ago

I see why I did that: you were writing `string()` to represent the type casting. Whoops.

matthew-dean commented 9 years ago

I've updated my examples to not look utterly confusing and explode your brain. Sorry about that.

matthew-dean commented 9 years ago

Orr the leanest yet:

!_ -> [\s]:*
something -> name _ !"=" _ value  {animal, herdsize:d} 

# cow=45 becomes { animal: 'cow', herdsize: 45 }

Basically the object literal following the tokens allows you to implement both labeling and typing

YafahEdelman commented 9 years ago

We could actually add full regexes by unwrapping them: the new /[a-z]+/ would get converted to the old [a-z]:+. It's definitely doable, and I think the slashes would look nicer syntax-wise.

kach commented 9 years ago

Whoops, looks like that's my bad! In my example, string and number are both names of nonterminals. They have nothing to do with typecasting—I don't see that as an issue, really (see bottom of this post).

But an even simpler "undefined" or "notree" tag would help

Ok, so this goes on the to-do list. I like the global @notree flag the best, because I don't see many cases where a nonterminal would be used both in a tree-gen and a no-tree-gen scenario. I also want it to explicitly work for all rules defined for a single nonterminal, i.e., you should never be able to do:

a !-> "a"
a -> "b"

Basically the object literal following the tokens allows you to implement both labeling and typing

Eh, I still think it's very important to have legitimate lambdas as postprocessors. You can always implement what you wrote as a higher-order-function and use it as a normal postprocessor as:

something -> name _ "=" _ value  {% makeobj("animal", "herdsize") %}

or something. Lots of room for creativity!
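[Editor's note: `makeobj` isn't defined anywhere in the thread. One plausible implementation — purely an assumption — zips the given names, in order, with the non-null entries of `d`. That presumes the rule's filler tokens all postprocess to null, including the `"="`, which would need to be wrapped in a rule that returns null.]

```javascript
// Hypothetical: build a postprocessor that assigns the given names,
// in order, to the non-null entries of the data array.
function makeobj() {
  var names = Array.prototype.slice.call(arguments);
  return function (d) {
    var out = {};
    var i = 0;
    d.forEach(function (value) {
      if (value !== null && i < names.length) {
        out[names[i++]] = value;
      }
    });
    return out;
  };
}

// If `_` and `"="` both postprocess to null, "cow=45" yields:
var pp = makeobj("animal", "herdsize");
// pp(["cow", null, null, null, 45])
//   returns { animal: "cow", herdsize: 45 }
```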

By the way, for typecasting, you do realize you can do this in the rule for value itself, right?

something -> name _ "=" _ value
value -> [0-9]:+ {% function(d) {return parseInt(d.join(""));} %} # typed integers!

matthew-dean commented 9 years ago

By the way, for typecasting, you do realize you can do this in the rule for value itself, right?

Oh. Duh, of course.

Eh, I still think it's very important to have legitimate lambdas as postprocessors. You can always implement what you wrote as a higher-order-function and use it as a normal postprocessor ...

Yeah, if you notice from my gist, I actually have {% token(0,1) %} as one of my custom functions. But it would be neat to write:

something -> name "=" value  {0,2}

And just cherry-pick token name and value as members of the array.

I like the global @notree flag the best

I actually think the naming is a little awkward. I initially read it as "not ree" for some reason, and had to reread it a couple times. But that could be just me. So how does it work as a global flag?

kach commented 9 years ago

As a global flag, it would look like this:

_ -> whitespacechar:*
__ -> whitespacechar:+
comment -> "/*" commentchar:+ "*/" | "//" commentchar:+ "\n"
@no-tree _
@no-tree __
@no-tree comment

I prefer this because (1) you can enumerate no-tree nonterminals in one place, (2) you can't accidentally create a conflict where only some, but not all, rules for a non-terminal are no-tree'd, (3) you can quickly toggle no-tree-ness for a nonterminal without having to change it in multiple places.

matthew-dean commented 9 years ago

Er... Oh, okay, you mean global as far as the usage in the whole grammar. I thought you meant a single flag per document.

Hmm... what about also allowing it on the declaration? Oh, wait, I get it. You can have multiple "declarations" for a token, is that right? I haven't used that yet because I didn't understand it. However, I think in case of conflict, any "no-tree" flag should apply to all, because that's the most likely intent of the author (which is useful when trying to trim language verbosity :wink: ).

I still think "@no-tree" is longer than it needs to be, and the idea of no-tree isn't immediately intuitive for noobs. That is, it implies that the single token won't construct a tree, when it's actually more like omitting a branch or node. Something more semantically generic and straightforward like "no-output" or "hide" might be better, since the meaning is more obvious. Or even "collapse", since web devs have a sense of that term for omitting empty or invalid nodes.

kach commented 9 years ago

Yes, you can have multiple declarations (arrows) for a token. The pipe (|) is just syntactic sugar for multiple arrows (this is how parsing academia notates grammars).

@collapse has some connotation of "flattening", which is not really what's going on. It's not even "omitting" a node, really, it's just refusing to create the node in the first place.

I like @no-output.

Aside/note-to-self: compiling a @no-output rule with a postprocessor should throw a warning.

matthew-dean commented 9 years ago

Ah. Well, you know more about the syntax of grammar and about your parser generator, so it's likely for my suggestions to miss the mark. After all, I'm basically a noob at even the idea of grammars. But glad if some of my suggestions are useful. :smile:

bojidar-bg commented 9 years ago

Just my two cents (and my opinion about all the previous discussions...)

//(explicit, rule-bound) A prefix
@no-tree-or-no-output-or-just-match-or-whatever __ -> [\s] | __ [\s]

I somewhat like it, it is quite readable, and understandable, but as @Hardmath123 said, it can be confusing with multiple non-piped declarations.

//(explicit, rule-bound) Global configuration
__ -> [\s] | __ [\s]
@no-tree-or-no-output-or-just-match-or-whatever __

I actually dislike it, because the configuration is too much out of the context of the rule.

//(explicit, rule-bound) Changed arrow
__ -!> [\s] | __ [\s] //(This arrow looks better than !-> or -]> IMHO)

That's the one I like the most, as it is more readable than (1). As for the multiple non-piped declarations argument: it could simply be required that a single rule use only -!> or only ->.

//(explicit, use-bound)
a -> number !__ number {% /*now d is like [3,5], and not like [3,null,5]*/ %}

Tolerable, but it can be emulated by all the other variations by just changing the rule name. Also it might break existing grammars...

BTW, why does >60% of this discussion have nothing to do with the issue topic? If you ask me, this issue should be split into at least 3 issues....