kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Feature: support syntax to drop value of matched token #592

Open conartist6 opened 2 years ago

conartist6 commented 2 years ago

I have a fairly complicated line:

Frame -> At FrameFragments _  LB "as" _ FrameFragments RB _ LP FrameFragments CN Number CN Number RP
{% (d) => ... %}

One of the most annoying things about working with this line is that I'm continually having to fiddle with the d indexes. Is the filename fragment d[12] now? It breaks a lot. What I'd like to be able to do is specify some expressions (subgrammars? what is the correct word?) to have their results omitted from the results argument (d). If this were possible then my javascript could be:

([ function, method, file, line, col ]) => ({ function, method, file, line, col })

I don't know what the syntax would be. Actually, I have no idea what syntax nearley supports at all, since there is no API documentation, only guides.

conartist6 commented 2 years ago

I'd think the easiest way would be to define some magic constant which means to nearley "drop this result". Something like Symbol.for('nearley/drop'). Then the user could define a function returning the symbol const drop = (d) => Symbol.for('nearley/drop') and use it with a rule, i.e. as LB {% drop %}. Then nearley could filter out those symbols before providing a value of d.
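
A minimal user-land sketch of that idea, with no change to nearley itself. The names DROP, drop and withoutDropped are hypothetical; the wrapper filters the sentinel out of d before the real postprocessor sees it and passes any extra postprocessor arguments through unchanged.

```javascript
// Hypothetical sentinel; nearley itself does not define or recognise it.
const DROP = Symbol.for('nearley/drop');

// Postprocessor for sub-rules whose value should be discarded.
const drop = () => DROP;

// Wraps a postprocessor so that it only receives the values that were not dropped.
const withoutDropped = (fn) => (d, ...rest) =>
  fn(d.filter((value) => value !== DROP), ...rest);

// Simulated results for a rule like `LB FrameFragments RB`,
// assuming LB and RB both use `drop` as their postprocessor.
const matched = [drop(), 'foo.bar', drop()];

const post = withoutDropped(([name]) => ({ name }));
const result = post(matched);
console.log(result);
```

Because every rule that wants the filtering has to opt in through the wrapper, this only approximates the proposal, where nearley would do the filtering for all rules.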

KillyMXI commented 2 years ago

Destructure d and drop unneeded parts:

([,func,,,,,method,,,,file,,line,,col,]) => ({ func, method, file, line, col })

It's still fairly easy to miscount comma numbers though (I can't guarantee I made no mistakes above without testing). Refactoring your long grammar rules and breaking them into shorter parts will be a good thing to do, even if you aren't going to reuse the parts elsewhere.
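
One way to check the comma count without running the grammar is to feed the postprocessor an array of the rule's own symbol names. This is only a sketch: the "#n" suffixes are made-up labels to tell repeated symbols apart, and the array stands in for the d that nearley would pass.

```javascript
// Symbol names from the Frame rule, in order, standing in for matched values.
const symbols = [
  'At', 'FrameFragments#1', '_', 'LB', '"as"', '_', 'FrameFragments#2', 'RB',
  '_', 'LP', 'FrameFragments#3', 'CN', 'Number#1', 'CN', 'Number#2', 'RP',
];

// The destructuring postprocessor from the comment above, unchanged.
const post = ([,func,,,,,method,,,,file,,line,,col,]) =>
  ({ func, method, file, line, col });

const picked = post(symbols);
console.log(picked);
```

Each key should land on the symbol it is meant to capture (func, method and file on the three FrameFragments, line and col on the two Numbers). The same check has to be repeated whenever the rule changes, which is the maintenance cost the issue is about.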

conartist6 commented 2 years ago

I am aware of that feature of destructuring, but as you note it doesn't solve my problem -- merely moves it somewhere else. I think your caveat is particularly telling -- that though you did exactly the work you suggest, neither of us is sure if it's right. The library could provide a better solution.

KillyMXI commented 2 years ago

It never occurred to me to question this. In fact, I quite like the simplicity of the 1:1 match with the rule contents. A nonterminal that disappears from the rule contents based on some logic defined in a remote location is going to be hell to read and understand a month after you wrote it. I wouldn't hesitate to break a complex rule apart; it's just more convenient all around.

Anyway, at this point I think it is safe for you to start looking at different grammar-based parser generators. It doesn't seem like nearley will go anywhere. Some of the alternatives offer features such as naming (see peggy) or discarding (see @thi.ng/parse) of rule parts via special markup.

conartist6 commented 2 years ago

That's a stylistic choice, and it certainly may be a valid one. I actually did break things down more as part of the natural development of my parser code (the initial example is called CallSite now), but now there's just a different big nasty line that doesn't break down.

Call "(" _:? "eval" _ "at" _ Text _ "(" _:? Site _:? ")" ( "," _:? Site ):? _:? ")"

I could deconstruct it down arbitrarily, but since the conceptual breakdown is already complete I'd rather not.

I'm also not convinced that any of these other parsers are going to hack it for my use case. This grammar is fundamentally ambiguous, and there's nothing I can do about it since I didn't come up with the grammar.

Of course I could always fork nearley, but I've been on a forking spree lately and I need to settle down and bring one set of projects to completion.

KillyMXI commented 2 years ago

I've been working on some weird stuff lately and I figured out that nearley has fundamental limitations that make it not work for me either (only CFG and only a very limited way to discard alternative branches). So I'm about to publish a parser combinators package (because of course there is no parser combinators package that would work for me either).

Why _:? tho? By convention, _ is an optional whitespace already, and __ is a mandatory whitespace.

Ok, of course there are going to be edge cases but I think I'd deal with this line somehow or live with it (except for underscores - extra :? make it quite noisy).

conartist6 commented 2 years ago

Ah right, I'll update that. __s sure would make it significantly more readable.

conartist6 commented 2 years ago

Oh somehow the first part of your message got italicized and I thought it was a quote. I've already separated this code out into two parsers so that I can prune branches on a line-by-line basis. I'll definitely check out your combinators package when it's ready.

KillyMXI commented 2 years ago

Looking back at the problem, after referring to peggy and @thi.ng/parse, it would still make sense to introduce named entries - add them to d besides numbered entries. I think that would be the nicest and non-breaking way to address this issue.

x! syntax in @thi.ng/parse is cool, but I think it will make the whole thing somewhat harder to understand. If designing a system from ground up, I would've tried to invert it and come up with named entries...

I've opened another issue for branch pruning: #591 (No hopes for nearley future though.)

conartist6 commented 2 years ago

Yeah named entries would be great, and even better if they were passed to the postprocessor in a separate argument. If the callback became (tuple, dict, ...args) => value it would be trivial to write function dict(tuple, dict) { return dict; }. In a lot of common cases this would eliminate the need for custom postprocessing.
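
A rough sketch of what that calling convention could look like. Everything here is hypothetical: nearley has no named entries, and callPostprocessor and the shape of matches are invented only to show how a (tuple, dict, ...args) callback might be fed.

```javascript
// Hypothetical: each matched symbol may carry a name; unnamed values are positional only.
function callPostprocessor(matches, post, ...args) {
  const tuple = matches.map((m) => m.value);
  const named = {};
  for (const { name, value } of matches) {
    if (name !== undefined) named[name] = value;
  }
  // Existing optional arguments would shift one position to the right.
  return post(tuple, named, ...args);
}

// The predefined postprocessor from the comment above.
function dict(tuple, dict) { return dict; }

// Simulated match of something like: "(" file ":" line ")"
const matches = [
  { value: '(' },
  { name: 'file', value: 'app.js' },
  { value: ':' },
  { name: 'line', value: 10 },
  { value: ')' },
];

const site = callPostprocessor(matches, dict);
console.log(site);
```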

EDIT: oh that's what you're saying: "add them to d besides numbered entries"

KillyMXI commented 2 years ago

There are existing optional args for postprocessors after d. Adding anything at the end would be unpretty and/or inconvenient. Switching to options object pattern would be a breaking change. Accepting an options object as an alternative to existing args would complicate the system.

So, adding stuff to d as an object would be least invasive and also pretty natural - just string keys in addition to numeric indices.

conartist6 commented 2 years ago

Yes, I'm proposing a breaking change -- just shifting the current optional callback args right. There doesn't seem to be any reason not to propose a breaking solution, since it could likely only happen in a fork anyway.

conartist6 commented 2 years ago

The difference if you put the string keys and the indexed keys on the same object is that now it's useless as an AST node -- you need some custom code to copy the part you actually want into a new allocation (i.e. a new object or array), which is just work that doesn't need to be done. If the parser itself can build your AST nodes and the postprocessor can (usually) be one of a few predefined functions, you'll have a physically smaller parser with fewer paths, which churns less memory and gets fully optimized more quickly by the runtime.
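
The contrast can be sketched in plain JavaScript. Both shapes are hypothetical, since nearley produces neither today; the values are made up.

```javascript
// Option A: string keys live on the same array as the numeric indices.
const d = ['(', 'app.js', ':', 10, ')'];
d.file = d[1];
d.line = d[3];

// The array mixes positional and named entries...
const keysA = Object.keys(d);

// ...so a clean AST node needs a copy into a new allocation.
const nodeA = { file: d.file, line: d.line };

// Option B: named entries arrive in their own object, separate from the tuple.
const tuple = ['(', 'app.js', ':', 10, ')'];
const named = { file: tuple[1], line: tuple[3] };

// The object can be used as the node directly, with no copy.
const nodeB = named;

console.log(keysA, nodeA, nodeB);
```

In option A the enumerable keys of d include the numeric indices as well as file and line, so the object cannot be handed on as a node as-is.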

KillyMXI commented 2 years ago

Good point.

KillyMXI commented 2 years ago

Although a sensible AST would require at least one extra property not parsed from the input - node type.

Trying to add that automatically as well will start to look like a system optimized for a single purpose. If I were designing a parser optimized for AST generation I might've done the whole thing differently.

Keeping things general purpose as they are while optimizing for a particular single goal in mind might lead to weird compromises.

conartist6 commented 2 years ago

A node's type is worth considering, since a sane parser would ensure that the type property was in every node's first slot. It's not a particularly difficult requirement to incorporate into an API though.
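
For instance, assuming the hypothetical (tuple, dict) callback discussed earlier in the thread, a small factory could put type first and the named entries after it. The helper name node and the 'CallSite' value are illustrative only.

```javascript
// Hypothetical helper: returns a postprocessor that writes `type` first,
// followed by whatever named entries the parser collected.
const node = (type) => (tuple, dict) => ({ type, ...dict });

const callSite = node('CallSite');

// Simulated arguments; a real parser would supply these.
const result = callSite(
  ['(', 'app.js', ':', 10, ')'],
  { file: 'app.js', line: 10 },
);
console.log(result);
```

Since string keys keep their insertion order, type is always the first own property of the resulting object.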

KillyMXI commented 2 years ago

"I'll definitely check out your combinators package when it's ready."

Finally published and can get back to where I was before it started, sigh: https://github.com/mxxii/peberminta