tolmasky / language

A fast PEG parser written in JavaScript with first class errors
languagejs.com
MIT License

Ignore matches #6

Open kkaefer opened 13 years ago

kkaefer commented 13 years ago

It would be useful to be able to match something without adding it as a child node. E.g. I usually don't care about whitespace, but I have whitespace nodes littered all over my AST. The excellent LEPL uses the ~ (Drop) operator to match input tokens but ignore them.

tolmasky commented 13 years ago

Hi kkaefer,

I have considered this syntax addition (and it has been suggested to me by others as well), and I am not yet 100% convinced we should add it (but I will admit I am very tempted). Let me list some of my concerns and we can work our way from there.

From a philosophical point of view, language.js' MO has always been "this is not the language transformation step, this is the tagging step". In other words, it may make more sense to think of language.js as a syntax highlighter: you are going through and annotating the text, giving it structure that way, rather than evaluating it (that is, language.js produces a CST instead of an AST). For an example of a real-life inconsistency that would arise, consider the innerText property of all nodes. Say we have "x y z" being parsed as:

+ parent
+--x
+--y
+--z

As you can see, we've dropped the whitespace here. Calling innerText on the x, y, and z nodes works as expected, returning "x", "y", and "z". However, counterintuitively, calling innerText on the parent returns "x y z", so there is a discrepancy of information. The nodes don't actually store any strings, but are rather ranges that point to the original source (again, think of this as "tagging" the document), so we can't simply change the parent to "xyz" easily (and this is probably not desired either). The question thus is whether we are comfortable with having this discrepancy (maybe we are and it is not a big deal) -- I don't know the answer yet.
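To make the discrepancy concrete, here is a minimal sketch. It assumes a hypothetical node shape (a name plus a start/end range into the source) and hypothetical helpers makeNode and innerText; none of this is language.js's actual internal layout or API.

```javascript
// Hypothetical node shape: each node stores only a name and a [start, end)
// range into the shared source string. Not language.js's real internals.
const source = "x y z";

function makeNode(name, start, end, children = []) {
    return { name, start, end, children };
}

// The text of a node is derived from its range; it is never stored on the node.
function innerText(node) {
    return source.slice(node.start, node.end);
}

const x = makeNode("x", 0, 1);
const y = makeNode("y", 2, 3);
const z = makeNode("z", 4, 5);

// The whitespace nodes have been dropped, but the parent still spans the
// whole input, so its range covers the spaces as well.
const parentNode = makeNode("parent", 0, 5, [x, y, z]);

const fromChildren = parentNode.children.map(innerText).join("");
console.log(fromChildren);          // "xyz"
console.log(innerText(parentNode)); // "x y z"
```

Because the parent is just a range, making it report "xyz" would mean giving up the "nodes are ranges" model, which is the trade-off in question.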

kkaefer commented 13 years ago

Maybe this feature could be added by not dropping them at parse time, but skipping over "dropped" tokens at traversal time, similar to how traversesTextNodes works.

tolmasky commented 13 years ago

Yeah, that is certainly an option: either have something like skippedNodeNames: ["WhiteSpace", "SomethingElse", etc.], or traverse the tree and manually remove them yourself with something like tree.removeNodesNamed(...).
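A rough sketch of both options, assuming a hypothetical tree of { name, children } objects. The traverse function, its skippedNodeNames option, and removeNodesNamed are written here purely for illustration; they borrow names from this thread and are not existing language.js APIs.

```javascript
// Hypothetical tree shape; not produced by language.js itself.
const tree = {
    name: "parent",
    children: [
        { name: "x", children: [] },
        { name: "WhiteSpace", children: [] },
        { name: "y", children: [] },
        { name: "WhiteSpace", children: [] },
        { name: "z", children: [] },
    ],
};

// Option 1: leave the tree intact and skip unwanted nodes while traversing.
// Skipping a node also skips everything beneath it.
function traverse(node, { skippedNodeNames = [], enter }) {
    if (skippedNodeNames.includes(node.name)) return;
    enter(node);
    for (const child of node.children) {
        traverse(child, { skippedNodeNames, enter });
    }
}

// Option 2: build a pruned copy of the tree without the named nodes.
// This version returns a new tree instead of mutating the original.
function removeNodesNamed(node, names) {
    return {
        ...node,
        children: node.children
            .filter((child) => !names.includes(child.name))
            .map((child) => removeNodesNamed(child, names)),
    };
}

const visited = [];
traverse(tree, {
    skippedNodeNames: ["WhiteSpace"],
    enter: (node) => visited.push(node.name),
});
console.log(visited); // ["parent", "x", "y", "z"]

const pruned = removeNodesNamed(tree, ["WhiteSpace"]);
const prunedNames = pruned.children.map((child) => child.name);
console.log(prunedNames); // ["x", "y", "z"]
```

Either way the parse result keeps every node, so range-based properties like innerText stay consistent with the source; only the consumer's view of the tree changes.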

In a world where we did add the explicit operator, it might be nice to be able to apply it to rule definitions as well:

~WhiteSpace = ... // now anywhere WhiteSpace is used it is dropped, that way you don't have ~'s all over your grammar.
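For comparison, here is roughly what a rule-level drop could look like in a toy parser-combinator setting. Everything below (regex, drop, sequence, and the Letter rule) is a hypothetical sketch under the assumption that a parser is a function from (input, pos) to { pos, nodes } or null; it says nothing about how language.js would implement the operator.

```javascript
// A parser is (input, pos) => { pos, nodes } on success, or null on failure.
const regex = (name, re) => (input, pos) => {
    // The sticky flag makes exec() match only at lastIndex.
    const sticky = new RegExp(re.source, "y");
    sticky.lastIndex = pos;
    const match = sticky.exec(input);
    if (match === null) return null;
    const end = pos + match[0].length;
    return { pos: end, nodes: [{ name, start: pos, end }] };
};

// drop(p): consume whatever p matches, but contribute no nodes.
const drop = (parser) => (input, pos) => {
    const result = parser(input, pos);
    return result === null ? null : { pos: result.pos, nodes: [] };
};

// sequence(...ps): run each parser in order, concatenating their nodes.
const sequence = (...parsers) => (input, pos) => {
    const nodes = [];
    for (const parser of parsers) {
        const result = parser(input, pos);
        if (result === null) return null;
        pos = result.pos;
        nodes.push(...result.nodes);
    }
    return { pos, nodes };
};

// The rule is marked as dropped once, at its definition, in the spirit of
// the proposed `~WhiteSpace = ...`; call sites need no extra annotation.
const WhiteSpace = drop(regex("WhiteSpace", /\s+/));
const Letter = regex("Letter", /[a-z]/);

const input = "x y z";
const result = sequence(Letter, WhiteSpace, Letter, WhiteSpace, Letter)(input, 0);
const names = result.nodes.map((node) => input.slice(node.start, node.end));
console.log(names); // ["x", "y", "z"]
```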