Open joonas-fi opened 8 years ago
Which language are you implementing your parser in? Also I'm not quite sure I understand this statement:
I noticed there is _types field in the parser and in parse(..., options), but it seems to be unimplemented.
Can you clarify what you mean by that?
Just getting to know Canopy in JavaScript. Sorry for not being clear enough.
From http://canopy.jcoglan.com/langs/javascript.html the example (url.peg), run:
$ canopy url.peg --lang js
Now take a look at url.js. The exported parse() function has second argument options
:
var parse = function(input, options) {
options = options || {};
var parser = new Parser(input, options.actions, options.types);
return parser.parse();
};
By first noticing this "options" structure and the "types" key I was excited to think that this is precisely what I'm looking for.
But if you now take a look at the Parser
constructor:
var Parser = function(input, actions, types) {
// ... snipped ...
this._types = types;
// ... snipped ...
};
But by grepping through the url.js file, I can see that the _types
field is never accessed.
Ahh sorry, ok so now I understand the types
parameter, after reading http://canopy.jcoglan.com/types.html
It seems that currently my best bet at having the parse tree contain the matched rule name is to augment the rules with actions and implement actions in such a way that it would decorate the node with the desired rule name.
I'll post some sample code if I come up with some code that better demonstrates what I'm talking about.
I came in here to post a link but you already found what I was going to link you to :)
I would encourage you not to think of things in terms of rule names. There is a reason that rules, types, and actions are distinct things. Rules direct parsing control flow, but don't necessarily express types. One rule may include multiple nodes you want to attach types to, for example.
Think of it this way: each parsing rule compiles to a function. When you have an expression in a programming language, its meaning does not depend on the name of the function it appears in. The functions just exist to shape the set of procedures you can reuse. They direct control flow, not necessarily data structures.
It might be that you don't need the names of anything at all. If each type implements methods that do the right thing for the expression they're attached to, you should never need to know the name of the rule they appear in.
I was actually looking for the same thing, but to use it to make sure the right rules are matching when parsing various types of input. If you were going to print the generated parse tree, you get this in other libraries. I think it's not a big deal, but it might make testing the parser easier if you could compare the rule name to the node returned by the parse.
I can imagine that you wouldn't want this in production, so potentially adding a node accessor for "rule_name" as a --debug or some kind of option would be useful.
Maybe I'm missing something, but if you have complicated grammars, how do you know the right rules matched at the right time without this kind of annotation?
Any thoughts appreciated.
I've been looking at this again, and I really need it for building better error messages. My thought is something like augmenting the builders so that they set a local variable with the rule name (which you already have when you create the function (something like this):
builders/javascript.js:
212 cache_: function(name, block, context) {
213 var temp = this.localVars_({address: this.nullNode_(), index: 'this._offset', ruleName: name }),
Then, when you were actually in each function, the rule name would be pushed onto a stack of rules in the Grammar instance so that the current rule failure would have a rule name.
Now, we probably don't want to give a raw rule name to a user, but it'd be better than this:
SyntaxError: Line 22: expected [a-zA-Z], [0-9], "+", "-", ".", ":"
However, once you had the rule names, then your error handler could have a strings object of messages with "human" names for each of the rule name keys.
As far as I can see, this is a slight modification to the builder to get it to generate the correct code to set the variables and manage the stack and then modifications to the generated Parser instance so that it includes the top of the rule stack when it creates the SyntaxError instance thrown to the caller.
A few things that I'm not sure about are:
I'm trying to follow the directions in the CONTRIBUTING file, but when I try and run the tests, I get a lot of errors, and, since I'm still getting the hang of the npm environment, I'm not quite sure how to get it to generate a local package that I can play with interactively.
As far as your comment to @joonas-fi about not worrying about the rule names, I think that's fair in the normal processing sequence now that I understand how you're supposed to construct your own object graph as the rules fire. However, I don't think it's accurate when you're talking about error messages—especially when you're trying to figure out a grammar where a) you're still new to canopy, b) you're a little new to PEG and c) you're not 100% convinced exactly how to express the rules you want. :)
I'm trying to replace a hand-written parser for two existing languages with Canopy, and, after taking a few months getting sidetracked, I'm now making good progress as we're coming to an understanding. However, somehow, I need better error messages than the above, because the way they are, they're only slightly better than IBM BASIC (c. 1980) "Syntax error." messages.
Having this facility would not only make development of grammars with Canopy much faster, but it's also important to the end product you're building the parser for in the first place.
This is what I get when I try and build, and, for now, if I can just get it working the way I described with Javascript, I'll be happy. I don't need/want the other languages right now.
npm run build
npm ERR! Darwin 15.6.0
npm ERR! argv "/opt/local/bin/node" "/opt/local/bin/npm" "run" "build"
npm ERR! node v6.10.3
npm ERR! npm v3.10.10
npm ERR! missing script: build
npm ERR!
npm ERR! If you need help, you may report this error at:
npm ERR! <https://github.com/npm/npm/issues>
npm ERR! Please include the following file with any support request:
npm ERR! /opt/devel/archistry/canopy/npm-debug.log
Happy to help implement this, but I think I need a little bit of guidance on how things work to be efficient.
Cheers
I have just implemented a limited version of this focussed on improving the quality of error messages. For example, the error you now get on a parse failure looks something like this:
SyntaxError: Line 4: expected one of:
- [\s] from CanopyJson::__
- "," from CanopyJson::array
- "]" from CanopyJson::array
4 | "y": { "a": [1 }
^
I think showing users the rule name where a particular expected token comes from is helpful. I am less convinced that adding the rule names to successful parse results is beneficial; as I've elaborated in #47 is creates a strong coupling between the factoring of the grammar and the results it emits.
As a comparison, in some parsing systems you get a distinct node for each delegation a grammar contains; if a grammar contains rules like:
A <- B
B <- C
C <- "token"
Then parsing with rule A
gets you a tree of tokens containing every rule the parse traversed, i.e.:
+---+
| A |
+---+
|
V
+---+
| B |
+---+
|
V
+---+
| C |
+---+
|
V
+---------+
| "token" |
+---------+
In Canopy, parsing with rule A
produces only the leaf node for token
; you don't get extra nodes just because one rule references another. This makes refactoring grammars substantially easier because you can factor expressions out into their own rules without changing the shape of the output. I think that adding rule names to the parse tree would work against this goal in general, and it would add an extra parameter to everyone's action functions that in general I think they would be better off ignoring.
I'm on the fence about including the full stack of rule names in error messages, because it presupposes too much about the implementation and might be confusion on deeply nested rules, but I'll leave that under review. For now, I think this error message improvement is worth having.
One further thing I'll note here is that I don't want to commit to a way of identifying rule names that user programs might come to depend on. Something I'm still considering is if/how Canopy should support composing multiple grammars together, and that it likely to affect how rules are identified since it creates a namespacing problem.
For now I'm happy to include this in error messages but less keen to put them in data that might break its format in later releases.
E.g. how do I know what type of an element was matched if the parse tree does not contain the matched rule name?
Let's take a hypothetical programming language, which I'm trying to parse:
And when parsing the class definition, class can contain 0..N (propertyDefinition OR methodDefinition).
When I get the parse tree and am handling the class definition node, it's sub-elements are (ideally): 1) property definition 2) method definition 3) method definition
Sure, I could detect the type of node by seeing if the object has key "property_name" (=> it's a property definition) or if it has key "method_name" (=> it's a method definition), but that's just ugly.
I noticed there is _types field in the parser and in parse(..., options), but it seems to be unimplemented.
Any advice on this?