kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.59k stars · 232 forks

Pearley #103

Open · erezsh opened 8 years ago

erezsh commented 8 years ago

Hi, I opened this separate issue so we don't spam everyone :)

I'm glad you liked Pearley. It really is a direct port and the code is almost identical. I'll try to keep it aligned when new features are added to nearley. There is a minor difference, because I'm using an external library for lexing. But that's not significant.

I would be happy if you could explain line 67, but really I don't have a strong grasp on the algorithm anyway, so you may have to add a lot of context :) I would like to learn more about it, of course. I'm especially interested in making it faster, perhaps by pre-calculating some moves? I'm not sure.

As for ambiguity, I understand it can't be detected in the grammar. I still think it's possible to write code that will catch the common cases. Maybe I will try it some day. Another solution which would be relatively satisfying is if we could catch the ambiguity as it happens, instead of at the end of the parse. Perhaps that's impossible for some reason, I'm not sure (see: my lack of understanding of the algorithm). As for automating the debugging process - that's a nice idea, which I'm afraid to attempt :)

Sorry for the wall of text! Let me know your thoughts.

kach commented 8 years ago

Pearley is really cool! When you think it's "done", I would happily put a link to it in this repository's README file. Imagine being able to write a parser in nearley and get a Python parser for free?!

(PlyPlus seems like a pretty famous parsing module—you must share some of your "module advertising" tips with me sometime. :-) )

Okay, so the algorithm, which I've written about here, works by maintaining a table where every column represents a token of input. Each row of the table represents a partial parse (called an Earley state) and is basically the representation of a production rule's expansion, with a dot representing where you are in that production rule. This is the output of nearley-test, by the way. An example of an Earley state is

statement -> expression • ";" newline

At each iteration, nearley looks at each of these states in the current row of the table, and looks for states that are "expecting" the token you fed it. The state above expects a semicolon, so if you fed the algorithm a semicolon, it would create a fresh state in the next column of the table that looks like

statement -> expression ";" • newline

Then, it goes ahead and adds new states if the new state expects a nonterminal. Here, for example, it'll add the states

newline -> • "\r\n"

and

newline -> • "\n"

This is what is happening on line 67.

Whenever you create an Earley state where the dot is at the end (i.e. not expecting anything), you go back and find where that state was created, and push it along. So if the next input to the parse was "\n", then you would add the state

newline -> "\n" •

Since this state is complete, you would bump along its creator, getting the state

statement -> expression ";" newline •

and then you'd have to go and find where that state was created, etc.

It turns out that when you have empty rules, this process goes into an infinite loop, because an empty rule is completed as soon as it is created. This sucks, because empty rules are actually useful. For example,

maybe_whitespace -> empty | whitespace

So, give or take, that's why there's that whole epsilon_closure trickery.

Hope that vaguely explains what's going on there. As a corollary, you should be able to figure out why ambiguity needs to be allowed by this algorithm.
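The predict/scan/complete loop described above can be sketched in miniature. The following is a hypothetical toy recognizer, not nearley's actual code: nonterminals map to lists of alternatives, and each chart column is processed in order. Empty rules are deliberately left out, since they need the epsilon-closure handling mentioned above.

```javascript
// Toy Earley recognizer (illustration only, not nearley's implementation).
// Grammar: nonterminal name -> list of alternatives, each a list of symbols.
// A symbol that appears as a grammar key is a nonterminal; anything else is
// a literal token.
const grammar = {
  statement: [["expression", ";", "newline"]],
  expression: [["1"]],
  newline: [["\r\n"], ["\n"]],
};

function earleyRecognize(grammar, start, tokens) {
  // Each state: { name, rhs, dot, origin } -- "name -> rhs" with the dot at
  // position `dot`, created while reading column `origin`.
  const chart = Array.from({ length: tokens.length + 1 }, () => []);
  const add = (col, st) => {
    if (!chart[col].some(s => s.name === st.name && s.rhs === st.rhs &&
                              s.dot === st.dot && s.origin === st.origin))
      chart[col].push(st);
  };
  for (const rhs of grammar[start]) add(0, { name: start, rhs, dot: 0, origin: 0 });

  for (let col = 0; col <= tokens.length; col++) {
    for (let i = 0; i < chart[col].length; i++) {
      const st = chart[col][i];
      const next = st.rhs[st.dot];
      if (next === undefined) {
        // COMPLETE: the dot is at the end, so go back to where this state was
        // created and bump along every state that was expecting `st.name`.
        for (const parent of chart[st.origin])
          if (parent.rhs[parent.dot] === st.name)
            add(col, { ...parent, dot: parent.dot + 1 });
      } else if (grammar[next]) {
        // PREDICT: the state expects a nonterminal, so add fresh states for
        // each of its alternatives (the "line 67" step).
        for (const rhs of grammar[next])
          add(col, { name: next, rhs, dot: 0, origin: col });
      } else if (tokens[col] === next) {
        // SCAN: the expected literal matches the input token, so push the
        // state into the next column with the dot advanced.
        add(col + 1, { ...st, dot: st.dot + 1 });
      }
    }
  }
  // Accept if the start rule spans the whole input.
  return chart[tokens.length].some(
    s => s.name === start && s.dot === s.rhs.length && s.origin === 0);
}
```

For instance, `earleyRecognize(grammar, "statement", ["1", ";", "\n"])` walks through exactly the states shown above and accepts, while dropping the semicolon makes it reject.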

Now, for debugging. I would suggest that first, you create special "instrumented" postprocessors for each production rule, such that the processed data knows the production rule that was used to make it. So, take

expr -> expr "+" expr | number

and turn it into

expr -> expr "+" expr {% function(d) { return ["production1", d]; } %}
      | number        {% function(d) { return ["production2", d]; } %}

Then you parse the input that causes ambiguity, and get a data structure for each possible ambiguous parsing:

1+1+1 would yield

["production1", [ ["production2", 1], ["production1", [["production2", 1], ["production2", 1]]] ] ]

["production1", [ ["production1", [["production2", 1], ["production2", 1]]], ["production2", 1] ] ]

Then, do some sort of "tree diff" algorithm to figure out which of the productions is getting parsed in two different ways, and report it to the user. In this case, it would show that a production1 was being parsed in two different ways.
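The tree-diff step could be sketched like this. `diffParses` is a hypothetical helper, not an existing nearley plugin: it walks two tagged parses in lockstep and reports the first point where the production tags diverge.

```javascript
// Walk two parses whose nodes look like ["productionN", children] and return
// the path to the first disagreement, or null if the trees are identical.
function diffParses(a, b, path = []) {
  const tagged = x => Array.isArray(x) && typeof x[0] === "string";
  if (tagged(a) && tagged(b)) {
    if (a[0] !== b[0]) {
      // A different production was used at this position.
      return { path, left: a[0], right: b[0] };
    }
    const ka = a[1], kb = b[1];
    if (Array.isArray(ka) && Array.isArray(kb) && ka.length === kb.length) {
      for (let i = 0; i < ka.length; i++) {
        const d = diffParses(ka[i], kb[i], path.concat(i));
        if (d) return d;
      }
      return null;
    }
    // Leaf payloads (or mismatched shapes): compare directly.
    return diffParses(ka, kb, path);
  }
  return Object.is(a, b) ? null : { path, left: a, right: b };
}
```

Run on the two 1+1+1 parses above, it reports a divergence at the first child of the root: one parse put a production2 there, the other a production1, which is exactly the "production1 parsed in two different ways" the user needs to see.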

I would recommend writing this as a plugin for nearley before porting it to Python, so that I can help you out if you get stuck.

Hope that helps!

(Sorry if this is unclear, or if there are bugs in my writing. I type this in a hurry because I need to go to a meeting right now!)

erezsh commented 7 years ago

Hi, it's been a while but I have some updates!

First, thank you for the explanations.

I wrote a new library called Lark. It has your Earley implementation as the default, but also lets you switch to LALR(1) parsing with the same grammar. I'm not using PLY anymore; I wrote my own LALR(1) implementation. I can write one for nearley if you like.

It seems that in the meantime nearley has become very famous and has far surpassed plyplus (my old library) in popularity. Maybe I should be the one asking for tips.

It would be cool to have a converter from nearley grammar to lark, and vice-versa. I might just write one. It shouldn't be difficult to make an accurate translation, except for user-code of course.

kach commented 7 years ago

Hey! Lark looks super-cool; I really like the idea of having multiple engines "under the hood" to give users flexibility.

A nearley <=> lark converter would be amazing — it would allow people to convert easily between JavaScript and Python parsers! Even if postprocessors get stripped, it would be incredibly useful to a lot of people. Please do let me know if you embark on this project. :D

(P.S. The reason nearley suddenly became kind of popular is that someone recently wrote a Medium post that became popular on social media.)

erezsh commented 7 years ago

I'm definitely putting it on my TODO list!

To that end, I have a small question about your method of tokenization. I noticed that in csscolor.ne, you have: hexdigit -> [a-fA-F0-9]

But you also "import" the following: unsigned_int -> [0-9]:+

These two regular expressions collide, and if one is chosen over the other it could cause a parse error. I can think of a few possible solutions for this, but they all seem either very complex or too slow. How do you handle this issue?

kach commented 7 years ago

nearley doesn't tokenize for you. It either parses a stream of characters "raw" (as in the csscolor example above), or it expects input to be already-tokenized. Does that answer your question?

erezsh commented 7 years ago

So in the case of hexdigit -> [a-fA-F0-9]

Nearley treats it like this? hexdigit -> "a"|"b"|"c"|...|"8"|"9"

kach commented 7 years ago

Yes, it's equivalent to that.

In reality, [a-fA-F0-9] is like a special "character" that is equal to "a" and "b" and "c", etc. Internally, nearley stores the possible terminals as either

  1. An object with a .test property, where xxx.test(tok) tells you whether tok matches (regexes fit this, because in JavaScript regexes have a .test property)
  2. Something that you can compare with === (in JavaScript, characters (i.e. single-letter strings) fit this)

So what happens is that nearley creates a terminal [a-fA-F0-9] which is a JavaScript regex, and when testing input it notices that the regex has a .test method and so tries to match incoming characters against the regex.
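That dispatch can be sketched as follows; this is a simplified illustration of the two cases, not nearley's exact code:

```javascript
// A terminal is either something with a .test method (regexes qualify) or a
// plain value compared with ===, mirroring cases 1 and 2 above.
function matchesTerminal(terminal, token) {
  if (terminal && typeof terminal.test === "function") {
    return terminal.test(token); // e.g. /[a-fA-F0-9]/ against one character
  }
  return terminal === token;     // e.g. the literal character ";"
}
```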

The code that does this, by the way, is here: https://github.com/Hardmath123/nearley/blob/master/lib/nearley.js#L45

erezsh commented 7 years ago

Nearley->Lark is done! I tested it on a few of the grammars in nearley/test and also on csscolor.ne, and it works great for a first version :) You can see it here: https://github.com/erezsh/lark/blob/master/tools/nearley.py The test() method shows how to use the resulting grammar.

I might do a Lark->Nearley at some point. The main challenge is splitting up those regular expressions into 1-char chunks lol

kach commented 7 years ago

Interesting that you wrote it in Python… I would totally have written it in JS, by hacking on top of the existing compiler.

But… very cool! I'm going to put a link to this in the README. :D

erezsh commented 7 years ago

It should be pretty easy to write a javascript version. I chose to use lark because I'm hoping to use Js2Py to convert the functions in the nearley grammars to python. ( https://github.com/PiotrDabkowski/Js2Py ) Out of curiosity, is there something similar to this in Javascript?

Btw, I chose to move the tools into the lark directory, so you can install them with pip:

$ pip install lark

And then do something like:

$ python -m lark.tools.nearley ~/nearley/ < ~/nearley/builtin/number.ne

So if you can fix the link to point to https://github.com/erezsh/lark/blob/master/lark/tools/nearley.py I would really appreciate it. Cheers for the link, and sorry for the mixup!

erezsh commented 7 years ago

Oh, I see the link is simply to Lark. Nevermind then :) I'll add instructions on how to use the converter soon!

kach commented 7 years ago

Oh, wow, the Js2Py idea is a very clever way to handle postprocessors. Sadly, I don't know of any JS-to-Python converters written in JavaScript. Maybe it can be a two-stage process, where you first convert the grammar and then convert the postprocessors separately? Good luck!

erezsh commented 7 years ago

Well, it took longer than I expected, but it's working! https://github.com/erezsh/lark#how-to-use-nearley-grammars-in-lark

It's still experimental, but functional. The only missing feature that I know of is macros.

kach commented 7 years ago

Oh, goodness, that's fantastic! Thank you, Erez.

Yes, nearley macros are certainly a bit of a hack. I'm sorry about that! :-) I suppose one solution could have been to operate on compiled .ne.js files — and hence have nearleyc handle macro expansion/importing/EBNF modifiers — but of course, depending on your architecture, at this point it might be too much trouble for you to make the switch. In any case, it shouldn't be too hard to figure out the macros, and I would be happy to help if you run into specific issues.

erezsh commented 7 years ago

I'm glad you think so! :) I wanted to write a Lark->Nearley converter too, but that would require lexer support in nearley. Perhaps I could do the lexing outside nearley. Nearley should support parsing of non-string sequences, right?

Regarding the macros, we now have two ideas. 1) Use the compiled grammar. I like this, because a smaller interface is a better interface. I won't have to implement macros, and future features in nearley probably won't break my tool either. But that means I'll have to parse JS, right? So any changes to the .ne.js file will break the tool (and I don't think you want to commit to freezing it at this point). Any chance I can get the output in some "standard protocol", similar to the .ne files?

2) Alternatively, I'm planning to implement macros for lark anyway. So when I do that, I can convert your macros to mine (and hope they're compatible).

I would prefer option 1, if there was an elegant way to do it.

kach commented 7 years ago

Hmm, yes, I forgot that the output of nearleyc is JS, not JSON. Perhaps the easiest way would be a quick 5-line nodejs frontend to nearley->lark that converted it into a Python-readable format? (One of the unsaid goals of nearley at some point was to have some sort of "universal" — or at least portable — format to express grammars… guess that didn't work out! :P)
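Such a frontend might look something like the sketch below. It is hypothetical and rests on assumptions: the compiled module is assumed to export `{ ParserStart, ParserRules: [{ name, symbols }] }`, and function-valued postprocessors are simply dropped, since functions can't be serialized to JSON anyway.

```javascript
// Hypothetical converter: take a compiled grammar object (what a nearleyc
// output module is assumed to export) and emit a JSON string Python can read.
// Regex terminals are replaced by their source text; postprocessors are dropped.
function grammarToJson(grammar) {
  const rules = grammar.ParserRules.map(r => ({
    name: r.name,
    symbols: r.symbols.map(s =>
      s instanceof RegExp ? { regex: s.source } : s),
  }));
  return JSON.stringify({ start: grammar.ParserStart, rules });
}
```

A node wrapper would then just `require()` the compiled .ne.js file and print `grammarToJson(...)` to stdout for the Python side to consume.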

As for lexers for lark->nearley, you should talk to @tjvr about that! We're working on (i.e. done with, but kind of too scared to merge) lexer support. :-)

erezsh commented 7 years ago

@tjvr How's the lexer coming along?

tjvr commented 7 years ago

@erezsh Nearly there! Have a look at #220 and https://github.com/tjvr/moo :-)

erezsh commented 7 years ago

@tjvr Looks nice. You can find line-breakers automatically, simply by looking for newline characters, or the dot-matches-newline flag. You might get false-positives but they will be rare, and only slightly affect performance anyway.

Another thing you should keep in mind, if you haven't already, is that the ordering of regexps, by itself, is not enough to solve all the common requirements of a lexer. For example, try to make your lexer match "begin" as a keyword but "beginner" as a variable name (\w+). The easy way out is to use callbacks and let the user deal with it. But in Lark I used another method that solves it automatically for the common case. Let me know if you're interested and I'll explain it.
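The common fix can be sketched like this (a hypothetical illustration of the approach, not Lark's or moo's actual code): let the longer identifier pattern win first, then reclassify exact keyword hits afterwards.

```javascript
// Match \w+ as a single longest run, so "beginner" never splits into
// "begin" + "ner"; then check each matched word against a keyword set.
const KEYWORDS = new Set(["begin", "end"]);

function tokenize(src) {
  const out = [];
  for (const m of src.matchAll(/\w+|\S/g)) { // words, or lone punctuation chars
    const text = m[0];
    if (/^\w+$/.test(text)) {
      out.push({ type: KEYWORDS.has(text) ? "keyword" : "name", value: text });
    } else {
      out.push({ type: "op", value: text });
    }
  }
  return out;
}
```

On `"begin beginner;"` this yields a keyword, a name, and an operator, rather than incorrectly splitting "beginner".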

tjvr commented 7 years ago

@erezsh Yeah—moo already handles keywords, see https://github.com/tjvr/moo#keywords :-)

erezsh commented 7 years ago

@tjvr Great, this is exactly my solution too :) But I automatically identify keywords, because they are always just strings without any special characters.

GoMino commented 4 years ago

Hello Guys, thanks for your libraries! @erezsh any chance you have finished your lark -> nearley converter ?

erezsh commented 4 years ago

@GoMino Sorry, I didn't get around to writing it. What were you hoping to use it for?

FWIW, there's already a lot of infrastructure in place to make such a tool easy to write (i.e. Lark's internal state already serializes into a convenient json representation)