kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.63k stars 231 forks source link

Is full regex supported? #51

Closed matthew-dean closed 9 years ago

matthew-dean commented 9 years ago

Is this supported? foo -> [0-9A-F?]{1,6}

I'm getting a "no possible parsings" error when generating the grammar.js file.

kach commented 9 years ago

No, full regex is not supported. The regex-like syntax is only for defining charsets. In addition, you can use certain EBNF syntactic sugars (kleene star, plus, and question mark operators). These are explained in the README file (they look like token:*).

matthew-dean commented 9 years ago

Ah, ok. So, for that case, would I need to write something like...

chars -> [0-9A-F?]
foo -> chars | chars chars chars chars chars chars

?

YafahEdelman commented 9 years ago

Yeah, you'd need that, although it would be easy to add "{}". We could even add full regexes and implement them using js's underlying regex engine. Optionally we could implement more regex features ourselves.

kach commented 9 years ago

Yeah. I'm not too interested in adding {}, because it doesn't have as many use-cases and in most cases it can be replicated manually (since its arguments are usually not too large).

I'm not sure I grok what Jacob means by using JS's underlying regex engine, but it's worth realizing that nearley is character-based, so you cannot use an arbitrary multicharacter regex as a terminal.

matthew-dean commented 9 years ago

Well, it seems like if you have :* and :+ then specifying the number of repeats in order to match seems logical. But it's an okay workaround.

tjvr commented 9 years ago

We could even add full regexes and implement them using js's underlying regex engine.

You can do that, sort-of (I wrote my own js earley parser). However, you can't have both that and incremental parsing with feed().

YafahEdelman commented 9 years ago

I'm not sure I see why not? At the very simplest we could add some sugar for turning regexes into tokens that accept anything but with postprocessors that fail if the input does not match the regex (Postprocessors have the ability to declare a parsing invalid in Nearley).

tjvr commented 9 years ago

@hardmath123 is wrong: you can have multicharacter regexes in a character-based earley parser: they would sit somewhere between non-terminal productions and terminal characters. But it can't be a stream-based parser.

My parser took a string as input, instead of a character stream, so I could execute the RegExp against the remainder of the string. I stored the length of the match (if any), and then special-cased State.advance() to skip over each character in the match. (Which resulted in "fake" in-between States, if you like.)

If your input is a stream, you don't have the complete remaining input, and therefore can't do this hack. So you can probably implement better or even complete regex support into nearley by compiling it down to BNF rules; but you can't do it using JS's regex engine.

I don't know about postprocessors, so I can't comment on your suggestion. But it sounds like your suggesting adding a rule like .*, which sounds to me like it might be quadratic...

rwindelz commented 9 years ago

@Hardmath123 is right. An earley state is defined as being at a terminal token position in the input stream. Think about what happens when a multi character regex consumes the next 'n' terminals . . . where did the subsequent earley states go?

tjvr commented 9 years ago

where did the subsequent earley states go?

You fake them, as described above. It all works!

rwindelz commented 9 years ago

Regular grammars are a proper subset of the context-free gramars parsed by Earley so every regex is trivially expressible in a (N)Earley grammar. The faking as you describe it is an unnecessary complication in the core algorithm. Instead of having a single 'next' state comprised of the next terminal match transitions (the SCANs) you now have a set of up to 'n' future states where 'n' is the length of the regex match. Of course the 'terminals' of the grammar don't have to be characters. You can implement a lexer (by way of regex) on a character stream to generate the terminals fed to Nearley, however that is different from what you've described.

kach commented 9 years ago

See #57.