jsinger67 / parol

LL(k) and LALR(1) parser generator for Rust
https://jsinger67.github.io/
Apache License 2.0
184 stars 18 forks source link

Use the fancy_regex crate instead of regex #191

Closed ethindp closed 1 year ago

ethindp commented 1 year ago

Would it be possible to use the fancy_regex crate over regex? It adds features like lookbehind/lookahead, which a grammar I'm writing needs (I need to ensure an identifier is (not) a keyword). It also supports character classes.

jsinger67 commented 1 year ago

Hi @ethindp, Thanks for your interest in parol. Actually parol uses the crate regex-automata for the realization of it's lexers. Anyway it isn't easy to change the used regex crate for several reasons. One of them is runtime performance. But maybe I can help you with some useful ideas to solve your tasks. To ensure that an identifier is not a keyword you have to define all keywords before the terminal of your identifier. This prioritizes the detection of keywords. Character classes are supported by regex-automata out of the box. Please let me know if I can help you any further.

ethindp commented 1 year ago

@jsinger67 Thank you for that, I didn't know that regex-automata allowed that already. I completely understand the performance angle. I'm looking forward to giving Parol a try to see if it can handle (or help me help it handle) the particular grammar I have (which is quite complicated).

jsinger67 commented 1 year ago

Parol is definitely worth a try 😄 And I can give you support. Also the book should be a helpful reference.

ethindp commented 1 year ago

Thanks! (Parol is attempting to comulcate k for my grammar.... It's taking a while to do that so I honestly am wondering if K might be infinite... I hope not but if so, I might need to left-factor.) The grammar I'm working with is ambiguous (I'm working on writing an Ada compiler) but getting it to parse would be great!

jsinger67 commented 1 year ago

Try to start with small pieces. To commission a large grammar can be time consuming. Try to minimize k as much as possible. Left factoring is done by parol. For this please have a look at the generated exp.par file.

ethindp commented 1 year ago

@jsinger67 You aren't the only one to tell me to do this, and I'm trying to do that. But the grammar (mainly the name rule) is ambiguous and left-recursive, and almost everything (minus a few constructs) depends on that rule. I wonder if left-factoring can break that cycle.

jsinger67 commented 1 year ago

One good technique is to comment out some less important alternatives of a certain non-terminal near the start symbol of your grammar. This will temporarily cut off a great portion of your grammar.

I can offer you to have a look at your grammar.

ethindp commented 1 year ago

Any help would be appreciated. The majority of left-recursive rules are indirectly left-recursive; the rule they all have in common is name. (I've tried to refactor this into a name_tail rule already.) I've attached the grammar. The name rule is on line 315. alang.txt

jsinger67 commented 1 year ago

I will have a look at this soon. First impression: Lots of productions with "name" on their RHS. This is actually redundant and results in larger equation systems parol has to solve.

ethindp commented 1 year ago

@jsinger67 I know, I'm planning on eliminating those later once I get this to work. The name rule is a tangled mess.

ethindp commented 1 year ago

At least, the ones I can (e.g. reference_object_name). Might also be able to take care of prefix too.

jsinger67 commented 1 year ago

@ethindp, there are about twenty recursive non-terminals in this grammar. I have to convert them manually into a non-recursive form. This will take a while but should be a good practice for me 😉. Or, even better, I will provide support for this kind of problem somehow in parol itself. The transformation of a traditional left-recursive grammar into a right-recursive one seems to be an underestimated use case.

jsinger67 commented 1 year ago

I removed all left-recursions successfully. To convert indirect left-recursions into direct ones I had to "inline" some productions so that they don't exist as standalone ones anymore. I commented these changes thoroughly. Anyway, the grammar still has some flaws. Parol decries some unreachable non-terminals which might be a problem. Please note that I renamed the start symbol from “compilation” to “ada”. You can, of course, revert that. Feedback appreciated. Have fun. 😎

ada.txt

ethindp commented 1 year ago

Thanks! I'll look at the changes -- I'll definitely look at modifying the grammar further to get it to stop complaining. (It'd be nice if LL(k) supported direct/indirect left recursion, but from what I understand this inherently isn't possible.)

jsinger67 commented 1 year ago

You're right. LL parsers can't handle left recursions. To remove them automatically is possible but can result in major redrafts of the original grammar. I'll definitely think this over and try to provide more support for users here.

ethindp commented 1 year ago

If it wouldn't take a significant amount of time, could you do it during code generation? (Also, another alternative that could be used to disambiguate things, if necessary, would be something like an interception function that you define that's called on rules you know are ambiguous, which would allow you to influence which path is taken based on arbitrary lookahead; Coco/R does this, and it's quite neat.)

jsinger67 commented 1 year ago

@ethindp I tried hard to minimize your grammar. Especially, I did some aggressive inlining of productions if they were in the form of A: B; Then I substituted all occurrences of A by B and removed that production. I did this recursively and finally there showed up some ambiguous productions in the form of X: Y | Y | Y | Z; which I minimized to X: Y | Z; At some final state I had an expanded grammar with still 1435 productions. The equation systems to solve for FOLLOW(k) then still contained 2157 equations and the equation system for FIRST(k) had a size of 1436 equations. This is much but normally not a problem if k is limited. For instance, the grammar of veryl has equation system sizes of 1161 resp. 813, but k is 3 and thus the grammar automaton is completely computable.

The computation of your grammar for a limit of k <= 3 finished after about ten minutes on my machine with no solution found. Maybe some very powerful computer could calculate a solution with k in the range of 5..10.

To make a long story short. I think, parol is not able to handle such grammars effectively at this stage of its development. The top-down parsing strategy it uses needs a grammar with finite (and not too high) value of lookahead. I'm sorry I don't have better news.

ethindp commented 1 year ago

@jsinger67 It's alright, I completely understand. I've been struggling to find a good way of implementing this grammar for a while now. I've thought about using parser combinators, but the way Nom and such work don't exactly give me confidence in my ability to translate the grammar into a form that Nom/combine/pom would like.