patrickhuber / Pliant

MIT License
26 stars 4 forks source link

Request for feature: support of except-symbol as in part 4 of ISO EBNF syntax standard #75

Open ArsenShnurkov opened 7 years ago

ArsenShnurkov commented 7 years ago

Here is the grammar which I want to use:

a_file := v_file | o_file ;
o_file := file - v_file ;
v_file := { a_directive } ;
file := { directive } ;

a_directive := v_directive | o_directive ;
o_directive := directive - v_directive ;
v_directive := "00" ;
directive := "00" | "11" ;

I don't know how to rewrite this grammar for Pliant (with cosmetic-only changes), because "-" operation in Eto.Parse is not the same as in ISO EBNF. In Eto.Parse the "-" operation have some different semantic. Rule matches left operand greedely, but resulting match should not contain the right operand.

It is also not the same as negative or positive lookahead of Perl regexp's.

I propose to define and describe the own syntax of Pliant and add there an "^" operation which will have this semantic (this way it will be possible to use "-" as in ISO EBNF).

patrickhuber commented 7 years ago

The Early algorithm will probably parse that grammar just fine without using presedence parsing rules. If you look at the parse forest that comes out of the GetParseForestRoot method call, it contains And nodes that represent alternative parses.

Try to use the grammar as is, but treat - as concatinations and see if the parse forest contains your parse.

You will need to look at the parse forest, if you try to turn the parse forest directly into a parse tree, the disambiguation algorithm will only select the first alternative at each forest And node.

ArsenShnurkov commented 7 years ago

concatenation will work like positive lookahead. negative lookahead for regex can be represented as positive lookahead. Positive lookahead for regex can be implemented with NFA/DFA. CFG + DFA = CFG. But both lookaheads are different from the operation I am asking. After some 2 days discussion in #marpa i now think that CYK algorithm is required for this instead of Earley.

patrickhuber commented 7 years ago

Ok, read the forum and have a better idea of what you are trying to do.

I'll have to think about this a bit and see how eto.parse implements it. Also need to read up on cyk.

ArsenShnurkov commented 7 years ago

https://en.wikipedia.org/wiki/Boolean_grammar https://en.wikipedia.org/wiki/Conjunctive_grammar 2001-04, Okhotin, Conjunctive grammars http://dl.acm.org/citation.cfm?id=543323

Boolean grammars extend the context-free grammars, with conjunction and negation operations. http://users.utu.fi/aleokh/papers/conj_bool_programming.pdf http://users.utu.fi/aleokh/papers/boolean_survey.pdf He has also implementation of parser in C++ - http://users.utu.fi/aleokh/whalecalf/

ArsenShnurkov commented 6 years ago

https://stackoverflow.com/questions/38544032/negation-in-thompsons-algorithm

https://en.wikipedia.org/wiki/Brzozowski_derivative http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.4378&rep=rep1&type=pdf

https://cs164fa09.pbworks.com/f/midterm1_solutions.pdf

a generalized regular expression denotes a possibly infinite set of finite-length strings of symbols from A. It may be built of:

∅ (denoting the empty set of strings), ε (denoting the singleton set containing just the empty string), a symbol a from A (denoting the singleton set containing the single-symbol string a), R∨S (where R and S are, in turn, generalized regular expressions; denoting their set's union), R∧S (denoting the intersection of R 's and S 's set), ¬R (denoting the complement of R 's set with respect to the set of all strings of symbols from A), RS (denoting the set of all possible concatenations of strings from R 's and S 's set), R* (denoting the set of n-fold repetitions of strings from R 's set, for any n≥0, including the empty > > string).

In an ordinary regular expression, neither ∧ nor ¬ is allowed.

So, I want 2 things: 1) support of '-' operation "R - S" in regular expressions part (as "R ∧ ¬S"); 2) support of except-symbol as in part 4 of ISO EBNF syntax standard in context-free part:

2 The normal character representing each operator of Extended BNF and its implied precedence is (highest precedence at the top): * repetition-symbol - except-symbol , concatenate-symbol | definition-separator-symbol = defining-symbol ; terminator-symbol

ArsenShnurkov commented 6 years ago

2017, Mirzakhmet Syzdykov, Deterministic automata for extended regular expressions

algorithms to produce deterministic finite automaton (DFA) for extended operators in regular expressions like intersection, subtraction and complement

ArsenShnurkov commented 6 years ago

two libraries for combining automatas: https://github.com/moodmosaic/Fare https://github.com/AutomataDotNet/Automata

ghost commented 3 years ago

Dear Arsen Shnurkov

Repository: https://github.com/mirzakhmets/regexplus

CodeProject article: https://www.codeproject.com/Tips/5280511/Regexplus-Extended-Regular-Expression-Engine

Your subtraction operator is present, along with intersection and complement

Regards, Mirzakhmet Syzdykov