kirienko / gourmet

Gourmet Recipe Manager
GNU General Public License v2.0

Replace regex parser with context free grammar parser #80

Open maweki opened 4 years ago

maweki commented 4 years ago

If you've looked at the offending code behind #79, you'll see that the amalgamation of regexes is an unmaintainable mess. In particular, the string replacement inside regex strings and the other combination mechanisms seem dangerous.

Now, compiled Python regexes cannot be combined using the usual regular-language operations, and I don't know of a performant regex engine that allows this.

A maintainable alternative would be a tokenizer/parser combination from the realm of context-free grammars. In my experience, these can be combined more easily and are therefore more maintainable. The result is a tree instead of a list, which is no problem: some tree nodes we would replace with data constructors (like building decimals from a comma separator and two number nodes), and other parts of the tree would be flattened into a list, essentially as it is now.

I also have more confidence in me/us writing and reading simple regexes for tokens plus a context-free grammar than whole regexes for complex expressions that span multiple lines and are later modified by string-processing functions.

In the longer run we could make Chomsky proud and even supply some general CFGs for the supported languages, identifying verbs, adjectives, and nouns through the grammar and giving us more confidence in extracting recipe steps.

I know it's not a priority but is this something we would put on the agenda?

kirienko commented 4 years ago

I cannot claim that CFGs are more maintainable than regexes (the latter may indeed look messy, but intrinsically they are finite automata and therefore unambiguous). But being a theoretical physicist, I of course support all kinds of mathematical exercises and perversions :muscle:

But could you give an example of a really ugly regex? Maybe we can fix it without going into higher dimensions.

maweki commented 4 years ago

but intrinsically they are finite automata and therefore unambiguous

I think we value regular languages for their composability (closure under (almost?) all operations). While this is true for regular languages, composability breaks when introducing backreferences and unnamed capture groups. This is why basically no regex library I have seen allows you to combine compiled regexes.
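A minimal demonstration of the breakage (the patterns here are mine, invented for illustration, not taken from the Gourmet code):

```python
import re

# Two patterns, each valid on its own: \1 backreferences the pattern's
# own first group.
doubled_letter = r"(\w)\1"   # matches "oo", "ll", ...
doubled_digit = r"(\d)\1"    # matches "11", "22", ...

assert re.search(doubled_letter, "look")   # finds "oo"

# Naive composition with an outer group renumbers everything: inside
# "((\w)\1|(\d)\1)" each \1 now points at the *outer* group, which is
# still open at that point, so compilation fails outright.
try:
    re.compile("(%s|%s)" % (doubled_letter, doubled_digit))
except re.error as exc:
    print("composition failed:", exc)
```

With named groups the situation is no better: two sub-patterns using the same group name cannot coexist in one compiled pattern at all.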

First of all, I want you to take note of this Stack Overflow question, "How to combine multiple regex into single one in python?": https://stackoverflow.com/questions/42136040/how-to-combine-multiple-regex-into-single-one-in-python

The answer boils down to `re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4)).findall(sentence)`, which is barely legible.
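Demonstrated with stand-in sub-patterns (the actual re1..re4 in the linked answer are the asker's own and are not reproduced here), the technique flattens everything into one undifferentiated list of matches:

```python
import re

# Stand-in sub-patterns for illustration only.
re1, re2, re3, re4 = r"\d+", r"[a-z]+", r"[A-Z]+", r"\s+"

combined = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4))
tokens = combined.findall("Add 2 cups flour")
# -> ['A', 'dd', ' ', '2', ' ', 'cups', ' ', 'flour']
```

Note that the result does not even say which alternative produced which match, and the splicing only works as long as none of the sub-patterns contains its own numbered groups.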

But could you give an example of a really ugly regex?

Again, I am more focused on the composability of a grammar that yields a parse tree, which we can then analyze more easily.

Example: https://github.com/kirienko/gourmet/blob/5d88f0a9373185b262acb237c9ece528b7c218ca/gourmet/convert.py#L748-L765

NUMBER_REGEXP = "[\d"
for k in list(UNICODE_FRACTIONS.keys()): NUMBER_REGEXP+=k
NUMBER_START_REGEXP = NUMBER_REGEXP + ']'
NUMBER_END_NO_RANGE_REGEXP = NUMBER_START_REGEXP # Is this right -- quick fix here.
NUMBER_MID_REGEXP = NUMBER_REGEXP + ',.' + SLASH
NUMBER_MID_NO_RANGE_REGEXP = NUMBER_MID_REGEXP + " /]"
NUMBER_MID_REGEXP += " /-"
NUMBER_MID_REGEXP += "]"
NUMBER_END_REGEXP = NUMBER_START_REGEXP
NUMBER_REGEXP = "("+NUMBER_START_REGEXP+"*"+NUMBER_MID_REGEXP+"*"+NUMBER_END_REGEXP
if NUMBER_WORD_REGEXP:
     NUMBER_REGEXP = NUMBER_REGEXP + '|' + NUMBER_WORD_REGEXP + ')'
     NUMBER_NO_RANGE_REGEXP = '(' + NUMBER_START_REGEXP + '+|' + NUMBER_WORD_REGEXP + ')'
else:
    NUMBER_REGEXP = NUMBER_REGEXP + ")"
    NUMBER_NO_RANGE_REGEXP = NUMBER_START_REGEXP + '+'
NUMBER_MATCHER = re.compile("^%s$"%NUMBER_REGEXP,re.UNICODE)

I would accept it if it were only concatenation of strings (even though that also breaks in the aforementioned cases), but the sprinkled-in control characters are unnerving. As an alternative we could wrap the strings in some regex-compose DSL, but having something like `NUMBER_START_REGEXP + '+|' + NUMBER_WORD_REGEXP`, with the `+ '+|' +` in the middle, is surely not what we want.
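For what it's worth, a toy sketch of such a compose DSL: tiny helpers that keep the control characters in one place. All names are invented and the fraction characters are stand-ins for `UNICODE_FRACTIONS.keys()`; this is an illustration, not a concrete proposal:

```python
import re

def char_class(*parts):
    """Build a character class from its parts."""
    return "[%s]" % "".join(parts)

def either(*alternatives):
    """Non-capturing alternation, safe to nest."""
    return "(?:%s)" % "|".join(alternatives)

def one_or_more(pattern):
    return either(pattern) + "+"

DIGIT = r"\d"
UNICODE_HALF = "\u00bd"  # stand-in for the keys of UNICODE_FRACTIONS

number_start = char_class(DIGIT, UNICODE_HALF)
NUMBER_MATCHER = re.compile("^%s$" % one_or_more(number_start), re.UNICODE)

assert NUMBER_MATCHER.match("1½")
assert not NUMBER_MATCHER.match("1 / 2")
```

Whether something like this beats going straight to a grammar is debatable; it mostly hides the quoting, not the complexity.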

kirienko commented 4 years ago

I have to confess that I underestimated the severity of the problem. I agree that the current state of the code is far from maintainable. Maybe we can do better with a CFG; I have close to zero practical experience with that and cannot judge. But I can also imagine that there is a ready-to-use Python solution for this somewhere, which might let us eliminate the complexity from our code. We don't need to do everything ourselves.

maweki commented 4 years ago

I am not advocating writing my own parser. But converting this regex kerfuffle into a context-free grammar should be straightforward with the proper library.