hedyorg / hedy

Hedy is a gradual programming language to teach children programming. Gradual languages use different language levels, where each level adds new concepts and syntactic complexity. At the end of the Hedy level sequence, kids master a subset of syntactically valid Python.
https://www.hedy.org
European Union Public License 1.2

Explore fuzzy parsing #2048

Closed. Felienne closed this issue 5 months ago.

Felienne commented 2 years ago

When we have merged #1314, shall we try to use grammar composition to merge levels instead of the hand-written merging? I am not 100% sure it will work, maybe we will get a bit of ambiguity but it will clean up the code a lot!

Ideally, we might also be able to automatically mix in rules from lower levels with a prefix, to catch level n-1 rules at level n so we can give better error messages (like we do manually now with _dep rules). Do you feel like diving into this @boryanagoncharenko?

boryanagoncharenko commented 2 years ago

Objectives

Current state

Lark's features

Lark offers several features that are relevant to the task at hand. It allows importing the rules and terminals of a .lark file into another grammar via the %import directive. Note that all imported rules must be specified explicitly, and they are imported into a new namespace to avoid collisions. The %override and %extend directives can be used to adapt the imported grammar rules. Lark also offers merge_transformers (https://lark-parser.readthedocs.io/en/latest/visitors.html?highlight=merge_transformers#merge-transformers), which merges transformers, dispatching each one to its respective namespace.

How could we use Lark's features to reach the objectives

We could remove the manual grammar merging by importing the grammar of level n-1 in the level n .lark file. However, there are multiple things to note:

We could automatically detect if a user is using a command from a previous level that is no longer part of the current level. The easiest way to achieve this is to ensure that the grammar of every level includes the grammars of the previous levels and uses them with less priority than the grammar of the current level. For example:

start: program
program: _EOL* (command _EOL+)* command?
command: level2__command | level1__command | error__command

%import .level1.command -> level1__command
%import .level2.command -> level2__command
%import .error.command -> error__command

In this way, the echo command (which is a level 1 command that does not exist in level 2) will appear as a level1__command in the parse tree.

A sample project containing 3 highly simplified Hedy levels is available here: https://github.com/boryanagoncharenko/grammar-composition-test

boryanagoncharenko commented 2 years ago

A couple of additional notes:

boryanagoncharenko commented 2 years ago

Update from a meeting with Felienne:

boryanagoncharenko commented 6 months ago

This task was redefined to provide better error messages when users mix syntax from different levels. Currently, to provide meaningful error messages, we define possible mistakes in the grammar and then return the correct error message in the transpiler. For example, in level 4 we specifically create grammar rules to handle a print command without quotes or with only one quote. We follow the same strategy for all common errors, e.g. the grammar has definitions for a repeat command without the keyword times, without the following command, and without a print command.

Can we use Lark's grammar composition to create grammars that detect that syntax from a different level is used? Definitely not. The current grammar merging of Hedy has evolved to a point where using the composition feature is impractical. It is better to extend the custom grammar merging to match syntax from previous levels.

Can we extend the custom grammar merging to automatically detect that syntax from a different level is used? Theoretically, yes. For every command that is defined in level N, we could add its definition from level N-1 with lower priority. This approach would only work at the command level. In practice, injecting an old version of a rule into the grammar might lead to grammar ambiguities or incorrect rule priorities, which would be hard to debug. Because the new approach would need to coexist with the other error definitions, different commands might require injecting in different places (e.g. if_less_command), and we might end up with over-fitted code that would be better defined manually in the grammars.

If we get the grammar merging sorted out, the transpiler could provide a common message when wrong-level syntax is encountered:

import inspect

def default_func(orig_level, self, args, meta):
    raise Exception(f'This code works for level {orig_level} but you are in level {self.level} now.')

def hedy_transpiler(level):
    def decorator(c):
        TRANSPILER_LOOKUP[level] = c

        # collect the public transpiler functions defined for this level
        current = inspect.getmembers(c, predicate=inspect.isfunction)
        funcs = [v[0] for v in current
                 if v[0][0] != '_'
                 and (v[1].__qualname__.startswith(f'ConvertToPython_{level}')
                      or v[1].__qualname__.startswith('source_map_rule'))]
        TRANSPILER_FUNCTIONS[level] = funcs
        if level > 1:
            i = level - 1 if level - 1 in TRANSPILER_FUNCTIONS else level - 2
            names = TRANSPILER_FUNCTIONS[level]
            # inject a fallback for each previous-level command that has no
            # explicit level{i}_ handler, so it raises a wrong-level error
            to_add = [f for f in TRANSPILER_FUNCTIONS[i] if f in names and f'level{i}_{f}' not in names]
            for f in to_add:
                name = f'level{i}_{f}'
                setattr(c, name, lambda self, args, meta: default_func(i, self, args, meta))

        c.level = level
        return c

    return decorator

If we want to provide a custom error message, we would be able to override a rule as follows:

class ConvertToPython_8_9(ConvertToPython_7):
    def level7_repeat(self, meta, args):
        raise hedy.exceptions.WrongLevelException(7, 'repeat', "repeat_dep", meta.line)

In a nutshell, altering the grammar merging together with the transpiler code illustrated above adds a significant layer of complexity while the benefits remain limited.

New problem definition

Given the above, it would be better to redefine the problem once more in the following way. In an ideal world:

  1. We would have grammar definitions which define only the correct syntax. This means that we would not describe in the grammars specific syntax errors, e.g. a syntax from a previous level, a repeat command missing times keyword, or a print command being misspelled as pront.
  2. We would have a way of correcting erroneous Hedy programs. A corrected program is a program that does not produce an error and is most similar to the input program.

Ideas

Obviously, the problem definition is very ambitious (and also needs to be refined). I have not investigated these options, so I do not know whether they are even possible.

  1. Probably a wild idea, but could we use fuzzy parsing? Instead of creating a parse tree only when the input exactly matches the grammar and producing a parse error otherwise, the parser could return parse trees together with their degree of correctness. Fixing errors when syntax from previous levels is used might still require grammar merging, but that is complexity we already maintain.
  2. Alternatively, (as suggested by @TiBiBa) we could have a database of correct programs and could try to match an erroneous program with the correct items in it. If the database contains enough data and we find a good way to normalize the programs (e.g. so that variable names do not matter), we could try to correct the program by finding the most similar entry in the database. If the input is an unparsable program (which I assume will be the most frequent case), we might need to compare the programs in terms of text (as opposed to a parsed tree).
  3. During the meeting we also touched the idea of creating and training a model that corrects a Hedy program. My preliminary research shows that the problem at hand would not be solved by the readily available algorithms. To follow this route, we would need expertise and financing that we currently lack.
boryanagoncharenko commented 5 months ago

I set off to explore fuzzy parsing, which is mentioned in point 1 of the previous comment. My concern was that if such a tool exists, it would not be in a production-ready state. So I employed a rather practical strategy: search only for tools that have a readme file and inspect those, time-boxed to 2 hours. In short, I was not able to find any tool that achieves the goal and has any documentation. However, during my search I noticed a couple of things:

Research papers define fuzzy parsers as syntax analyzers that perform analysis on incomplete or incorrect input, so such tools probably exist. However, I did not encounter any, probably due to the search strategy I used.

As discussed with Felienne, I am closing the ticket in the hope that these notes might be useful for someone in the future.