Closed Felienne closed 5 months ago
Objectives
Current state
Lark's features Lark offers features that are relevant for the task at hand. It allows importing rules and terminals of a .lark file into another grammar via the %import directive. Note that all imported rules must be explicitly specified and they will be imported in a new namespace to avoid collisions. The %override and %extend directives can be used to adapt the imported grammar rules. Lark also offers a merge_transformers (https://lark-parser.readthedocs.io/en/latest/visitors.html?highlight=merge_transformers#merge-transformers) functionality which merges each transformer with a respective namespace.
How could we use Lark's features to reach the objectives We could remove the manual grammar merging by importing the the grammar of level n-1 in the level n .lark file. However, there are multiple things to note:
We could automatically detect if a user is using a command from a previous level that is no longer part of the current level. The easiest way to achieve this is to ensure that the grammar of every level includes the grammars of the previous levels and uses them with less priority than the grammar of the current level. For example:
start: program
program: _EOL* (command _EOL+)* command?
command: level2__command | level1__command | error__command
%import .level1.command -> level1__command
%import .level2.command -> level2__command
%import .error.command -> error__command
In this way, the echo command (which is a level 1 command that does not exist in level 2) will appear as a level1__command the parse tree.
A sample project containing 3 highly simplified Hedy levels is available here: https://github.com/boryanagoncharenko/grammar-composition-test
A couple of additional notes:
print this is a test
at level 4 (which requires quotes) would parse to a level 3 print command. The issue is not with the print command itself but with one of its children: it should be quoted text instead of text. Providing a good error message might be hard because we will need to inspect the print tree to figure out where exactly lies the problem.Update from a meeting with Felienne:
This task was redefined to provide better error messages in the case when users mix syntax from different levels. Currently, to provide meaningful error messages, we define possible mistakes in the grammar and then return the correct error messages in the transpiler. For example, in level 4 we specifically create grammar rules to handle the cases of print
command without quotes or with one quote only. We follow the same strategy for all common errors, e.g. the grammar has definitions for a repeat
command without the keyword times
, without the following command, and without a print
command.
Can we use Lark's grammar composition to create grammars that detect that syntax from a different level is used? Definitely no. The current grammar merging of Hedy has evolved to a point where using the composition feature is impractical. It is better to extend the custom grammar merging to match syntax from previous levels.
Can we extend the custom grammar merging to automatically detect that syntax from a different level is used?
Theoretically, yes. For every command that is defined in level N, we could add its definition from level N-1 with lower priority. This approach would only work on command level. In practice, injecting an old version of a rule into the grammar might lead to grammar ambiguities or incorrect priority of rules, which would be hard to debug. Because the new approach would need to coexist with other error definitions, different commands might require injecting in different places (e.g. if_less_command
) and we might end up with an over-fitted code that might be better defined manually in the grammars.
If we get the grammar merging sorted out, then the transpiler could provide a common message when the wrong-level-syntax is encountered:
def default_func(orig_level, self, args, meta):
raise Exception(f'This code works for level {orig_level} but you are in level {self.level} now.')
def hedy_transpiler(level):
def decorator(c):
TRANSPILER_LOOKUP[level] = c
current = inspect.getmembers(c, predicate=inspect.isfunction)
funcs = [v[0] for v in current if v[0][0] != '_' and v[1].__qualname__.startswith(f'ConvertToPython_{level}') or v[1].__qualname__.startswith(f'source_map_rule')]
TRANSPILER_FUNCTIONS[level] = funcs
if level > 1:
i = level-1 if level-1 in TRANSPILER_FUNCTIONS.keys() else level-2
names = [f for f in TRANSPILER_FUNCTIONS[level]]
to_add = [f for f in TRANSPILER_FUNCTIONS[i] if f in names and f'level{i}_{f}' not in names]
for f in to_add:
name = f'level{i}_{f}'
setattr(c, name, lambda self, args, meta: default_func(i, self, args, meta))
c.level = level
return c
return decorator
If we want to provide a custom error message, we would be able to override a rule as follows:
class ConvertToPython_8_9(ConvertToPython_7):
def level7_repeat(self, meta, args):
raise hedy.exceptions.WrongLevelException(7, 'repeat', "repeat_dep", meta.line)
In a nutshell, altering the grammar merging together with the transpiler code illustrated above add a significant layer of complexity while the benefits remain limited.
New Problem definition Given the above, it would be better to redefine the problem once more in the following way. In an ideal world:
Ideas
Obviously, the problem definition is very ambitious (and also needs to be refined). I have not investigated these options, so I do not know whether they are even possible.
I set off to explore fuzzy parsing, which is mentioned in point 1. in the previous comment. My concern was that if such a tool exists, it would not be in a production-ready state. So I employed a rather practical strategy: only search for tools and inspect them if they have a readme file, time-boxed to 2 hours. In short, I was not able to find any tool that achieves the goal and has any documentation. However, during my searching I noticed a couple of things:
Research papers define fuzzy parsers as syntax analyzers that perform analysis on incomplete or incorrect input, so such tools probably exist. However, I did not encounter any probably due to the searching strategy I used.
As discussed with Felienne, closing the ticket and hoping that it might be useful for someone in the future.
When we have merged #1314, shall we try to use grammar composition to merge levels instead of the hand-written merging? I am not 100% sure it will work, maybe we will get a bit of ambiguity but it will clean up the code a lot!
Ideally we might also be able to mix in rules from lower levels with a prefix automatically to catch level n rules at level n-1 so we can give better error messages (like we do manually now with _dep rules) Do you feel like diving into this @boryanagoncharenko?