zarnovican opened this issue 1 year ago
The new release (1.1.8) should solve the Earley non-determinism.

I looked a bit deeper, and this is happening because of this line in the Python grammar:

`?string_concat: string+`

If we remove the `?`, the bug is solved; the same happens if we remove the `+`. It looks like a bug in the reconstructor: it fails to understand that `string string` in the AST should never be interpreted as a single `string_concat`, because then it would already be a branch.
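For reference, lark's `?` prefix inlines a rule into its parent whenever it matches exactly one child, so a lone `string` leaves no `string_concat` node behind. A sketch of the rule and the workaround described above (illustrative, not necessarily the final patch):

```lark
// current rule: the "?" inlines string_concat into its parent
// whenever it has exactly one child
?string_concat: string+

// workaround: always keep an explicit string_concat node,
// so the reconstructor can round-trip it unambiguously
string_concat: string+
```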
Short description
When running the example examples/advanced/reconstruct_python.py, the code fails with an AssertionError in approx. 50% of cases.

To Reproduce

Execute the example in a loop; "1" indicates an AssertionError, "0" indicates success.
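A minimal sketch of such a loop (assuming it is run from the lark repository root; the path and the iteration count are illustrative):

```shell
# Run the example 20 times; print 1 for a failing run
# (e.g. AssertionError, non-zero exit), 0 for a success
for i in $(seq 20); do
  if python examples/advanced/reconstruct_python.py >/dev/null 2>&1; then
    printf 0
  else
    printf 1
  fi
done
echo
```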
Test environment
OS: Linux
Python: 3.11
Lark: 1.1.5
I have tested a few older Pythons and older Larks as well. No change.
Long description
As I understand the example, it converts the Python source text to an AST, then back to Python via the "Reconstructor", and then again to an AST. The assertion is that the first and second AST trees should be the same.
After some debugging, I was able to isolate a much smaller reproducer:
The above code converts `foo('a', 'b')` to an AST and back to Python. The Reconstructor produces four variations of the code. Notice the missing comma "," between the two arguments: `'a''b'` is two string literals concatenated into a single argument, while `'a', 'b'` is two separate arguments. The same problem happens in the original example; hence the AssertionError.

Problem: Python grammar
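As a quick sanity check of why the reconstructed variants are not equivalent, Python merges adjacent string literals into one value, so the comma-less rendering changes the call's arity:

```python
def foo(*args):
    return args

# Adjacent string literals are implicitly concatenated ...
assert 'a' 'b' == 'ab'

# ... so the reconstructed call passes ONE argument,
assert foo('a''b') == ('ab',)

# while the original call passes TWO
assert foo('a', 'b') == ('a', 'b')
```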
I'm not experienced enough to understand how the Reconstructor works. Maybe the problem is ambiguity in the Python grammar defined in Lark. I cannot fathom, however, how the Reconstructor could arrive at a string concat from an "arguments" tree node with two children :confused:
Problem: non-determinism
My initial motivation for the root-cause analysis was actually finding the source of the non-determinism. Why does a loop inside one Python process return consistent results, while repeated runs of the process differ? I haven't found "the line" where it happens. The closest I came is this line, where the parser output `unreduced_tree` is already "random", while the input, AFAIK, is not.

In my opinion, the source of entropy is Lark's usage of `set()`s. Every time the Python process is executed, the elements of its sets are iterated in a different order, because string hashes are randomized per process. This is expected behavior from Python's point of view, but it has cascading effects on Lark's processing, when, for example, rules are inspected in random order.

:thinking: Maybe one way of fixing it is to switch from `set()` to `dict()` with some dummy value. Python guarantees that dict keys are iterated in insertion order (since 3.7).
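A minimal sketch of the suggested switch (the names are illustrative, not Lark internals): dict key order follows insertion order since Python 3.7, while set iteration order depends on element hashes, which for strings are randomized per process via PYTHONHASHSEED.

```python
rules = ["string", "number", "name", "expr"]

# set(): iteration order depends on the per-process hash seed,
# so it can differ between runs of the interpreter
rule_set = set(rules)

# dict with dummy (None) values: iteration follows insertion
# order, which is stable across runs
rule_dict = dict.fromkeys(rules)
assert list(rule_dict) == rules

# both hold the same elements; only the ordering guarantee differs
assert set(rule_dict) == rule_set
```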