Closed: stanislaw closed this pull request 1 year ago.
The tests are passing in this branch, but the work is not finished yet. I would like to collect feedback before I rebase and finalize all of the WIP: and Question: parts.
The branch is now rebased against the latest commits on the master branch, and the tests are 99% passing (see the WIP note about the \n\n newlines that I could not figure out yet).
There are a few places where I slightly revert the idea of this commit: https://github.com/textX/Arpeggio/commit/db503c961bb7b4da2eddaf1615f66a193a6f9737. This is only because my diagnostics implementation was designed around the previous behavior, before these commits appeared. At the same time, I am still not convinced that the newly introduced behavior of Optionals will not confuse users, because it is already quite confusing to me. That said, I fully support the idea of Arpeggio reporting the "pathological" grammar cases like the ones you explained in your comments.
I have left several WIP: and Question: marks, and I don't consider any of them a blocker on my end: I only need your thoughts / go / no-go on these points. For one thing, I am returning the result of matching Optional as a tuple, and then in a few places I have to do an extra if isinstance(result, tuple) check. I would be curious whether there is a cleaner way to implement this.
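To make this discussion point concrete, here is a rough, hypothetical sketch of the pattern (the names NoMatchError, parse_optional, and consume are mine for illustration, not the branch's code): Optional returns a tuple so that the caller can tell "no match, but that is fine" apart from a real result, at the price of the isinstance check.

class NoMatchError(Exception):
    """Stand-in for Arpeggio's NoMatch in this illustration."""


def parse_optional(parse_inner, position):
    # Optional never fails hard; on a weak failure it returns a tuple so
    # that the failure can still be reported later.
    try:
        return parse_inner(position)
    except NoMatchError as weak_failure:
        return (None, weak_failure)


def consume(result):
    # The extra check mentioned above: callers must special-case the tuple.
    if isinstance(result, tuple):
        value, _weak_failure = result
        return value
    return result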
Aside from the discussion points, please note how much better / more precise the reporting became in many (most?) cases. The previous implementation of _nm_raise was quite difficult to understand and debug, and I could not find an alternative way of doing what I did. I also think that with the *NoMatch classes setup it would be a little easier to eventually implement deterministic printing of the textX rule annotations, as I am suggesting here: https://github.com/textX/textX/issues/398.
@igordejanovic, I have tested my branch and come to the following conclusions:
1) I had to make a few changes in StrictDoc's grammar to adjust it to the latest Arpeggio master branch, see https://github.com/strictdoc-project/strictdoc/pull/1102. I consider that this change made the grammar a little healthier 👍
2) I think I have let the genie out of the bottle. Indeed, without trimming/flushing, the reporting becomes very detailed because it can go far back from the very failed expression.
At first I got scared, and then I thought that this feature is extremely nice to have because it educates the users about how my grammar can be used. Sometimes the weakly failed expressions are not sorted according to their position; this is something I could work on next. Very often, several failed expressions are reported on the same line; it should be trivial to cluster them onto one line (a sorting/grouping sketch follows the output below).
textx.exceptions.TextXSyntaxError: /Users/Stanislaw/workspace/projects/strictdoc-project/strictdoc/docs/strictdoc_30_credits.sdoc:6:1: Expected the following alternative inputs:
4:1: 'VERSION: '
4:1: 'CLASSIFICATION: '
4:1: 'OPTIONS:'
5:1: '[GRAMMAR]'
5:1: '[FREETEXT]'
5:1: '[SECTION]' or '[' or '[COMPOSITE_' or '[FRAGMENT_FROM_FILE]'
6:1: EOF
=> 'What if *[FREETEXT]'
error: Parallelizer: One of the child processes has exited prematurely.
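As a side note on the sorting/clustering mentioned in point 2, a small post-processing pass over the collected weak failures could order them by position and group the ones that share a line. The data shapes below are illustrative only, not the branch's internal structures:

from itertools import groupby

# (position, expected rule) pairs as they might be collected during parsing.
failures = [
    ((5, 1), "'[FREETEXT]'"),
    ((4, 1), "'VERSION: '"),
    ((5, 1), "'[SECTION]'"),
    ((4, 1), "'OPTIONS:'"),
]

failures.sort(key=lambda item: item[0])  # order by (line, column)
for line, group in groupby(failures, key=lambda item: item[0][0]):
    alternatives = " or ".join(rule for _, rule in group)
    print(f"{line}: {alternatives}")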
3) Currently, the actual expression that triggered the final exception is not marked as such. This is something that I can also do next. Also, the reporting of the actual position does not always work correctly for all cases; I know where to fix it, it is just a matter of another day.
4) Overall, this code might need flushing, or reducing the final printed alternatives to something like "the last 10 input lines".
5) I can see this behavior being optional. Without a special option, Arpeggio would only give the actual failed option, without the whole trail of weakly failed rules. With the current design of the composable NoMatch classes, this could be achieved by reporting only the failed option itself, plus one or two levels of the surrounding failed expressions above it.
6) I have checked the performance on my StrictDoc integration test suite. It has 158 tests that do more than just textX, but all of them do use it. I observe the following difference: 110 s for the Arpeggio master branch versus 120 s for my branch. This is very rough, of course.
It would be great if you could try this branch yourself. I am curious whether you see good potential in developing this further.
P.S. Given this increased visibility into the weakly failed rules, I also pay more attention when I see obscure positions reported for the available options. In particular, there are cases when I have [SECTION] and [/SECTION] opening and closing tags, and if I break the closing tag into [/SEC_TION], I start getting obscure suggestions from textX/Arpeggio: the parser does not point me to the SEC_ mistake but shows something else. This is something that I would like to focus on in a separate issue because it looks unrelated to my reporting branch. It also might have to do with the design of my grammar, so I will investigate this more closely.
After some tinkering with formatting in this branch (it is on top of this PR), I got this:
textx.exceptions.TextXSyntaxError: /Users/Stanislaw/workspace/projects/strictdoc-project/strictdoc/docs/strictdoc_30_credits.sdoc:94:2: Expected the following alternative inputs:
(3, 1) 'UID: '
(3, 1) 'VERSION: '
(3, 1) 'CLASSIFICATION: '
(3, 1) 'OPTIONS:'
(4, 1) '[GRAMMAR]'
(15, 1) '[FREETEXT]'
(79, 1) EOF
(79, 1) '[COMPOSITE_'
(79, 1) '[FRAGMENT_FROM_FILE]'
(80, 1) 'UID: '
(80, 1) 'LEVEL: '
(80, 36) /./
(94, 1) '[/SECTION]'
(94, 1) '[FREETEXT]'
(94, 1) '[SECTION]'
(94, 1) '[COMPOSITE_'
(94, 1) '[FRAGMENT_FROM_FILE]'
(94, 2) /[A-Z]+(_[A-Z]+)*/
=> 'EETEXT] [*/SCTION] '
Here are a few more observations:
I tested SDoc documents independently with and without this branch, and it looks like the performance degrades quite noticeably. I still don't have a good number, but it can be 25%+, which is a no-go given that being very fast is one of PEG's original strengths. I think most of the time is spent on constructing the composite *NoMatch classes, as well as on capturing the weakly failed rules for all expressions even when they are successful and the parser moves on past them.
With this in mind, I tried to imagine how this branch would work under a config option such as --verbose. I think it is not trivial to combine the design of _nm_raise with the design of this branch, where:
- Every exception is a composite of the very failed expression up to the failed parent expression (the *NoMatch hierarchy).
- Weakly failed rules get accumulated as the input is being parsed.
I have spent quite some time working on this feature and I really like it, but I don't see an easy way to combine the performance requirement with the level of detail given by this branch. With this said, please don't feel forced to integrate it at all cost just because I am an active user. I myself certainly want a solution that does not degrade the nominal performance of Arpeggio. I am very interested in both having Arpeggio work as fast as possible and, at the same time, having an option to make it give more detailed diagnostics like what this branch can do.
I would be curious to know if:
1) You find that this more detailed reporting feature makes sense, and we should find a way to integrate it without compromising Arpeggio's performance.
2) You have some insights into how this code could be synthesized with the master branch to allow the detailed diagnostics to work under a config parameter.
@stanislaw I did a benchmark test with textX against both the current Arpeggio master branch and this PR. The results are quite discouraging. This branch is about 2x slower than the current master. Also, memory consumption roughly doubles. Definitely a no-go.
I've been thinking about this detailed report and I must say that, while it looked promising at first, I'm not sure anymore that it would be easy to grasp for the end user or that it would bring any benefit. For the developer, running the parser in debug mode always gives more info. We could investigate the possibility of keeping a limited backtrace of weak failures. While that could make the memory usage profile better, I still don't expect that it would improve the speed degradation.
Also, trying to keep both the nm_raise and this PR's approach behind a switch would, in my view, impede the maintainability and the simple design of Arpeggio.
I understand that you have spent quite some time investigating this approach, and I thank you for that. Parsing is not easy. I have also been down that rabbit hole myself of trying to change the design in the hope of improving some aspect, but the constraint of keeping everything both performant and simple often got in my way, making me retreat.
I think that we should also investigate extending the nm_raise approach to get more intelligent error reports. Probably by sticking to the "report the failure that happens furthest down the stream" approach as the easiest to understand.
I don't want to discourage you from continuing with this direction, but at the moment I don't see in what direction to go to improve on the current master.
@igordejanovic
I literally agree with every paragraph and sentence of your reply. I didn't realize I had access to the same performance tools that you have. I have run (cd perf-tests/; ./run_speed.sh), and indeed the results are very clearly a no-go.
While I consider this branch to be a failure, I certainly learned a couple of things about what Arpeggio does, and before we come back to _nm_raise, I have another idea that I want to check with you. I also thought about what the performance eaters are, in my branch and in general, so here are some shots in the dark:
What if we introduced something similar to what _nm_raise does, but had every weakly failed exception logged to a fast circular buffer of configurable length (see collections.deque as a reference)? Instead of collecting composite collections of NoMatch classes in weakly_failed_rules (none of the weakly failed rules arrays of this PR#102 branch would exist with this approach), every exception that reaches nm_raise is written to the circular buffer right away (possibly without being propagated up, if we can optimize that way), and the final "Expected ..." string is constructed as a synthesis of all weakly failed rules previously recorded in the buffer plus the actual failed exception. During the synthesis, the rules are flattened and sorted according to their position as needed. The default behavior gives a window of, let's say, 10 lines, and an extended option allows increasing the window for broader context.
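A minimal sketch of what I mean, with illustrative names and record shapes (this is neither the branch's code nor Arpeggio's API): weak failures go into a bounded deque, and the final report is synthesized from the buffer plus the expression that actually failed.

from collections import deque


class WeakFailureLog:
    """Bounded log of weak failures; the oldest entries fall out automatically."""

    def __init__(self, maxlen=10):
        self.buffer = deque(maxlen=maxlen)

    def record(self, position, rule_name):
        # In the idea above, this would be called from nm_raise.
        self.buffer.append((position, rule_name))

    def synthesize_report(self, final_position, final_rule):
        # Flatten, deduplicate, and sort by position, then merge in the
        # expression that actually failed.
        entries = sorted(set(self.buffer) | {(final_position, final_rule)})
        lines = [f"{line}:{col}: {rule}" for (line, col), rule in entries]
        return "Expected the following alternative inputs:\n" + "\n".join(lines)


# Usage sketch:
log = WeakFailureLog(maxlen=10)
log.record((4, 1), "'VERSION: '")
log.record((5, 1), "'[GRAMMAR]'")
print(log.synthesize_report((6, 1), "EOF"))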
If the circular buffer approach were followed, another building block could be an object pool implementation (this is just the first Google hit: https://github.com/T-baby/pondpond). If the NoMatch objects evicted from the circular buffer are recycled back into an object pool, you don't have to endlessly create NoMatch objects over and over again. I could imagine that this could give a few extra bits of performance even within the current master branch. Do you actually have an idea of what the average allocation rate is for an arbitrary grammar/input? Do the allocations happen often with the _nm_raise approach?
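And a toy sketch of the object-pool idea under made-up names (nothing here is Arpeggio or pondpond API): records evicted from the circular buffer are reset and reused instead of allocating a new object for every weak failure.

class FailureRecord:
    __slots__ = ("rule", "position")

    def reset(self, rule, position):
        self.rule = rule
        self.position = position
        return self


class FailureRecordPool:
    def __init__(self):
        self._free = []

    def acquire(self, rule, position):
        # Reuse a recycled record if one is available, allocate otherwise.
        record = self._free.pop() if self._free else FailureRecord()
        return record.reset(rule, position)

    def release(self, record):
        # Called, for example, when an entry is evicted from the circular buffer.
        self._free.append(record)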
I actually spent some time playing with the broader diagnostics of this branch while working against StrictDoc documents, and I am quite convinced that the advanced diagnostics output could be optimized/presented visually in the form of a tutorial-like experience as you work with a given grammar.
Let me know if any of the above makes sense. I am very much ready to close this branch and move on, but I think there is some opportunity for more precise diagnostics/reporting that I want to double-check before giving up.
While I consider this branch to be a failure
Having a research background I would definitely not call this a failure. You have investigated a new approach and we all learned something new.
In the textX perf tests there is also a profiling report in the form of a PDF. Look at the benchmark branch. This can give you a clue about where most of the time is spent.
I guess having a nm_raise with a circular buffer of NoMatch exceptions which progresses as the input is consumed might do the trick. We will keep the simple approach of phased NoMatch construction while still keeping several previous exceptions at previous locations which were not raised as a new path was found. I can see that this can easily be made configurable: for example, in debug mode the deque would be larger to give more information, while in production mode the deque length could effectively be 1, which would give the current behavior, or we could use some greater value which gives good results. With this approach I think that both the speed and the memory profile should stay about the same.
Do you actually have an idea of what the average allocation rate is for an arbitrary grammar/input? Do the allocations happen often with the _nm_raise approach?
I don't have figures. We could find out by introducing a counter in nm_raise. My gut feeling tells me that it is a lot: for every optional, repetition, etc., the parser must try to parse the input and a weak failure will happen. If we create a new *NoMatch object for every failure, it quickly adds up to both slowdown and memory consumption. Thus, nm_raise avoids creating a new NoMatch if we have not progressed in the input.
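For what it is worth, such a counter could even be bolted on from the outside, assuming Arpeggio's internal Parser._nm_raise keeps its current name and is simply called through (this wrapper is my assumption, not an Arpeggio feature):

import arpeggio

nm_raise_calls = 0
_original_nm_raise = arpeggio.Parser._nm_raise


def _counting_nm_raise(self, *args, **kwargs):
    # Count how often a weak failure goes through _nm_raise, then defer to
    # the original implementation, which raises NoMatch.
    global nm_raise_calls
    nm_raise_calls += 1
    return _original_nm_raise(self, *args, **kwargs)


arpeggio.Parser._nm_raise = _counting_nm_raise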
While I consider this branch to be a failure
Having a research background I would definitely not call this a failure. You have investigated a new approach and we all learned something new.
Definitely yes! I would be ready to continue digging, if a promising direction was found 😄
In the textX perf tests there is also a profiling report in the form of a PDF. Look at the benchmark branch. This can give you a clue about where most of the time is spent.
I am taking a note and will take a look.
I guess having a nm_raise with a circular buffer of NoMatch exceptions which progresses as the input is consumed might do the trick. We will keep the simple approach of phased NoMatch construction while still keeping several previous exceptions at previous locations which were not raised as a new path was found. I can see that this can easily be made configurable: for example, in debug mode the deque would be larger to give more information, while in production mode the deque length could effectively be 1, which would give the current behavior, or we could use some greater value which gives good results. With this approach I think that both the speed and the memory profile should stay about the same.
I have tried this already in this branch. The circular buffer is right in nm_raise, with no changes made to the existing logic. The only change is that the printed diagnostic looks over the circular buffer.
I have hit another problem with this approach, though. If everything is written to the buffer, then at report time the exceptions that were eventually matched at a higher level are printed as well.
A simple example:
import pytest
from arpeggio import EOF, NoMatch, ParserPython

def test_ordered_choice():
    def grammar():
        return ["a", "b", "c"], EOF

    parser = ParserPython(grammar)
    with pytest.raises(NoMatch) as e:
        parser.parse("bb")
    assert (
        "Expected EOF at position (1, 2) => 'b*b'."
    ) == str(e.value)
This test fails because the 'a' alternative from the ordered choice is printed as well. I say that 'a' is effectively "successful" because 'b' matched, and therefore the whole parent OrderedChoice is successful.
"Expected 'a' at position (1, 1) or EOF at position (1, 2) => '*bb'."
Which leads me to the conclusion that either:
a) This circular buffer has to be "garbage-collected": weakly failed rules have to be removed once their parent expression eventually succeeds. I have tried this for OrderedChoice: the test passes, but the code looks bulky and is very likely to eat performance, see here (a simplified sketch of this rollback follows after point b).
b) A more fine-grained approach to populating the buffer has to be found. The level of nm_raise is too low: there I cannot decide whether a parent expression of the current exception will eventually fail or not. An alternative (only a guess) is that every non-Match class, such as OrderedChoice, would write the weakly failed rules to the buffer, and nm_raise would not have any references to the buffer.
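For option a), the rollback could perhaps be as simple as remembering the buffer length before the alternatives are tried and popping back to it once the parent expression succeeds. A standalone sketch with made-up names (not the branch's OrderedChoice code; it also assumes nothing has been evicted from the left of the buffer in the meantime):

class NoMatchError(Exception):
    """Stand-in for Arpeggio's NoMatch in this illustration."""


def parse_ordered_choice(alternatives, parse_alternative, failure_buffer):
    # Remember how many weak failures were recorded before this choice.
    checkpoint = len(failure_buffer)
    for alternative in alternatives:
        try:
            result = parse_alternative(alternative)
        except NoMatchError:
            continue
        # The parent expression succeeded: weak failures recorded while
        # probing the earlier alternatives are no longer relevant.
        while len(failure_buffer) > checkpoint:
            failure_buffer.pop()
        return result
    raise NoMatchError("no alternative matched")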
Let me know if you see anything promising in this direction. After all, I feel like I am stepping into parser territory with almost no theoretical background, so I am moving by trial and error rather than by educated guesses 😄
Closing this in favor of the new one #114.
This patch is more advanced than https://github.com/textX/Arpeggio/pull/100.
WIP checklist:
weakly_failed_rules__
Unsuccessful match of the whole PE - full backtracking
The NoMatch class is made smarter and tracks the exact match errors. With this design, reporting an error translates to printing the diagnostics of all failed subexpressions of a failed expression.
For example, for Sequence, the SequenceNoMatch class collects the failed expression as well as all previously matched (but failed) Optional expressions.
The naming of *NoMatch certainly introduces some redundancy in the class names; however, without this measure I had a hard time visualizing the internal structure of the failed expression to be printed.
Open question: The advantage of this approach would be that a user sees all alternatives that are possible in cases like this. The cost: the implementation has to control whether several "at position" strings have to be presented instead of a single "at position" at the end of the inspection message.
Code review checklist
CHANGELOG.md, no need to update for typo fixes and such).