Big performance hit after switch from `Antlr4.CodeGenerator` to `Antlr4BuildTasks` and `Antlr4.Runtime.Standard`

GeorgDangl commented 1 year ago

Hey @kaby76,

I've just recently discovered your work around ANTLR and C#, and I'm impressed! Thank you very much, first of all.

So I was pretty excited to be able to move to the newer targets, and had good results in some project. However, I had a very big performance regression in one grammar, and I can't understand why.

The grammar is pretty simple. Essentially, some texts can have square brackets denoting some placeholders, which we try to parse so that we can build a data model around it.

The grammar looks like this:

grammar GAEB2000PlainTextTextAdditions;

@parser::members
{
    protected const int EOF = Eof;
}

@lexer::members
{
    protected const int EOF = Eof;
    protected const int HIDDEN = Hidden;
}

/*
 * Parser Rules
 */

text         : ( textAddition | plainText | brackets )* compileUnit ;
textAddition : OpenBrack (buyer=TA | bidder=TB) identifier+=Digit+ content=textContent CloseBrack ;
plainText    : ( Digit | Text | textAddLike )+ ;
textAddLike  : TA | TB ;
brackets     : OpenBrack | CloseBrack ;
textContent  : heading=plainText? OpenBrack body=plainText CloseBrack tail=plainText? ;
compileUnit    :    EOF ;

/*
 * Lexer Rules
 */

TA          : 'TA'  ;
TB          : 'TB'  ;
OpenBrack   : '['   ;
CloseBrack  : ']'   ;
Digit       : [0-9] ;
Text        : .     ;

Now, we're building a Lexer and a Parser from it and try it out it with the following test:

        [Fact]
        public void CheckAntlrPerformance()
        {
            var inputString = @"There is some text before

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

[TA11Offered Type: [....................................................] 
]

Location   : [TA12[.......]]";

            for (var i = 0; i < 10_000; i++)
            {
                PerformAntlrParsing(inputString);
            }
        }

        private void PerformAntlrParsing(string inputString)
        {
            var inputStream = new AntlrInputStream(inputString);
            var lexer = new GAEB2000PlainTextTextAdditionsLexer(inputStream);
            var tokenStream = new CommonTokenStream(lexer);
            var parser = new GAEB2000PlainTextTextAdditionsParser(tokenStream);

            parser.Interpreter.PredictionMode = PredictionMode.SLL;
            TextContext result;
            result = parser.text();
        }

I was hoping for a bit of a performance improvement, but it actually went from around 2.4 seconds to 29.2 seconds, so roughly an order of magnitude slower. Profiling a few runs doesn't give me a lot of information. The memory allocation is pretty small throughout the test, and there's not a lot of GCs happening either. I see this after a few thousand iterations:

So, I'm a bit at a loss here how to proceed. I've checked the grammar also with tranalyze, but if I understand it correctly, the output is just statistics or telling me a rule is fine (NotEmpty):

trparse GAEB2000PlainTextTextAdditions.g4 | tranalyze -s text
7 occurrences of Antlr - nonterminal def
23 occurrences of Antlr - nonterminal ref
6 occurrences of Antlr - terminal def
3 occurrences of Antlr - keyword
5 occurrences of Antlr - literal
Rule text is NonEmpty
Rule textAddition is NonEmpty
Rule plainText is NonEmpty
Rule textAddLike is NonEmpty
Rule brackets is NonEmpty
Rule textContent is NonEmpty
Rule compileUnit is NonEmpty
Rule TA is NonEmpty
Rule TB is NonEmpty
Rule OpenBrack is NonEmpty
Rule CloseBrack is NonEmpty
Rule Digit is NonEmpty
Rule Text is NonEmpty

For reference, the strings we usually encounter in the wild are at most on the order of a kilobyte size or so, and usually just contain one or very few such tags.

kaby76 commented 1 year ago

I'm able to reproduce the perf issue. Yes, Antlr4cs is faster.

(Note, Antlr4cs keeps a write lock on a file somewhere, because the template driver code fails to open the file the second time here. The Antlr4.Runtime.Standard code does not write lock the file with identical code. I got around the problem by placing the input in unique files, e.g., ./bin/Debug/net6.0/Test.exe input1.txt input2.txt input3.txt input4.txt input5.txt input6.txt input7.txt input8.txt input9.txt input10.txt input11.txt input12.txt input13.txt input14.txt input15.txt input16.txt input17.txt input18.txt input19.txt input20.txt input21.txt input22.txt input23.txt input24.txt input25.txt input26.txt input27.txt input28.txt input29.txt input30.txt input31.txt input32.txt input33.txt input34.txt input35.txt input36.txt input37.txt input38.txt input39.txt input40.txt input41.txt input42.txt input43.txt input44.txt input45.txt input46.txt input47.txt input48.txt input49.txt input50.txt input51.txt input52.txt input53.txt input54.txt input55.txt input56.txt input57.txt input58.txt input59.txt input60.txt input61.txt input62.txt input63.txt input64.txt input65.txt input66.txt input67.txt input68.txt input69.txt input70.txt input71.txt input72.txt input73.txt input74.txt input75.txt input76.txt input77.txt input78.txt input79.txt input80.txt input81.txt input82.txt input83.txt input84.txt input85.txt input86.txt input87.txt input88.txt input89.txt input90.txt input91.txt input92.txt input93.txt input94.txt input95.txt input96.txt input97.txt input98.txt input99.txt input100.txt)

I do see is that this grammar contains a lot of ambiguity, and does a tremendous amount of fall-back with max-k the size of the file. That's a grammar problem, but maybe there's something that Antlr4cs does better on fall backs?

$ trperf input.txt | column -t
Time to parse: 00:00:00.0743970
Decision  Rule          Invocations  Time      Total-k  Max-k  Fallback  Ambiguities  Errors
0         text_         4            0.095922  182      76     2         2            106
1         text_         0            0         0        0      0         0            0
2         textAddition  0            0         0        0      0         0            0
3         textAddition  4            0.001372  8        2      2         2            7
4         plainText     0            0         0        0      0         0            0
5         plainText     579          0.437253  119438   483    573       501          584
6         textContent   0            0         0        0      0         0            0
7         textContent   0            0         0        0      0         0            0

Could you reraise the issue in https://github.com/antlr/antlr4/issues? It's not an Antlr4BUildTasks problem. That's sort of like blaming MSBuild or dotnet.exe or make for the issue, whose purpose is just to build an executable. It will have a wider audience. @parrt should see this.

_(More info: The Java target does not exhibit this problem. I wonder if caching of prediction context is working in Java, but not in CSharp so well.)

GeorgDangl commented 1 year ago

Oh, thank you very much for the quick reply! I'll post the issue over there😀

About the grammar having ambiguity, do you happen to have some suggestion there, or is there some documentation I can read about that?

GeorgDangl commented 1 year ago

Opened a new issue in the antlr organization.

kaby76 / Antlr4BuildTasks

Big performance hit after switch from `Antlr4.CodeGenerator` to `Antlr4BuildTasks` and `Antlr4.Runtime.Standard` #52