tunnelvisionlabs / antlr4cs

The original, highly-optimized C# Target for ANTLR 4

C# compiler can't compile generated lexer #72

Open EvgeniyKo opened 10 years ago

EvgeniyKo commented 10 years ago

CSC gives me the following error with no file name, line, or column, only the project name:

Error   4   An expression is too long or complex to compile

I found the reason for this error:

Unfortunately, there's not much we can do here without isolating the issue further. Usually, this error results from a deeply chained expression in your source, such as a long chain of string concatenations ("aaaa" + "bbbb" + ...). Such expressions are parsed and bound recursively in the compiler, and there is a depth at which the compiler will run out of stack space to continue parsing. The exact expression complexity supported depends on the data stashed in the compiler's stack frames, which can vary subtly between updates - if an expression in your solution was right at the boundary before, it may have tipped over.

https://connect.microsoft.com/VisualStudio/feedback/details/785173/got-error-cs1647-an-expression-is-too-long-or-complex-to-compile-in-vs2012

There is a huge string in the generated lexer, over 5000 lines long, which looks like this:

    public static readonly string _serializedATN =
        "\x3\xAF6F\x8320\x479D\xB75C\x4880\x1605\x191C\xAB37\x2\x37D\x2D79\b\x1"+
        "\x4\x2\t\x2\x4\x3\t\x3\x4\x4\t\x4\x4\x5\t\x5\x4\x6\t\x6\x4\a\t\a\x4\b"+
        "\t\b\x4\t\t\t\x4\n\t\n\x4\v\t\v\x4\f\t\f\x4\r\t\r\x4\xE\t\xE\x4\xF\t\xF"+
        "\x4\x10\t\x10\x4\x11\t\x11\x4\x12\t\x12\x4\x13\t\x13\x4\x14\t\x14\x4\x15"+
        "\t\x15\x4\x16\t\x16\x4\x17\t\x17\x4\x18\t\x18\x4\x19\t\x19\x4\x1A\t\x1A"+
        "\x4\x1B\t\x1B\x4\x1C\t\x1C\x4\x1D\t\x1D\x4\x1E\t\x1E\x4\x1F\t\x1F\x4 "+
        "\t \x4!\t!\x4\"\t\"\x4#\t#\x4$\t$\x4%\t%\x4&\t&\x4\'\t\'\x4(\t(\x4)\t"+
............

Any workaround would be very helpful.
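
To isolate the problem, a source file with a chosen number of concatenated literals can be generated and fed to the compiler. The following is a minimal sketch: `ConcatChainGenerator` and `Repro` are hypothetical names, and the operator count at which a given compiler version fails is not stated anywhere in this thread, so it has to be found by experiment.

```csharp
using System;
using System.Text;

public static class ConcatChainGenerator
{
    // Builds C# source for a class with one string field that is initialized
    // by `segmentCount` literals joined with '+', i.e. segmentCount - 1
    // operators in a single expression (the shape of _serializedATN above).
    public static string Generate(int segmentCount)
    {
        if (segmentCount < 1)
            throw new ArgumentOutOfRangeException("segmentCount");

        var sb = new StringBuilder();
        sb.AppendLine("public static class Repro");
        sb.AppendLine("{");
        sb.AppendLine("    public static readonly string Value =");
        for (int i = 0; i < segmentCount; i++)
        {
            // Every line but the last ends with '+'; the last ends the statement.
            string terminator = i < segmentCount - 1 ? "+" : ";";
            sb.AppendLine("        \"\\x4\\x2\\t\\x2\"" + terminator);
        }
        sb.AppendLine("}");
        return sb.ToString();
    }
}
```

Writing the result of `Generate(n)` to a `.cs` file and compiling it for increasing `n` shows roughly where a particular compiler gives up.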

sharwell commented 10 years ago

I believe the only reliable solution to this would be emitting the serialized ATN as an embedded resource rather than including it as a string. It's certainly achievable during the build process, but I haven't looked into the specifics.

Considering I've never heard of someone encountering this error with the C# target, can you give some specifics about the size of your lexer? If possible, could you send me a copy of it for further analysis?

EvgeniyKo commented 10 years ago

The lexer grammar is quite small, but it has lots of keywords. The generated lexer is 405 KB.

Unfortunately, I can't send you the grammar because my boss won't allow it. If you want, I can send you the generated lexer.

DanaNJW commented 10 years ago

I have a .g4 that reproduces this problem in VS 2010. I can send it directly to you to diagnose/fix this issue, but I don't think it will be OK to post it publicly. What's the best way to get it to you without making it publicly available?

sharwell commented 10 years ago

An email address is associated with the Tunnel Vision Laboratories organization here. You can send it to that address. https://github.com/tunnelvisionlabs

EvgeniyKo commented 10 years ago

Now I've got the same issue in the parser. Shall I file another bug?

sharwell commented 10 years ago

No, it's the same issue. Here are some potential ways to resolve this:

  1. Use the Roslyn compiler toolchain, which is already available as a preview for Visual Studio 2013 and will eventually become the standard C# compiler starting with Visual Studio 14. Either it does not have the same limitation regarding large grammars, or the limit is raised so far that I can no longer reproduce the problem.
  2. Rather than emit a string literal for the serialized ATN in the source code, emit a binary file containing the raw binary data for the serialized ATN and update the MSBuild steps to automatically embed the file in the compiled assembly. The declaration for the _ATN field in the parser would then be updated to load the data from the embedded resource instead of from a string literal.
  3. Break the serialized ATN string into segments with a maximum size, similar to how the Java target works. Rather than one string with 5000 + operators, you might have 5 strings with 1000 + operators each (the recursion depth in the compiler is bounded by the number of operators in a single expression). The big difference is the Java target's string limit is actually based on a clear definition of limits in the class file format used by the JVM, so there's no question where the limit needs to be in order to ensure all grammars work properly. In the C# target, the limit is an arbitrarily imposed limit which is neither documented nor allowed by the language specification.

Considering that the first item is already available and that the overall limit imposed by the earlier compilers is much higher than seen in the Java target, I'm inclined to not make any changes (at least for the time being).
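
For reference, the load-time half of option 2 might look roughly like the sketch below. It is written under assumptions and is not something the tool or runtime supports: `SerializedAtnLoader`, the resource name passed in, and the storage format (raw 16-bit code units, little-endian) are all hypothetical.

```csharp
using System;
using System.IO;
using System.Reflection;

public static class SerializedAtnLoader
{
    // Hypothetical helper: reads an embedded resource and rebuilds the string
    // that the generated code would otherwise hold as a literal.
    public static string Load(Assembly assembly, string resourceName)
    {
        using (Stream stream = assembly.GetManifestResourceStream(resourceName))
        {
            if (stream == null)
                throw new FileNotFoundException("Embedded resource not found.", resourceName);

            using (var buffer = new MemoryStream())
            {
                stream.CopyTo(buffer);
                return Decode(buffer.ToArray());
            }
        }
    }

    // Assumed format: one char per two bytes, low byte first.
    public static string Decode(byte[] data)
    {
        if (data.Length % 2 != 0)
            throw new InvalidDataException("Expected an even number of bytes.");

        char[] chars = new char[data.Length / 2];
        for (int i = 0; i < chars.Length; i++)
            chars[i] = (char)(data[2 * i] | (data[2 * i + 1] << 8));
        return new string(chars);
    }
}
```

The code units are decoded by hand rather than through `Encoding.Unicode` because the serialized data is a sequence of arbitrary 16-bit values and may not be well-formed UTF-16 text, which a text decoder could alter.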

EvgeniyKo commented 10 years ago

The first option is not feasible for us: every developer would have to install the Roslyn compiler, I would have to update the build on the build machine, and the testers would have to start smoke testing again. All because of one file.

I really like the second option. Can Antlr4 generate the binary file with the serialized ATN?

sharwell commented 10 years ago

Embedding a binary resource: Not currently supported; it would be a completely new feature requiring changes to the tool, code generation templates, MSBuild integration, and runtime library.

Splitting the serialized ATN into segments: Currently supported by the tool for the Java target, but would require changes to CSharpTarget.java and to the C# code generation templates.

EvgeniyKo commented 10 years ago

How long would it take? Maybe I can help resolve this issue.

EvgeniyKo commented 10 years ago

"Splitting the serialized ATN into segments: Currently supported by the tool for the Java target, but would require changes to CSharpTarget.java and to the C# code generation templates."

As far as I understand from the Java template, it generates an array of strings and then calls Utils.join() on it instead of emitting one concatenated string.

I suppose this solution will work for me.
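
A C# counterpart of that approach could look like the sketch below. The class and field names other than `_serializedATN` are hypothetical, the literals are shortened placeholders taken from the start of the string shown earlier, and the code a real template emits may differ.

```csharp
using System;

public static class SegmentedAtnExample
{
    // Each array element is its own expression, so the number of '+'
    // operators in any single expression stays bounded by the segment size.
    // Declared first: static fields are initialized in textual order.
    private static readonly string[] _serializedATNSegments =
    {
        "\x3\xAF6F\x8320\x479D" +
        "\xB75C\x4880",
        "\x1605\x191C\xAB37",
    };

    // The segments are joined once at type initialization, at run time,
    // instead of being folded into one giant constant by the compiler.
    public static readonly string _serializedATN =
        string.Concat(_serializedATNSegments);
}
```

The rest of the generated class can keep reading `_serializedATN` exactly as before; only the way the string is assembled changes.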

EvgeniyKo commented 10 years ago

I have fixed the issue. Link to the build with the fix: https://drive.google.com/file/d/0B4sUnvtGhlljalhzQktldE1KdW8/edit?usp=sharing