Fix antlr/antlr4#1890 Standalone CR should be recognized as line separator

nixel2007 commented 4 years ago

Hello!

I've made changes to treat stand-alone CR as line separator. This changes were tested on millions of LOC with-out any noticable perfomance regress. Also I use my fork to parse files everyday as a part of static analysis process.

P.S. Link to my PR to original ANTLR4 repo opened half an year ago...

daniellansun commented 4 years ago

The builds on CI servers failed

nixel2007 commented 4 years ago

Hi, @danielsun1106

I see an error in build logs:

testCharSetRange(org.antlr.v4.test.tool.TestLexerExec)  Time elapsed: 0.078 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:<...0]
[@1,4:5='34',<1>,[1:4]
[@2,7:8='a2',<2>,1:7]
[@3,10:12='abc',<2>,1:10]
[@4,18:17='<EOF>',<-1>,2]:3]
> but was:<...0]
[@1,4:5='34',<1>,[2:1]
[@2,7:8='a2',<2>,2:4]
[@3,10:12='abc',<2>,2:7]
[@4,18:17='<EOF>',<-1>,3]:3]
>
    at org.junit.Assert.assertEquals(Assert.java:115)
    at org.junit.Assert.assertEquals(Assert.java:144)
    at org.antlr.v4.test.tool.TestLexerExec.testCharSetRange(TestLexerExec.java:497)

But I don't understand what exactly numbers after the second comma mean. Could you guide me about this a little?

The first number is a token type, i suggest. And the second and the third ones?

daniellansun commented 4 years ago

@nixel2007 I can not remember its meaning either... @sharwell should be able to tell us ;-)

sharwell commented 4 years ago

[@2,7:8='a2',<2>,2:4]

Item	Meaning
`@2`	Token index (this in the third token)
`7:8`	Token span (character position) in the original input
`a2`	Token text
`<2>`	Token type
`2:4`	Starting line:column for the token

Note that I'm not planning to take this change. It should be possible to derive from the existing types in the library if you need to customize the token positioning mechanism. (It would have been better if tokens didn't track this information at all, but it's too late to remove now.)

nixel2007 commented 4 years ago

a-ha, test string contains standalone CR, so the second 34 token is placed to the 2nd line. so, new numbers is correct.

https://github.com/tunnelvisionlabs/antlr4/blob/53920110c84977d854157a54752373596980542c/tool/test/org/antlr/v4/test/tool/TestLexerExec.java#L486

It should be possible to derive from the existing types in the library if you need to customize the token positioning mechanism.

could you point me to this classes/methods? I haven't digged it a lot, but I suppose that if I'll try to change token position somewhere else than LexerATNSimulator#consume, I will have to store line/character offsets and change them for every token rather than one-time change in consume. Not good for RAM neither CPU.

sharwell commented 4 years ago

@nixel2007 LexerATNSimulator.consume is a virtual method. Create your own type derived from that, and override consume to handle this case.

nixel2007 commented 4 years ago

Create your own type derived from that, and override consume to handle this case.

Did as you said, thank you. One small question - is there any option to override default constructor from g4 file? The only way I've found for this is a constructor with another signature:

https://github.com/1c-syntax/bsl-parser/commit/86bbfc5ab44002b7d82228660292722ad2512731#diff-9f09662df115bb4cb45e8ad2afc23bbdR25

sharwell commented 4 years ago

You can generate your grammar as abstract if you wish to provide a different constructor in code. I forget the exact command line syntax but it's handled by this code:

https://github.com/tunnelvisionlabs/antlr4/blob/bdcd934691932f24a4ec43e1ac79479c524862aa/tool/src/org/antlr/v4/tool/Grammar.java#L625-L627

nixel2007 commented 4 years ago

Ok, i'll try it. thank you!

nrmancuso commented 3 years ago

@nixel2007 did you end up successfully overriding the LexerATNSimulator constructor by generating your grammar as abstract to solve this issue? Do you have a working example that you can point me to?

nixel2007 commented 3 years ago

@nmancus1 yep. I've created CRAwareLexerATNSimulator (https://github.com/1c-syntax/bsl-parser/blob/master/src/main/java/com/github/_1c_syntax/bsl/parser/CRAwareLexerATNSimulator.java) and pass it to lexer in members block - https://github.com/1c-syntax/bsl-parser/blob/855158d4e7948326f76cf32b33ebd16c580896be/src/main/antlr/BSLLexer.g4#L29-L35. no need to make grammar abstract.

tunnelvisionlabs / antlr4

Fix antlr/antlr4#1890 Standalone CR should be recognized as line separator #58