Ambiguous comment token definition in ODIN grammar

serefarikan commented 6 years ago

The base patterns file here: https://github.com/openEHR/adl-antlr/blob/master/src/main/antlr/base_patterns.g4 which is used in odin grammar defines comment tokens as follows:

H_CMT_LINE : '--------' '-'? '\n' ; // special type of comment for splitting template overlays CMT_LINE : '--' .? '\n' -> skip ; // (increment line count)

This introduces a problem with antlr 4.7, but interestingly, 4.5 manages to parse input, which I think should not be possible: probably Antlr's smart parsing algorithms are relaxing the rules a bit too much.

Since Antlr considers the declaration of tokens in the grammar files during parsing, the H_CMT_LINE token overrides the CMT_LINE token in the following examples, taken from openEHR RM bmm (CIMI bmms have same types of comments too):

------------------------------------------------------
-- BMM version on which these schemas are based.
------------------------------------------------------
bmm_version = <"2.1">

------------------------------------------------------
-- schema identification
-- (schema_id computed as <rm_publisher>_<schema_name>_<rm_release>)
------------------------------------------------------

The long comments above are meant to be comments and they don't imply the special template overlay comment. However, based on the grammar file above, the lexer is bound to match these long comment lines to H_CMT_LINE which I'd say is the correct behaviour. The result, using antlr 4.7 (the latest version) is that the parse tree never goes beyond the long comment lines.

Changing the order of CMT_LINE and H_CMT_LINE fixes the issue but I'm sure it'd break the behaviour within archetype files where template overlays are expressed.

I'd suggest changing the template overlay comment token to something else. Maybe something like:

H_CMT_LINE : '-/' ''*? '\n'  ;  // special type of comment for splitting template overlays

wolandscat commented 6 years ago

It seems to me that a simpler approach is to move the H_CMT_LINE to one of the ADL Antlr files, because that special long line is only special in between whole ADL texts. So the ODIN Antlr file would go back to just matching with the CMT_LINE rule.

pieterbos commented 6 years ago

I don't think that will work - the lexer rules are the same for all the files and not specific for one of them, since you use Odin within adl. So you can move them but it will not be handled like two different levers.

In fact, having all the lexer rules in one antlr file makes sense because of this.

Unless you have a way in the lexer to know when Odin parts start and end, then you can define different lexer modes.

But perhaps there is some way in the parser to define when the special comment lines are comments and when they are used to separate overlays.

wolandscat commented 6 years ago

Yep, you are right - that won't work in Antlr. We need a solution based on modes. Or else we do what Seref said, and define a different kind of comment line. Maybe an easy one would be:

//------------------------------

serefarikan commented 6 years ago

I guess upper layers of software that go beyond bmm processing (which is what I'm doing) would need to use modes anyway but it is how we know we're supposed to use the mode_x that is the problem, when the token level ambiguity exits. So I'd stick to my different type of comment suggestion. Your suggestion for syntax is just fine Tom.

pieterbos commented 6 years ago

for lexical modes you need to be able to switch modes in the lexer. I don't think you can, because there's no unambiguous token for when Odin starts and ends - not in the lexer. It's not like we have a CDATA tag like in XML or a /* to start a comment. It's why I could not use lexical modes when I tried to unambiguously fix both regular expressions and path parsing, and had to resort to handling them as a single token in Archie.

With some custom target language code it's probably possible - but I think that has some serious drawbacks since you would need a version per target language.

Handling this in the parser would be hard because these comments can appear literally everywhere.

I don't see an easy solution in the current syntax.

serefarikan commented 6 years ago

I could not find an example of the template overlay use case for the special comment token so I'm a bit in the dark re its use in that context but it my ignorant guess is switching to a CDATA-ish or / ... / comment format would help, based on my understanding of your comments @pieterbos

This is a step further than what @wolandscat is suggesting but it sounds like a more future proof approach.

wolandscat commented 6 years ago

Example of a single file template.

The original attraction in using long lines was that they visually separate the overlays nicely and that they are already ADL comments anyway.

pieterbos commented 6 years ago

Oh, some other ways to solve this:

do not require comment lines in the parser. Of course, this means they are no longer required which can be a drawback
treat the comment lines, the whitespace and template_overlay as one single token

I don't think these are very elegant ways though, but both should work.

serefarikan commented 6 years ago

@pieterbos just to clarify: Are you suggesting that the parser uses only template_overlays keyword to identify t. overlays and not use the long comment line in your first suggestion?

openEHR / adl-antlr

Ambiguous comment token definition in ODIN grammar #31