Add symbol lexer rules for more easy tree walking

pieterbos commented 8 years ago

See what i changed to the grammars in https://github.com/nedap/archie/commit/538415ef632d4b0a2309ce5034be0e987d597023

It is rather inconvenient to parse the '<'-characters without having them as named tokens. It would be a good idea to name them.

This applies to the following symbols:

; <

<=

There will probably be more situations when i parse more of the grammar.

wolandscat commented 8 years ago

That is what I used to do in the old grammar, but in the current grammar, the '<' and '>' characters are also the delimiters for ODIN blocks. So if we declare a lex pattern SYM_GT: '>' in the keyword / symbol set, we are going to have problems aren't we? Will the ODIN parser rules that contain '>' still function or not?

Another approach on this is to do what my old grammars did - treat the outer 'ADL' syntax as a shell syntax, containing 'islands' of CADL, ODIN and RULES syntax. I am starting to think I should go back to this design.

ghost commented 8 years ago

Isn't it possible to use another lexer mode when parsing Odin. Op 30 okt. 2015 15:41 schreef "Thomas Beale" notifications@github.com:

That is what I used to do in the old grammar, but in the current grammar, the '<' and '>' characters are also the delimiters for ODIN blocks. So if we declare a lex pattern SYM_GT: '>' in the keyword / symbol set, we are going to have problems aren't we? Will the ODIN parser rules that contain '>' still function or not?

— Reply to this email directly or view it on GitHub https://github.com/openEHR/adl-antlr/issues/19#issuecomment-152543900.

wolandscat commented 8 years ago

yes - that is how it would be done. We would still need to detect the big chunks of ODIN, CADL, and RULES and then feed them to 3 separate parsers / parser modes. But I think it might work out better. I am afraid it will break everyone's code a bit, but it's early days, right? I would like to know from others if you consider this an appropriate thing to do. From a parser design POV, it seems clearer to me, and it should get rid of some ambiguities like the regex one. And there are clearly multiple different syntaxes here.

ghost commented 8 years ago

On 30-10-15 16:27, Thomas Beale wrote:

yes - that is how it would be done. We would still need to detect the big chunks of ODIN, CADL, and RULES and then feed them to 3 separate parsers / parser modes. But I think it might work out better. I am afraid it will break everyone's code a bit, but it's early days, right? I would like to know from others if you consider this an appropriate thing to do. From a parser design POV, it seems clearer to me, and it should get rid of some ambiguities like the regex one. And there are clearly multiple different syntaxes here.

— Reply to this email directly or view it on GitHub https://github.com/openEHR/adl-antlr/issues/19#issuecomment-152555991.

I understand that you want more opinions, I would want them too. But the question is hard to ask on another forum, it is a too long story.

But it is an almost classical case of having more lexer modes, because there are more language-grammars involved in one structure. It is in the book, chapter 12.3: Islands in the Stream, page 221.

As example, the preprocessor which parses a C-file before the compiler does it. And of course, XML and HTML, which has inside-tags-mode and, could have a data-mode (between tags) As a fourth parser you could also distinguish the Regular Expression parser, which we discussed yesterday.

It is also possible to run more parsers separately, split the archetypes before, using external software, but the antlr-parser has the opportunity to do it in one run, that is a performance advantage.

It doesn't matter that code needs to be rewritten, that is normal in such an early stage.

Bert

pieterbos commented 8 years ago

I think the parser is fine as it is, the only thing wrong being the regular expression parser. Not related to this issue.

You added several literals to the parser rules, like '<', '>', '>=' and ';'. What ANTLR does is if it does not find a matching lexer rule, is to create a lexer rule for you, implicitly.

So if you add SYM_GT: '>', all you do is add a name to the lexer rule. The lexer remains exactly the same apart from this name. Odin parsing will remain to work fine as it is, which i tested as well.

ghost commented 8 years ago

Maybe later, when I have time, I create a grammar with lexer modes, just for fun. I think you can get a better listener, but I am not sure, must try it. I let it know what the result is, when I have tried it.

If it works fine as it is now, we should be glad.

If it ain't broke, don't fix it, was the golden rule where I worked 15 years ago.

Best regards Bert Op 30 okt. 2015 23:10 schreef "Pieter Bos" notifications@github.com:

I think the parser is fine as it is, the only thing wrong being the regular expression parser. Not related to this issue.

You added several literals to the parser rules, like '<', '>', '>=' and ';'. What ANTLR does is if it does not find a matching lexer rule, is to create a lexer rule for you, implicitly.

So if you add SYM_GT: '>', all you do is add a name to the lexer rule. The lexer remains exactly the same apart from this name. Odin parsing will remain to work fine as it is, which i tested as well.

— Reply to this email directly or view it on GitHub https://github.com/openEHR/adl-antlr/issues/19#issuecomment-152662133.

wolandscat commented 8 years ago

I have now added lexer rules for SYM_GT etc.

pieterbos commented 8 years ago

great, that fixes this - thanks!

openEHR / adl-antlr

Add symbol lexer rules for more easy tree walking #19

<=