openEHR / adl-antlr

Antrl4 grammars for ADL
Apache License 2.0
1 stars 4 forks source link

interval and list parsing needs quite a lot of code #21

Closed pieterbos closed 8 years ago

pieterbos commented 8 years ago

Because the list values have been fully specified like, they are a bit inconvenient to parse:

primitive_list_value : 
      string_list_value 
    | integer_list_value 
    | real_list_value 
    | boolean_list_value 
    | character_list_value 
    | term_code_list_value
    | date_list_value
    | time_list_value 
    | date_time_list_value 
    | duration_list_value 
    ;

primitive_interval_value :
      integer_interval_value
    | real_interval_value
    | date_interval_value
    | time_interval_value
    | date_time_interval_value
    | duration_interval_value
    ;

A similar pattern applies to cadl. But it leads to the following copy/paste code - and the following file is just for real and integer numbers, the same applies to all other types:

https://github.com/nedap/archie/blob/bfa1e4cfcfeec27100384af323c5fbbb365e2c50/parser/src/main/java/com/nedap/archie/adlparser/NumberConstraintParser.java

So it is quite a lot of repetitive copy/paste code to handle these constructs

You could do instead:

primitive_list_value :  
    primitive_value ( ( ',' primitive_value )+ | ',' SYM_LIST_CONTINUE ) ;
primitive_interval_value: 
    '|' SYM_GT? primitive_value SYM_INTERVAL_SEP SYM_LT? primitive_value 
    '|' relop primitive_value

This allows for reuse of code both in java and the grammar.

it has the drawback of having a bit less validation in the ANTLR-grammar, and needing a bit more validation in the tree walker/listener. But this validation is rather simple:

wolandscat commented 8 years ago

Hm... I'm starting to see the downside of the antlr approach. Consider the original. .y file - there is very little replicated code in here, and the parser provides proper definition and low-level validation of the allowed language syntax. This is because the 'replicated' code is generated by the parser generator.

As soon as the grammar is simplified, the generator has less to work with, and must generate less code, leaving more for the human programmer.

I suspect that the current Antlr code generators are weaker than the very mature generators available for .l/.y grammars, thus forcing the human programmer to do more work. I'm not in principle in favour of watering down a language grammar from being a proper definition to something weaker, in order to compensate for poor code generation. If we do that, then the formal definition of the language is very hard to locate, since it is no longer in one place - it's scattered through the grammar and also various code in various implementations. Which implementation is to be taken as definitional? It also becomes very hard to document...

pieterbos commented 8 years ago

I understand why you would not want to change this and I did implement the full treewalker for this already, so it's no problem.

ANTLr can actually generate more code than i use now (visitor and listener) and I could probably do the treewalking in less code, but the options I have thought of so far were a bit complicated. It's also possible that I just don't know the best option to fix this, not sure. Perhaps @BertVerhees has a suggestion?

wolandscat commented 8 years ago

Right. I don't want to imply we wouldn't ever change things like this, but we need to be conscious of what the possibilities are, and really understanding what the technology can do. Right now I'm relying on you and others to investigate the consequences for the code development side of things, at least until I can catch up.

I think there is quite a lot to collectively learn on Antrl4.x and it's pretty new. I think we have to consider that the current phase is experimental for a while. I am currently working on a branch to separate out the grammars, as I think that will clarify things better. We'll see if it works.

ghost commented 8 years ago

On 02-11-15 13:05, Pieter Bos wrote:

I understand why you would not want to change this and I did implement the full treewalker for this already, so it's no problem.

ANTLr can actually generate more code than i use now (visitor and listener) and I could probably do the treewalking in less code, but the options I have thought of so far were a bit complicated. It's also possible that I just don't know the best option to fix this, not sure. Perhaps @BertVerhees https://github.com/BertVerhees has a suggestion?

— Reply to this email directly or view it on GitHub https://github.com/openEHR/adl-antlr/issues/21#issuecomment-152999329.

I think, you don't need a visitor, unless you need it ;-)

I don't know for what you use archetype, I use them for validating RM-objects, I did until now in XML-instances (Schematron), it is (not so) easy :-( to walk through an AOM tree and create an Schematron, but it works.

The parser hands over the listener tree, but in fact, you only could be in need of one parser method.

I am working on a grammar, for a project (that is why I am so busy, and under pressure), nothing to do with OpenEHR, but most Grammars have a Top-Object, and everything below are children, it is a Tree. So if you split out the different Grammars, then you have three tree's, with three Top-objects. Every top-object has two listeners, one Entry-listener, and one Exit listener.

On Exit the whole tree is ready. So use the Exit Top listener object.

For my purpose, it is like this, (I do it my way, maybe it is better to do in another way, I am happy to learn)

public class Antlr4MyListenerextends MyBaseListener {

 MyParser.CompilationUnitContextopCompilationUnit;
 MyParser.CompilationUnitContextctxCompilationUnit;
 OPGenerator.OPCompilationUnitContextopCompilationUnitContext;

 public OPGenerator.OPCompilationUnitContext getOpCompilationUnit() {
     return opCompilationUnitContext;
 }

 @Override public void exitCompilationUnit(@NotNull MyParser.CompilationUnitContext ctx)  {
     try {
         opCompilationUnitContext =new OPGenerator.OPCompilationUnitContext(null, ctx);
     }catch (InvocationTargetException e) {
         e.printStackTrace();
     }catch (NoSuchMethodException e) {
         e.printStackTrace();
     }catch (InstantiationException e) {
         e.printStackTrace();
     }catch (IllegalAccessException e) {
         e.printStackTrace();
     }
 }

}

The returned context is the full tree of that Grammar.

You see, the ParserContext is given as parameter to the top-object of my OPGenerator.OPCompilationUnitContext (This should be the Archetype-object in AOM)

The OPGenerator looks very much he same as the parser, it has a class for every Parser-class (this is because the AOM has a class for every rule) The AOM and Parser should mirror each other. So you could generate the AOM from the grammar too.

I explain.

public static class OPCompilationUnitContextextends OPParserRuleContext { public OPCompilationUnitContext(OPParserRuleContext parent, ParserRuleContext current)throws InvocationTargetException, NoSuchMethodException, InstantiationException, IllegalAccessException { super(parent, current); setToken(((MyParser.CompilationUnitContext)current).EOF()); OPGenerator.addChild(this, ((MyParser.CompilationUnitContext) current).someDeclaration(), OPSomeDeclarationContext.class); } @Override public int getRuleIndex() { return MyParser.RULE_compilationUnit; } }

The OPCompilationUnitContext and the OPGenerator.OPSomeDeclarationContext could be AOM classes, mirroring the Parser-classes Just like the Parser, I have all these classes in one file, they only have small classes/rules, most is only giving through to the children. (in my case)

The OPGenerator.addChild does the distribution-work.

It is like this

public static void addChild(OPParserRuleContext parent, Object lCtx, Class clazz)throws IllegalAccessException, InvocationTargetException, InstantiationException, NoSuchMethodException { if(lCtx==null) return; if(lCtxinstanceof List){ for (Object aLCtx : (List)lCtx) { Constructor constructor = clazz.getConstructor(OPParserRuleContext.class, ParserRuleContext.class); parent.addChild((OPParserRuleContext)constructor.newInstance(parent, aLCtx)); } }else{ Constructor constructor = clazz.getConstructor(OPParserRuleContext.class, ParserRuleContext.class); parent.addChild((OPParserRuleContext)constructor.newInstance(parent, lCtx)); } }

It has this as parameter, because, if we need a parent (you never know), and the child object is created from the Parser-object. OPSomeDeclaration()

Then it creates the OPGenerator.OPSomeDeclarationContext and returns it to "this" as child, and in OPGenerator.OPSomeDeclarationContext the process is repeated but for other classes further in the tree.

Take a look at the Parser, you could, as said, write a routine to generate the AOM-classes, if your grammar is conforming the AOM.

In this way, you do not touch the generated Parser (never touch generated code), you have most of your code generated, also the AOM, and

the grammar is leading it all.

So, the AOM is build, and conform the rules, in the same way, a Schematron validator could be build for XML instances, and even, I read somewhere on the internet, Java-classes, representing a single archetype can be generated, I don't know if it is possible, but that could speed up the kernel.

Please remember, I am happy to learn, if you have better ways, I like to hear about it.

Best regards Bert

pieterbos commented 8 years ago

Bert, thanks for your explanation.

Thomas, I checked your .y file - there you add the tree walking/object model generation by adding code after every element. You have the exact same code duplication as I do, just in a different place. You could do the same approach with the tree walking and grammar in the same file in ANTLR as you did in Yacc.

I talked to a colleague who has a bit more ANTLR-experience than I do. He sees benefits and drawbacks in both approaches and he said it's mostly a matter of taste what you pick.

So I guess it's a matter of taste, and I think it's safe for me to close this issue :)

ghost commented 8 years ago

On 02-11-15 14:51, Pieter Bos wrote:

Bert, thanks for your explanation.

Thomas, I checked your .y file - there you add the tree walking/object model generation by adding code after every element. You have the exact same code duplication as I do, just in a different place. You could do the same approach with the tree walking and grammar in the same file in ANTLR as you did in Yacc.

I talked to a colleague who has a bit more ANTLR-experience than I do. He sees benefits and drawbacks in both approaches and he said it's mostly a matter of taste what you pick.

So I guess it's a matter of taste, and I think it's safe for me to close this issue :)

— Reply to this email directly or view it on GitHub https://github.com/openEHR/adl-antlr/issues/21#issuecomment-153022367.

When thinking it over, I don 't think it is possible to create a AOM by grammar, so I correct myself

Bert