Open pieterbos opened 9 years ago
There is a remaining error in rule design to do with regexes; this will be fixed soon.
Did a quick test of your regular expression changes. I don't know if it was meant to fix most cases already, but testing is just a minute of work with the tests of my work-in-progress parser. . It does not seem to parse the following:
archetype_id/value matches {/.*/}
It does however parse:
archetype_id/value matches {/openEHR-EHR-CLUSTER\.specimen\.v1/}
The messages:
trying to parse adl2-tests/features/specialisation/openEHR-EHR-OBSERVATION.ordering_parent
line 62:40 token recognition error at: '.*'
line 62:42 extraneous input '/' expecting {'(', ')', '=', '[', ']', '{', '}', ',', '*', ':', '-', '^', '!=', '<=', '<', '>=', '>', '+', '_', '\.', '\', '|', ROOT_ID_CODE, ID_CODE, AT_CODE, AC_CODE, DATE_CONSTRAINT_PATTERN, TIME_CONSTRAINT_PATTERN, DATE_TIME_CONSTRAINT_PATTERN, DURATION_CONSTRAINT_PATTERN, SYM_ARCHETYPE, SYM_TEMPLATE_OVERLAY, SYM_TEMPLATE, SYM_OPERATIONAL_TEMPLATE, SYM_SPECIALIZE, SYM_LANGUAGE, SYM_DESCRIPTION, SYM_DEFINITION, SYM_RULES, SYM_TERMINOLOGY, SYM_ANNOTATIONS, SYM_COMPONENT_TERMINOLOGIES, SYM_ADL_VERSION, SYM_RM_RELEASE, SYM_IS_CONTROLLED, SYM_IS_GENERATED, SYM_UID, SYM_BUILD_UID, SYM_EXISTENCE, SYM_OCCURRENCES, SYM_CARDINALITY, SYM_ORDERED, SYM_UNORDERED, SYM_UNIQUE, SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, SYM_INCLUDE, SYM_EXCLUDE, SYM_AFTER, SYM_BEFORE, SYM_CLOSED, SYM_THEN, SYM_AND, SYM_OR, SYM_XOR, SYM_NOT, SYM_IMPLIES, SYM_FOR_ALL, SYM_EXISTS, SYM_MATCHES, '...', '..', WS, '
', '--------------------*', CMT_LINE, ISO8601_DATE, ISO8601_TIME, ISO8601_DATE_TIME, ISO8601_DURATION, SYM_TRUE, SYM_FALSE, ARCHETYPE_HRID, ARCHETYPE_REF, ARCHETYPE_HRID_ROOT, VERSION_ID, NAMESPACE, GUID, ALPHA_UC_ID, ALPHA_LC_ID, TERM_CODE_REF, URI, INTEGER, REAL, STRING, CHARACTER, ';'}
line 67:24 missing '}' at 'value'
line 67:48 no viable alternative at input '-EHR'
line 67:93 mismatched input '_' expecting {'=', '/', '*', '-', '^', '!=', '<=', '<', '>=', '>', '+'}
line 68:10 extraneous input 'exclude' expecting {'}', SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, ALPHA_UC_ID}
line 69:40 token recognition error at: '.*'
line 69:42 extraneous input '/' expecting {'(', ')', '=', '[', ']', '{', '}', ',', '*', ':', '-', '^', '!=', '<=', '<', '>=', '>', '+', '_', '\.', '\', '|', ROOT_ID_CODE, ID_CODE, AT_CODE, AC_CODE, DATE_CONSTRAINT_PATTERN, TIME_CONSTRAINT_PATTERN, DATE_TIME_CONSTRAINT_PATTERN, DURATION_CONSTRAINT_PATTERN, SYM_ARCHETYPE, SYM_TEMPLATE_OVERLAY, SYM_TEMPLATE, SYM_OPERATIONAL_TEMPLATE, SYM_SPECIALIZE, SYM_LANGUAGE, SYM_DESCRIPTION, SYM_DEFINITION, SYM_RULES, SYM_TERMINOLOGY, SYM_ANNOTATIONS, SYM_COMPONENT_TERMINOLOGIES, SYM_ADL_VERSION, SYM_RM_RELEASE, SYM_IS_CONTROLLED, SYM_IS_GENERATED, SYM_UID, SYM_BUILD_UID, SYM_EXISTENCE, SYM_OCCURRENCES, SYM_CARDINALITY, SYM_ORDERED, SYM_UNORDERED, SYM_UNIQUE, SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, SYM_INCLUDE, SYM_EXCLUDE, SYM_AFTER, SYM_BEFORE, SYM_CLOSED, SYM_THEN, SYM_AND, SYM_OR, SYM_XOR, SYM_NOT, SYM_IMPLIES, SYM_FOR_ALL, SYM_EXISTS, SYM_MATCHES, '...', '..', WS, '
', '--------------------*', CMT_LINE, ISO8601_DATE, ISO8601_TIME, ISO8601_DATE_TIME, ISO8601_DURATION, SYM_TRUE, SYM_FALSE, ARCHETYPE_HRID, ARCHETYPE_REF, A
followed by more errors due to no multiline-string support yet.
Directly before logging these errors, ANTLR also logs that it attempts full context to my ANTLRErrorListener. Haven't checked what it means yet.
I don't see any more regular expression parse errors for now. But I have not done a full test, for example i have not tested regular expressions with for non-western characters.
This problem is not yet fixed; the current 'fix' is an unreliable band-aid - I am still researching the best solution for this one.
I know of some people in our company who have experience with ANTLR. I'll try asking them for suggestions.
I do think this is a language construct that is hard to parse the way it is now. Also because the following ADL seems to be valid and has a use valid case, but is not a regular expression:
definition
OBSERVATION[id1.1] matches { /data[id2]/events[id7]/data[id4]/items matches { /data[id5]/value matches {
A rather ugly solution would be to create a preprocessor that matches all regular expressions with a line-based scanner, then replace all the '/'-characters at the beginning and end with '^'. Only then run the lexer.
The current rules for regex are a nasty hack; they are in the cadl_primitives.g4 file, and are currently as follows:
// for REGEX: strip first and last char, and then process with PCRE grammar
c_string: ( string_value | string_list_value | regex_constraint ) assumed_string_value? ;
assumed_string_value: ';' string_value ;
regex_constraint: '/' regex1 '/' | '^' regex2 '^' ;
regex1: ( '_' | '.*' | '\\.' | '\\/' | ~'/' )+ ; // TODO: not clear why first 3 matches are needed, but they work.
regex2: ( '_' | '.*' | '\\.' | '\\/' | ~'^' )+ ;
The remaining problem is that if you introduce Lexer rules for the regex part, a regex expression can have almost anything in it, and the lexer will then start matching other strings of characters that turn up between '/' characters, i.e. in paths.
I originally posted to the Antlr google group on this, asking if there was a way to use the fact that the previous token was a '{' to conditionally turn the regex matching rule on; I didn't get any clear answer.
One idea is simply to say that regexes in ADL2 have to be between some character other than '/', e.g. the alternate '^' character included in the current grammar. This means I have to change the ADL Workbench to convert all the existing regexes in slots to ^^ form rather than // form; probably not hard, and maybe that's what we should do. But I have this nagging feeling something is wrong - matching regexes correctly was dead easy in my old yacc/lex grammars in Eiffel. Antlr is difficult for solving some simple problems!
Ah, but i know a way to possibly do that. It's called Lexical Mode. See:
https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules#LexerRules-LexicalModes
You can use the three commands mode, popMode and pushMode like you use the skip command that you use on whitespace and comments, like so:
TokenName : «alternative» -> command-name TokenName : «alternative» -> command-name («identifier or integer»)
Then you can specify lexer rules to only exist in a specific mode. So if you do something like:
LPARENT: '{' -> mode('regexp_allowed')
You can switch the lexer to a mode where it allows regular expressions, when it normally does not. Then two questions remain:
Yep I know about lexical mode. I think it won't help here because there is no way to know when to enter the 'regex allowed' mode. The rule you propose will enter it just because there is a '{', but that symbol is ubiquitous in ADL, and pretty much every possible CADL element can come after it. So that implies that the 'regex allowed' mode is pretty much all of the CADL rules as they now are.
I couldn't see a way to define such a rule that would actually work, and also keep the grammar comprehensible. Happy to be proved wrong!
Maybe it is an idea to start the another lexical mode by {/ and end it by /}?
I don't know if that can work or not - note that the '{' matching is done inline in the parser rules, which is how I think it should be. If we introduce a lexer pattern for '{/' how does it interact with that other matching?
On 28-10-15 15:00, Thomas Beale wrote:
I don't know if that can work or not - note that the '{' matching is done inline in the parser rules, which is how I think it should be. If we introduce a lexer pattern for '{/' how does it interact with that other matching?
— Reply to this email directly or view it on GitHub https://github.com/openEHR/adl-antlr/issues/7#issuecomment-151854497.
I have it from the book The Definitive ANTLR4 Reference by Terence Parr. Page 223 where the issue of XML is used as example. In XML, text inside the tags must be parsed different from text between the tags., it is in fact the same problem, and because in ADL (1.4) regular expressions were always between {/ /}, it could work.
The only thing I have not tried is if it is possible to change the lexer mode with two characters instead of one (as in the book example), but, when they are tokenized, I think it could work
I wonder, has it been tried, changing the lexer mode when {/ or /} occurs? Or is it not a viable solution?
You can always say, that if it is not possible for a grammar to distinguish a mode for certainty, that something is wrong with that grammar.
There is a rule that if more then one lexer rule match the same input sequence, the priority goes to the first occuring rule. Another rule says that the lexer recognizes the most input characters
So if you have the token {/ defined above the token {
It is easy to check by creating a small test grammar.
grammar Test; r : TEXT | LEFT_CURLY_PLUS_SLASH TEXT SLASH_PLUS_RIGHT_CURLY | LEFT_CURLY TEXT RIGHT_CURLY ; LEFT_CURLY_PLUS_SLASH: '{/' ; LEFT_CURLY: '{' ; SLASH_PLUS_RIGHT_CURLY: '/}' ; RIGHT_CURLY: '}' ; TEXT:[a-z]+ ; WS: [ \r\t\n]+ -> skip;
You can see that it recognizes {/ as a different token from {
{/abc/}
[@0,0:1='{/',<1>,1:0]
[@1,2:4='abc',<5>,1:2]
[@2,5:6='/}',<3>,1:5]
[@3,8:7='
{abc}
[@0,0:0='{',<2>,1:0]
[@1,1:3='abc',<5>,1:1]
[@2,4:4='}',<4>,1:4]
[@3,6:5='
Here the rule does not exist, and it reports an error, but parses the rest.
{{/abc}
[@0,0:0='{',<2>,1:0]
[@1,1:2='{/',<1>,1:1]
[@2,3:5='abc',<5>,1:3]
[@3,6:6='}',<4>,1:6]
[@4,8:7='
Here also
{/{abc/}
[@0,0:1='{/',<1>,1:0]
[@1,2:2='{',<2>,1:2]
[@2,3:5='abc',<5>,1:3]
[@3,6:7='/}',<3>,1:6]
[@4,9:8='
I just fixed it with a rather simple pragmatich approach: match { / REGEXP ASSUMED_VALUE? / } as one token in the lexer, plus optional whitespace. Now all you have to do is parse that token in your treewalker. I did the parsing with a very small separate ANTLR-grammar. Solution is in:
https://github.com/nedap/archie/commit/aaab8ea6f919622d197e22aa3de742ca26517339
I implemented tests, and I could not break it, although I did find https://github.com/openEHR/adl-antlr/issues/20 (unrelated to regular expressions, but prevents me from writing more tests).
This is simple and small and it covers all cases. You could even do it in the parser or lexer if you allow java/target language code in your grammar. Other possible solutions that will work:
Why matching '{\' will not work without custom java code in your lexer:
apart from that you have to match '{ WS* /' instead of '{/', the following valid ADL is the reason:
TYPE[id1] matches {/start/test matches {/this should work/}}
'/start/test' is a path, '/this should work/' is a regexp.
So you have to look ahead to ahead of the next / before you can determine it's a regular expression or a path. And you cannot do that with lexical modes.
There is no regular expression lexer rule present. So if you try to parse for example
You get lexer errors: