lexer errors on regular expressions

pieterbos commented 9 years ago

There is no regular expression lexer rule present. So if you try to parse for example

            allow_archetype OBSERVATION[id2] occurrences matches {0..1} matches {   -- Vital signs
                include
                    archetype_id/value matches {/openEHR-EHR-OBSERVATION\.exam(-a-zA-Z0-9_]+)*\.v1/}
                exclude
                    archetype_id/value matches {/openEHR-EHR-OBSERVATION\.blood_pressure(-a-zA-Z0-9_]+)*\.v1/}
            }

You get lexer errors:

trying to parse adl2-tests/validity/slots/openEHR-EHR-SECTION.VDSEV_slot_include_not_any_exclude_not_any.v1.adls
line 30:74 token recognition error at: '_'
line 32:84 token recognition error at: '_'

wolandscat commented 9 years ago

There is a remaining error in rule design to do with regexes; this will be fixed soon.

pieterbos commented 9 years ago

Did a quick test of your regular expression changes. I don't know if it was meant to fix most cases already, but testing is just a minute of work with the tests of my work-in-progress parser. . It does not seem to parse the following:

archetype_id/value matches {/.*/}

It does however parse:

archetype_id/value matches {/openEHR-EHR-CLUSTER\.specimen\.v1/}

The messages:

 trying to parse adl2-tests/features/specialisation/openEHR-EHR-OBSERVATION.ordering_parent
line 62:40 token recognition error at: '.*'
line 62:42 extraneous input '/' expecting {'(', ')', '=', '[', ']', '{', '}', ',', '*', ':', '-', '^', '!=', '<=', '<', '>=', '>', '+', '_', '\.', '\', '|', ROOT_ID_CODE, ID_CODE, AT_CODE, AC_CODE, DATE_CONSTRAINT_PATTERN, TIME_CONSTRAINT_PATTERN, DATE_TIME_CONSTRAINT_PATTERN, DURATION_CONSTRAINT_PATTERN, SYM_ARCHETYPE, SYM_TEMPLATE_OVERLAY, SYM_TEMPLATE, SYM_OPERATIONAL_TEMPLATE, SYM_SPECIALIZE, SYM_LANGUAGE, SYM_DESCRIPTION, SYM_DEFINITION, SYM_RULES, SYM_TERMINOLOGY, SYM_ANNOTATIONS, SYM_COMPONENT_TERMINOLOGIES, SYM_ADL_VERSION, SYM_RM_RELEASE, SYM_IS_CONTROLLED, SYM_IS_GENERATED, SYM_UID, SYM_BUILD_UID, SYM_EXISTENCE, SYM_OCCURRENCES, SYM_CARDINALITY, SYM_ORDERED, SYM_UNORDERED, SYM_UNIQUE, SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, SYM_INCLUDE, SYM_EXCLUDE, SYM_AFTER, SYM_BEFORE, SYM_CLOSED, SYM_THEN, SYM_AND, SYM_OR, SYM_XOR, SYM_NOT, SYM_IMPLIES, SYM_FOR_ALL, SYM_EXISTS, SYM_MATCHES, '...', '..', WS, '
', '--------------------*', CMT_LINE, ISO8601_DATE, ISO8601_TIME, ISO8601_DATE_TIME, ISO8601_DURATION, SYM_TRUE, SYM_FALSE, ARCHETYPE_HRID, ARCHETYPE_REF, ARCHETYPE_HRID_ROOT, VERSION_ID, NAMESPACE, GUID, ALPHA_UC_ID, ALPHA_LC_ID, TERM_CODE_REF, URI, INTEGER, REAL, STRING, CHARACTER, ';'}
line 67:24 missing '}' at 'value'
line 67:48 no viable alternative at input '-EHR'
line 67:93 mismatched input '_' expecting {'=', '/', '*', '-', '^', '!=', '<=', '<', '>=', '>', '+'}
line 68:10 extraneous input 'exclude' expecting {'}', SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, ALPHA_UC_ID}
line 69:40 token recognition error at: '.*'
line 69:42 extraneous input '/' expecting {'(', ')', '=', '[', ']', '{', '}', ',', '*', ':', '-', '^', '!=', '<=', '<', '>=', '>', '+', '_', '\.', '\', '|', ROOT_ID_CODE, ID_CODE, AT_CODE, AC_CODE, DATE_CONSTRAINT_PATTERN, TIME_CONSTRAINT_PATTERN, DATE_TIME_CONSTRAINT_PATTERN, DURATION_CONSTRAINT_PATTERN, SYM_ARCHETYPE, SYM_TEMPLATE_OVERLAY, SYM_TEMPLATE, SYM_OPERATIONAL_TEMPLATE, SYM_SPECIALIZE, SYM_LANGUAGE, SYM_DESCRIPTION, SYM_DEFINITION, SYM_RULES, SYM_TERMINOLOGY, SYM_ANNOTATIONS, SYM_COMPONENT_TERMINOLOGIES, SYM_ADL_VERSION, SYM_RM_RELEASE, SYM_IS_CONTROLLED, SYM_IS_GENERATED, SYM_UID, SYM_BUILD_UID, SYM_EXISTENCE, SYM_OCCURRENCES, SYM_CARDINALITY, SYM_ORDERED, SYM_UNORDERED, SYM_UNIQUE, SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, SYM_INCLUDE, SYM_EXCLUDE, SYM_AFTER, SYM_BEFORE, SYM_CLOSED, SYM_THEN, SYM_AND, SYM_OR, SYM_XOR, SYM_NOT, SYM_IMPLIES, SYM_FOR_ALL, SYM_EXISTS, SYM_MATCHES, '...', '..', WS, '
', '--------------------*', CMT_LINE, ISO8601_DATE, ISO8601_TIME, ISO8601_DATE_TIME, ISO8601_DURATION, SYM_TRUE, SYM_FALSE, ARCHETYPE_HRID, ARCHETYPE_REF, A

followed by more errors due to no multiline-string support yet.

Directly before logging these errors, ANTLR also logs that it attempts full context to my ANTLRErrorListener. Haven't checked what it means yet.

pieterbos commented 9 years ago

I don't see any more regular expression parse errors for now. But I have not done a full test, for example i have not tested regular expressions with for non-western characters.

wolandscat commented 9 years ago

This problem is not yet fixed; the current 'fix' is an unreliable band-aid - I am still researching the best solution for this one.

pieterbos commented 9 years ago

I know of some people in our company who have experience with ANTLR. I'll try asking them for suggestions.

I do think this is a language construct that is hard to parse the way it is now. Also because the following ADL seems to be valid and has a use valid case, but is not a regular expression:

definition
    OBSERVATION[id1.1] matches {   /data[id2]/events[id7]/data[id4]/items matches { /data[id5]/value matches {

A rather ugly solution would be to create a preprocessor that matches all regular expressions with a line-based scanner, then replace all the '/'-characters at the beginning and end with '^'. Only then run the lexer.

wolandscat commented 9 years ago

The current rules for regex are a nasty hack; they are in the cadl_primitives.g4 file, and are currently as follows:

// for REGEX: strip first and last char, and then process with PCRE grammar
c_string: ( string_value | string_list_value | regex_constraint ) assumed_string_value? ;
assumed_string_value: ';' string_value ;
regex_constraint: '/' regex1 '/' | '^' regex2 '^' ;
regex1: ( '_' | '.*' | '\\.' | '\\/' | ~'/' )+ ; // TODO: not clear why first 3 matches are needed, but they work.
regex2: ( '_' | '.*' | '\\.' | '\\/' | ~'^' )+ ;

The remaining problem is that if you introduce Lexer rules for the regex part, a regex expression can have almost anything in it, and the lexer will then start matching other strings of characters that turn up between '/' characters, i.e. in paths.

I originally posted to the Antlr google group on this, asking if there was a way to use the fact that the previous token was a '{' to conditionally turn the regex matching rule on; I didn't get any clear answer.

One idea is simply to say that regexes in ADL2 have to be between some character other than '/', e.g. the alternate '^' character included in the current grammar. This means I have to change the ADL Workbench to convert all the existing regexes in slots to ^^ form rather than // form; probably not hard, and maybe that's what we should do. But I have this nagging feeling something is wrong - matching regexes correctly was dead easy in my old yacc/lex grammars in Eiffel. Antlr is difficult for solving some simple problems!

pieterbos commented 9 years ago

Ah, but i know a way to possibly do that. It's called Lexical Mode. See:

https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules#LexerRules-LexicalModes

You can use the three commands mode, popMode and pushMode like you use the skip command that you use on whitespace and comments, like so:

TokenName : «alternative» -> command-name TokenName : «alternative» -> command-name («identifier or integer»)

Then you can specify lexer rules to only exist in a specific mode. So if you do something like:

LPARENT: '{' -> mode('regexp_allowed')

You can switch the lexer to a mode where it allows regular expressions, when it normally does not. Then two questions remain:

When do you switch back to the default mode? regexpes can occur directly, or within a tuple, so after anything that is not a regular expression or the start of a tuple?
In templates, you can type a path directly after a 'matches {', and it's a path, not a regular expression. How to solve the possible ambiguity there?

wolandscat commented 9 years ago

Yep I know about lexical mode. I think it won't help here because there is no way to know when to enter the 'regex allowed' mode. The rule you propose will enter it just because there is a '{', but that symbol is ubiquitous in ADL, and pretty much every possible CADL element can come after it. So that implies that the 'regex allowed' mode is pretty much all of the CADL rules as they now are.

I couldn't see a way to define such a rule that would actually work, and also keep the grammar comprehensible. Happy to be proved wrong!

ghost commented 9 years ago

Maybe it is an idea to start the another lexical mode by {/ and end it by /}?

wolandscat commented 9 years ago

I don't know if that can work or not - note that the '{' matching is done inline in the parser rules, which is how I think it should be. If we introduce a lexer pattern for '{/' how does it interact with that other matching?

ghost commented 9 years ago

On 28-10-15 15:00, Thomas Beale wrote:

I don't know if that can work or not - note that the '{' matching is done inline in the parser rules, which is how I think it should be. If we introduce a lexer pattern for '{/' how does it interact with that other matching?

— Reply to this email directly or view it on GitHub https://github.com/openEHR/adl-antlr/issues/7#issuecomment-151854497.

I have it from the book The Definitive ANTLR4 Reference by Terence Parr. Page 223 where the issue of XML is used as example. In XML, text inside the tags must be parsed different from text between the tags., it is in fact the same problem, and because in ADL (1.4) regular expressions were always between {/ /}, it could work.

The only thing I have not tried is if it is possible to change the lexer mode with two characters instead of one (as in the book example), but, when they are tokenized, I think it could work

ghost commented 9 years ago

I wonder, has it been tried, changing the lexer mode when {/ or /} occurs? Or is it not a viable solution?

You can always say, that if it is not possible for a grammar to distinguish a mode for certainty, that something is wrong with that grammar.

There is a rule that if more then one lexer rule match the same input sequence, the priority goes to the first occuring rule. Another rule says that the lexer recognizes the most input characters

So if you have the token {/ defined above the token {

It is easy to check by creating a small test grammar.

grammar Test; r : TEXT | LEFT_CURLY_PLUS_SLASH TEXT SLASH_PLUS_RIGHT_CURLY | LEFT_CURLY TEXT RIGHT_CURLY ; LEFT_CURLY_PLUS_SLASH: '{/' ; LEFT_CURLY: '{' ; SLASH_PLUS_RIGHT_CURLY: '/}' ; RIGHT_CURLY: '}' ; TEXT:[a-z]+ ; WS: [ \r\t\n]+ -> skip;

You can see that it recognizes {/ as a different token from { {/abc/} [@0,0:1='{/',<1>,1:0] [@1,2:4='abc',<5>,1:2] [@2,5:6='/}',<3>,1:5] [@3,8:7='',<-1>,2:0]

{abc} [@0,0:0='{',<2>,1:0] [@1,1:3='abc',<5>,1:1] [@2,4:4='}',<4>,1:4] [@3,6:5='',<-1>,2:0]

Here the rule does not exist, and it reports an error, but parses the rest. {{/abc} [@0,0:0='{',<2>,1:0] [@1,1:2='{/',<1>,1:1] [@2,3:5='abc',<5>,1:3] [@3,6:6='}',<4>,1:6] [@4,8:7='',<-1>,2:0] line 1:1 extraneous input '{/' expecting TEXT

Here also {/{abc/} [@0,0:1='{/',<1>,1:0] [@1,2:2='{',<2>,1:2] [@2,3:5='abc',<5>,1:3] [@3,6:7='/}',<3>,1:6] [@4,9:8='',<-1>,2:0] line 1:2 extraneous input '{' expecting TEXT

pieterbos commented 9 years ago

I just fixed it with a rather simple pragmatich approach: match { / REGEXP ASSUMED_VALUE? / } as one token in the lexer, plus optional whitespace. Now all you have to do is parse that token in your treewalker. I did the parsing with a very small separate ANTLR-grammar. Solution is in:

https://github.com/nedap/archie/commit/aaab8ea6f919622d197e22aa3de742ca26517339

I implemented tests, and I could not break it, although I did find https://github.com/openEHR/adl-antlr/issues/20 (unrelated to regular expressions, but prevents me from writing more tests).

This is simple and small and it covers all cases. You could even do it in the parser or lexer if you allow java/target language code in your grammar. Other possible solutions that will work:

Add a preprocessor which rewrites the regexpes starting and ending with '^', before the lexer
Manipulate the token stream between the lexer and the parser, and add some extra tokens in java
write custom lexer code in java(target language) in your grammar
redefine ADL so that regular expressions always start with a '^'

Why matching '{\' will not work without custom java code in your lexer:

apart from that you have to match '{ WS* /' instead of '{/', the following valid ADL is the reason:

TYPE[id1] matches {/start/test matches {/this should work/}}

'/start/test' is a path, '/this should work/' is a regexp.

So you have to look ahead to ahead of the next / before you can determine it's a regular expression or a path. And you cannot do that with lexical modes.

openEHR / adl-antlr

lexer errors on regular expressions #7