openEHR / adl-antlr

Antrl4 grammars for ADL
Apache License 2.0
1 stars 4 forks source link

Time constraint parsing gives errors #14

Closed pieterbos closed 8 years ago

pieterbos commented 8 years ago
    time_attr1 matches {hh:mm:ss}
        time_attr2 matches {hh:mm:??}
        time_attr3 matches {hh:mm:XX}
        time_attr4 matches {hh:??:??}
        time_attr5 matches {hh:??:XX}

gives errors:

[Test worker] INFO com.nedap.archie.adlparser.LargeSetOfADLsTest - trying to parse adl2-tests/features/aom_structures/basic/openEHR-TEST_PKG-WHOLE.assumed_values.v1.adls
[Test worker] WARN com.nedap.archie.adlparser.ADLErrorListener - FULL CONTEXT: 135-135, alts: {2, 5}
[Test worker] WARN com.nedap.archie.adlparser.ADLErrorListener - FULL AMBIGUITY: 135-135, exact: false
line 59:22 no viable alternative at input 'hh:mm:ss;'
line 60:22 no viable alternative at input 'hh:mm:XX;'
line 61:22 no viable alternative at input 'hh:??:XX;'
line 62:22 no viable alternative at input 'hh:??:??;'

I haven't checked the token stream, but the ambiguity warning at token 135 could be a cause.

wolandscat commented 8 years ago

So far I cannot discover a reason for this problem - the Antlr debugging tools are extremely primitive, so I may be missing something obvious.

However, I noticed that the date and date/time patterns parse correctly, i.e.

THING[id4] matches {
    date_time_attr1 matches {yyyy-mm-XXThh:mm:ss}
    date_attr1 matches {yyyy-mm-??}
}

This shows that the time pattern 'hh:mm:ss' is described properly by the lex rules, since the date/time constraint pattern (the first one) is specified by a rule that re-uses the time constraint pattern.

I also noticed that the time pattern parses if it commences with capital 'HH' or has a space at the end, i.e.

THING[id4] matches {
    time_attr1 matches {HH:mm:ss}  // pass
    time_attr1 matches {hh:mm:ss }  // pass
    time_attr1 matches {hh:mm:ss}   // FAIL
}

I cannot tell what the parser is seeing in the last line, although for some reason it is running past the 'ss' and including the '}'. The error I get is:

line 4:24 mismatched input 'hh:mm:ss}' expecting {'[', '/', '-', '^', '+', '|', DATE_CONSTRAINT_PATTERN, TIME_CONSTRAINT_PATTERN, DATE_TIME_CONSTRAINT_PATTERN, DURATION_CONSTRAINT_PATTERN, SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, SYM_AFTER, SYM_BEFORE, ISO8601_DATE, ISO8601_TIME, ISO8601_DATE_TIME, ISO8601_DURATION, SYM_TRUE, SYM_FALSE, ALPHA_UC_ID, INTEGER, REAL, STRING}

t says here that it can't match TIME_CONSTRAINT_PATTERN, but that must be because the scanner has read 'hh:mm:ss}' instead of just 'hh:mm:ss'. I don't understand this behaviour.

pieterbos commented 8 years ago

I can't check now, but you can run the antlr runtime with the -tokens parameter, and it will print the token stream from the lexer. Then you can lookup the symbols corresponding to the numbers in the output with the adl.tokens file that is generated and see what is wrong.i think you are right about an ambiguity in the lexer.

I'll check on Monday if you haven't already.

pieterbos commented 8 years ago

Looking at the lexer rules - the URI rule matches the time pattern, as in, it will not match if it starts with a capital letter but will match with a lowercase 'hh' at the start.

wolandscat commented 8 years ago

Aha - yes that's it - I had forgotten the URI pattern was defined as more or less 'anything'. I've replaced it with a reasonable proper URI scanner rule set. It appears to correctly scan some URIs that I have tested it with and also fix the problem reported here.

Thanks for the testing - I think we are getting close to something.

pieterbos commented 8 years ago

I integrated your latest modifications in https://github.com/nedap/archie . Looks like it works very well, the time patterns parse, and so do the URIs.

There's one URI that did not parse, but it contains invalid characters for an URL. A webbrowser however will usually convert them to URL-encoded characters: http://air93.org/MZN-SIF-Dihanje-Kašelj in the file openEHR-EHR-OBSERVATION.value_set_binding.v1.adls

I don't think that needs fixing, although I'm not sure what the specification says about the URI syntax.

pieterbos commented 8 years ago

This is definately fixed, The URI looks fine as it is.

wolandscat commented 8 years ago

We probably need to upgrade the URI spec and grammar rules to allow UTF-8 chars, but I think we might leave this to a separate issue.