Closed GoogleCodeExporter closed 8 years ago
This is a limitation of the JDK regex compiler, which does not support \i, and
uses \c for control characters rather than XML name characters.
The workaround is to use a different SchemaRegexCompiler (method
setRegExCompiler()). We cannot fix the issue in the JDK regex compiler,
obviously.
If you stay with the JDK regex compiler, the workaround is to expand these
shorthands by hand (ASCII-only approximations):
\i = [_:A-Za-z]
\c = [-._:A-Za-z0-9]
For the example given, leave out the colon, of course.
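
A minimal sketch of the manual expansion, in case it helps. The class name and the sample pattern (\i\c*) are made up for illustration and are not the pattern from the original report. The expansions are the ASCII-only ones listed above.

```java
import java.util.regex.Pattern;

public class ShorthandExpansion {
    // ASCII-only stand-ins for the XSD shorthands (the XML definitions
    // also cover non-ASCII name characters, which these do not).
    static final String NAME_START = "[_:A-Za-z]";      // stands in for \i
    static final String NAME_CHAR  = "[-._:A-Za-z0-9]"; // stands in for \c

    // Hypothetical schema pattern \i\c* written out with the expansions.
    static final Pattern NAME = Pattern.compile(NAME_START + NAME_CHAR + "*");

    // The same pattern with the colon left out of both classes.
    static final Pattern NO_COLON_NAME =
            Pattern.compile("[_A-Za-z][-._A-Za-z0-9]*");

    public static void main(String[] args) {
        System.out.println(NAME.matcher("xs:element").matches());            // true
        System.out.println(NO_COLON_NAME.matcher("xs:element").matches());   // false
        System.out.println(NO_COLON_NAME.matcher("my-element.1").matches()); // true
        System.out.println(NAME.matcher("1abc").matches());                  // false
    }
}
```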
The only regex compiler currently supplied is the one included with the JDK.
Since we cannot fix this in the JDK, I've changed the type from 'defect' to
'enhancement': the solution will be to develop a full regex compiler (or to
front-end the JDK one, which is an interesting exercise in recursion) and plug
it in through the alternate regex compiler mechanism. I've also lowered the
priority, as we're unlikely to have the cycles to address this soon; consider
it a known limitation of using the JDK regex compiler.
Original comment by aale...@gmail.com
on 3 Oct 2011 at 6:06
Oops.
Also, the JDK does not support the W3C character class subtraction syntax. If
you use [a-z-[aeiou]], you're going to match something rather different from
'ascii consonants'. Instead, the JDK reads the nested brackets as a union, so
you'll match any one of a-z or -, vowels included.
So there are at least two schema regex constructs that are not safe for use.
Original comment by aale...@gmail.com
on 3 Oct 2011 at 6:19
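
To see what the JDK does with the W3C subtraction syntax, here is a small check. Going by the documented java.util.regex behaviour, a bracket nested inside a class is a union, so the pattern compiles without complaint and the vowels are not removed. The class name is made up for illustration.

```java
import java.util.regex.Pattern;

public class SubtractionPitfall {
    // XSD-style subtraction, handed to the JDK compiler unchanged.
    static final Pattern XSD_SYNTAX = Pattern.compile("[a-z-[aeiou]]");

    public static void main(String[] args) {
        System.out.println(XSD_SYNTAX.matcher("b").matches()); // true
        System.out.println(XSD_SYNTAX.matcher("a").matches()); // true: the vowel was not subtracted
        System.out.println(XSD_SYNTAX.matcher("-").matches()); // true: the hyphen became a member
    }
}
```

The silent success is the dangerous part: nothing fails at compile time, the facet just accepts more than the schema author intended.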
An alternative solution would be to replace the unsupported character class
with an equivalent, supported expression. That way, we'd only have to check
for the existence of certain character classes, e.g. '\i', and then replace the
bits we don't understand with bits that we do.
Original comment by joe.bays...@gmail.com
on 3 Oct 2011 at 6:23
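
A rough sketch of what such a replacement pass might look like, assuming only \i and \c need rewriting and using the ASCII-only expansions from earlier in this thread. ShorthandRewriter and rewrite() are hypothetical names, not part of the project's API, and the sketch does not attempt character class subtraction.

```java
import java.util.regex.Pattern;

public class ShorthandRewriter {
    // ASCII-only class members; the hyphen is escaped so the members can be
    // spliced into an existing class without forming an accidental range.
    private static final String I_BODY = "_:A-Za-z";
    private static final String C_BODY = "\\-._:A-Za-z0-9";

    /** Rewrites \i and \c; every other escape is copied through untouched. */
    static String rewrite(String xsdRegex) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // greater than 0 while inside [...]
        for (int k = 0; k < xsdRegex.length(); k++) {
            char ch = xsdRegex.charAt(k);
            if (ch == '\\' && k + 1 < xsdRegex.length()) {
                char esc = xsdRegex.charAt(++k);
                if (esc == 'i' || esc == 'c') {
                    String body = (esc == 'i') ? I_BODY : C_BODY;
                    // Inside a class, splice the members in; outside, wrap them.
                    out.append(depth > 0 ? body : "[" + body + "]");
                } else {
                    out.append(ch).append(esc); // e.g. \d or an escaped backslash
                }
                continue;
            }
            if (ch == '[') depth++;
            else if (ch == ']' && depth > 0) depth--;
            out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String jdkRegex = rewrite("\\i\\c*");
        System.out.println(jdkRegex); // [_:A-Za-z][\-._:A-Za-z0-9]*
        System.out.println(Pattern.matches(jdkRegex, "xs:element")); // true
    }
}
```

Walking the string rather than using a blind search-and-replace keeps an escaped backslash followed by a literal i from being mistaken for \i.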
The JDK has a different syntax for subtraction. Ex:
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)
You'd have to parse and translate from the W3C syntax to the JDK one:
- constructs like \i into [_:\p{L}] (notice Unicode letters are supported in
the JDK)
- subtractions in sets, e.g. [a-z-[aeiou]] into [a-z&&[^aeiou]]
Original comment by anli.shu...@gmail.com
on 3 Oct 2011 at 6:29
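
For reference, a short sketch of the JDK forms mentioned above. The class and field names are made up; the [_:\p{L}] class is only an approximation of \i, not the exact XML definition.

```java
import java.util.regex.Pattern;

public class JdkSubtraction {
    // W3C [a-z-[aeiou]] expressed as a JDK intersection with a negated class.
    static final Pattern CONSONANT = Pattern.compile("[a-z&&[^aeiou]]");

    // Any letter except an uppercase letter.
    static final Pattern NOT_UPPER = Pattern.compile("[\\p{L}&&[^\\p{Lu}]]");

    // Approximation of \i that also accepts non-ASCII letters.
    static final Pattern NAME_START = Pattern.compile("[_:\\p{L}]");

    public static void main(String[] args) {
        System.out.println(CONSONANT.matcher("b").matches());       // true
        System.out.println(CONSONANT.matcher("a").matches());       // false
        System.out.println(NOT_UPPER.matcher("\u00e9").matches());  // true  (lowercase e-acute)
        System.out.println(NOT_UPPER.matcher("\u00c9").matches());  // false (uppercase E-acute)
        System.out.println(NAME_START.matcher("\u00e9").matches()); // true
    }
}
```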
Note that Xerces also does full XML Schema validation, so we should check
whether they have the same problem, or whether they've already got the code to
work around the issue.
Original comment by eric%tib...@gtempaccount.com
on 3 Oct 2011 at 6:39
Xerces almost certainly isn't "validating" text nodes in schema by converting
them to "atomic types", which is what we do.
The issue here is that the schema spec says "there is a thing called the
post-schema-validation infoset, which includes a concept of a 'value space'.
Validation happens in the value space, not the syntax space." In order to
support this ... [censored] hand-waving in the spec, a validator--including a
schema parser, which implicitly validates each schema--has to implement a value
space, which, since it isn't going to be the value space of the programming
language in which the parser/validator is written, is going to require creation
of new types to represent each of the proliferating "primitive" types in the
spec.
Now ... in *my* opinion, the validation of the pattern facet shouldn't happen
"in the value space"--the pattern facet is a lexical facet, anyway. So we
should be just checking the syntax, not compiling the damned regexes. Leave
the problem of compiling to the validator, not the parser. That's at odds with
the philosophy of the parser, though, which validates values and facets by
putting them into the "value space" and checking to see whether an exception
was thrown.
Original comment by aale...@gmail.com
on 3 Oct 2011 at 6:56
It appears that the code to fix this already exists, but was disabled when the
code went from proprietary to open source. Unfortunately, the working code is
in the validator, which depends on the parser (which means that the parser
cannot depend upon the validator).
The solution is going to involve moving the regular expression API and
implementations to somewhere where the parser can see them (without preventing
the validator from seeing and using them). Just exactly how to do that is open
to question. They could be stuffed into the parser, or made into a separate
module, or even stuffed into the bridgekit. Some options are less attractive
than others. The idea of having a separate module for handling regular
expressions is slightly awkward, for instance: it suggests that regular
expressions have more importance than they properly do. Likewise, bridge
developers really aren't likely to want this functionality, so putting it in
bridgekit seems odd. Stuffing it into the parser also seems awkward, since it's
mostly a validation function.
Changed back to defect, and raised the priority, with apologies to the original
reporter.
Original comment by aale...@gmail.com
on 3 Oct 2011 at 7:58
This issue was closed by revision r277.
Original comment by aale...@gmail.com
on 3 Oct 2011 at 8:32
Fixed at r277. Moved the remaining regex stuff into the parser; recreated the
xsdl-specific regex compiler; made it the default.
Original comment by aale...@gmail.com
on 3 Oct 2011 at 8:32
Original issue reported on code.google.com by
abokhan%...@gtempaccount.com
on 3 Oct 2011 at 5:42