sshyran / genxdm

Automatically exported from code.google.com/p/genxdm

<xsd:pattern value="[\i-[:]][\c-[:]]*"/> cannot be parsed by the schema parser #69

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Create a schema with the following element definition:
   <xsd:element maxOccurs="1" minOccurs="1" name="id">
      <xsd:simpleType>
         <xsd:restriction base="xsd:string">
            <xsd:pattern value="[\i-[:]][\c-[:]]*"/>
         </xsd:restriction>
      </xsd:simpleType>
   </xsd:element>

2. Try to parse it with W3cXmlSchemaParser

What is the expected output?

ComponentBag.

What do you see instead?

Caused by: org.genxdm.processor.w3c.xs.exception.sm.SmAttributeUseException: cvc-complex-type.3.1: The attribute, 'value', is not valid with respect to its attribute use.
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.pattern(XMLSchemaConverter.java:1943)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.computePatterns(XMLSchemaConverter.java:503)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertSimpleType(XMLSchemaConverter.java:1490)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertType(XMLSchemaConverter.java:1564)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertType(XMLSchemaConverter.java:1584)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertElement(XMLSchemaConverter.java:1054)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertElementUse(XMLSchemaConverter.java:1142)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertModelGroup(XMLSchemaConverter.java:1331)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertModelGroupUse(XMLSchemaConverter.java:1382)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.effectiveContent(XMLSchemaConverter.java:1725)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertContentType(XMLSchemaConverter.java:830)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertComplexType(XMLSchemaConverter.java:789)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convertTypes(XMLSchemaConverter.java:1599)
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.convert(XMLSchemaConverter.java:1984)
                at org.genxdm.processor.w3c.xs.impl.XMLParserImpl.convert(XMLParserImpl.java:152)
                at org.genxdm.processor.w3c.xs.impl.XMLParserImpl.parse(XMLParserImpl.java:96)
                ... 23 more
Caused by: org.genxdm.xs.exceptions.SimpleTypeException: The initial value '[\i-[:]][\c-[:]]*' is not valid with respect to the simple type definition '{unknown}'.
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.pattern(XMLSchemaConverter.java:1936)
                ... 38 more
Caused by: org.genxdm.xs.exceptions.DatatypeException: cvc-datatype-valid.?: The literal '[\i-[:]][\c-[:]]*' is not datatype-valid with respect to the datatype definition 'null'.
                at org.genxdm.processor.w3c.xs.impl.XMLSchemaConverter.pattern(XMLSchemaConverter.java:1935)
                ... 38 more

What version of the product are you using? On what operating system?

0.9, windows 7 x64

Please provide any additional information below.

Joe: I think the issue is that GenXDM's regex compiler doesn't understand "\i" 
(and may not understand "\c").  I believe that lack of understanding is a 
defect because "\i" is part of the schema spec.  

Original issue reported on code.google.com by abokhan%...@gtempaccount.com on 3 Oct 2011 at 5:42

GoogleCodeExporter commented 8 years ago
This is a limitation of the JDK regex compiler, which does not support \i, and 
uses \c for control characters rather than XML name characters.

The workaround is to use a different SchemaRegexCompiler (method 
setRegExCompiler()). We cannot fix the issue in the JDK regex compiler, 
obviously.

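If you stay with the JDK regex compiler, the workaround is to expand these 
shorthands by hand (ASCII-only approximations; the schema definitions also 
cover non-ASCII name characters):

\i = [_:A-Za-z]
\c = [-._:A-Za-z0-9]

For the example given, leave out the colon, of course, since the pattern 
subtracts it.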
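
A minimal sketch of that expansion, using only java.util.regex from the JDK. 
The class name, method names and sample values are made up for illustration, 
and the expanded pattern is the ASCII-only approximation, not the full XML 
name character set.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class ExpandedPatternDemo {
    // Pattern from the report, exactly as written in the schema.
    static final String XSD_PATTERN = "[\\i-[:]][\\c-[:]]*";

    // Hand expansion: the two shorthands replaced by ASCII ranges, colon left out.
    static final String EXPANDED = "[_A-Za-z][-._A-Za-z0-9]*";

    // True if java.util.regex accepts the expression at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Schema patterns must match the whole lexical value, so use matches(), not find().
    static boolean isValidId(String value) {
        return Pattern.matches(EXPANDED, value);
    }

    public static void main(String[] args) {
        System.out.println("original compiles: " + compiles(XSD_PATTERN));
        System.out.println("expanded compiles: " + compiles(EXPANDED));
        for (String value : new String[] {"my_id-1", "ns:id", "1id"}) {
            System.out.println(value + " -> " + isValidId(value));
        }
    }
}
```

The expansion trades correctness for convenience: names that start with or 
contain non-ASCII letters are rejected even though the schema pattern allows 
them.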

The only regex compiler currently supplied is the one included with the JDK.  
Since we cannot fix this in the JDK, I've changed the type from 'defect' to 
'enhancement': the solution will be to develop a full regex compiler (or to 
front-end the JDK one, which is an interesting exercise in recursion) and to 
plug it in through the alternate regex compiler workaround. I've also lowered 
the priority, as we're unlikely to have the cycles to address this soon; 
consider it a known limitation of using the JDK regex compiler.

Original comment by aale...@gmail.com on 3 Oct 2011 at 6:06

GoogleCodeExporter commented 8 years ago
Oops.

Also, the JDK does not support the schema syntax for character class 
subtraction.  If you use the syntax [a-z-[aeiou]], you're going to match 
something rather different than 'ascii consonants'. The JDK reads the nested 
brackets as a union, so you'll match the character class containing a-z, -, 
and a, e, i, o, u: any lowercase ASCII letter or a hyphen, vowels included.

So there are at least two schema regex constructs that are not safe for use.
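
A small check of how java.util.regex treats the schema-style expression, next 
to the JDK's own spelling of the same intent (intersection with a negated 
class). The class and field names are made up for illustration. As far as I 
know, java.util.regex reads a nested bracket as a union, so the first pattern 
still accepts vowels.

```java
import java.util.regex.Pattern;

final class SubtractionDemo {
    // Schema-style subtraction, handed to the JDK unchanged.
    static final Pattern XSD_STYLE = Pattern.compile("[a-z-[aeiou]]");

    // JDK spelling of the same intent: a-z intersected with "not a vowel".
    static final Pattern JDK_STYLE = Pattern.compile("[a-z&&[^aeiou]]");

    public static void main(String[] args) {
        // A consonant, a vowel, and a hyphen show where the two readings differ.
        for (String s : new String[] {"b", "a", "-"}) {
            System.out.println(s
                    + " xsd-style=" + XSD_STYLE.matcher(s).matches()
                    + " jdk-style=" + JDK_STYLE.matcher(s).matches());
        }
    }
}
```

The dangerous part is that the schema-style expression compiles without 
complaint, so the mismatch only shows up as wrong validation results.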

Original comment by aale...@gmail.com on 3 Oct 2011 at 6:19

GoogleCodeExporter commented 8 years ago
An alternative solution would be to replace each unsupported character class 
with an equivalent, supported expression.  That way, we'd only have to check 
for the existence of certain character classes, e.g. '\i', and then replace the 
bits we don't understand with bits that we do.

Original comment by joe.bays...@gmail.com on 3 Oct 2011 at 6:23

GoogleCodeExporter commented 8 years ago
The JDK has a different syntax for subtraction. For example:

[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)

You'd have to parse and translate from the W3C syntax to the JDK one:
- constructs like \i into [_:\p{L}] (notice that Unicode letters are supported 
in the JDK)
- subtractions in sets
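
A rough sketch of such a translation, assuming a single left-to-right pass is 
enough. The class name, the method toJdk and the two expansion constants are 
hypothetical, and the expansions are simplified: the real XML name character 
sets are larger than "letters" and "letters plus digits". Other differences 
between the two dialects (for instance ^ and $ being ordinary characters in 
schema regexes) are not handled.

```java
import java.util.regex.Pattern;

final class XsdRegexSketch {
    // Simplified stand-ins for the schema shorthands (assumption, not the spec's exact sets).
    private static final String INITIAL = "_:\\p{L}";
    private static final String NAME = "-._:\\p{L}\\p{Nd}";

    // Rewrites the two name shorthands and "-[...]" subtraction into JDK syntax.
    static String toJdk(String xsd) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many character classes we are currently inside
        for (int i = 0; i < xsd.length(); i++) {
            char ch = xsd.charAt(i);
            if (ch == '\\' && i + 1 < xsd.length()) {
                char esc = xsd.charAt(++i);
                if (esc == 'i' || esc == 'c') {
                    String body = (esc == 'i') ? INITIAL : NAME;
                    // Inside a class, splice the members in; outside, wrap them in brackets.
                    out.append(depth > 0 ? body : "[" + body + "]");
                } else {
                    out.append('\\').append(esc); // pass other escapes through untouched
                }
            } else if (ch == '-' && depth > 0
                    && i + 1 < xsd.length() && xsd.charAt(i + 1) == '[') {
                // Subtraction becomes intersection with the complement of the inner class.
                boolean negated = i + 2 < xsd.length() && xsd.charAt(i + 2) == '^';
                out.append(negated ? "&&[" : "&&[^");
                i += negated ? 2 : 1; // skip the '[' (and the '^' if present)
                depth++;
            } else {
                if (ch == '[') depth++;
                if (ch == ']') depth--;
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String jdk = toJdk("[\\i-[:]][\\c-[:]]*");
        System.out.println(jdk);
        System.out.println(Pattern.matches(jdk, "my_id-1"));
        System.out.println(Pattern.matches(jdk, "ns:id"));
    }
}
```

A real front end would want a proper parser for the schema regex grammar 
rather than a character scan, but the scan is enough to show the two 
rewrites on the pattern from the report.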

Original comment by anli.shu...@gmail.com on 3 Oct 2011 at 6:29

GoogleCodeExporter commented 8 years ago
Note that Xerces also does full XML Schema validation, so we should check 
whether they have the same problem, or whether they've already got the code to 
work around the issue.

Original comment by eric%tib...@gtempaccount.com on 3 Oct 2011 at 6:39

GoogleCodeExporter commented 8 years ago
Xerces almost certainly isn't "validating" text nodes in schema by converting 
them to "atomic types", which is what we do.

The issue here is that the schema spec says "there is a thing called the 
post-schema-validation infoset, which includes a concept of a 'value space'. 
Validation happens in the value space, not the syntax space."  In order to 
support this ... [censored] hand-waving in the spec, a validator--including a 
schema parser, which implicitly validates each schema--has to implement a value 
space, which, since it isn't going to be the value space of the programming 
language in which the parser/validator is written, is going to require creation 
of new types to represent each of the proliferating "primitive" types in the 
spec.

Now ... in *my* opinion, the validation of the pattern facet shouldn't happen 
"in the value space"--the pattern facet is a lexical facet, anyway.  So we 
should be just checking the syntax, not compiling the damned regexes.  Leave 
the problem of compiling to the validator, not the parser.  That's at odds with 
the philosophy of the parser, though, which validates values and facets by 
putting them into the "value space" and checking to see whether an exception 
was thrown.

Original comment by aale...@gmail.com on 3 Oct 2011 at 6:56

GoogleCodeExporter commented 8 years ago
It appears that the code to fix this already exists, but was disabled when the 
code went from proprietary to open source.  Unfortunately, the working code is 
in the validator, which depends on the parser (which means that the parser 
cannot depend upon the validator).

The solution is going to involve moving the regular expression API and 
implementations to somewhere the parser can see them (without preventing the 
validator from seeing and using them).  Exactly how to do that is open to 
question.  They could be stuffed into the parser, made into a separate module, 
or even stuffed into the bridgekit.  Some options are less attractive than 
others.  A separate module for handling regular expressions is slightly 
awkward, for instance: it suggests that regular expressions have more 
importance than they properly do.  Likewise, bridge developers really aren't 
likely to want this functionality, so putting it in bridgekit seems odd.  
Stuffing it into the parser also seems awkward, since it's mostly a validation 
function.

Changed back to defect, and raised the priority, with apologies to the original 
reporter.

Original comment by aale...@gmail.com on 3 Oct 2011 at 7:58

GoogleCodeExporter commented 8 years ago
This issue was closed by revision r277.

Original comment by aale...@gmail.com on 3 Oct 2011 at 8:32

GoogleCodeExporter commented 8 years ago
Fixed at r277.  Moved the remaining regex stuff into the parser; recreated the 
xsdl-specific regex compiler; made it the default.

Original comment by aale...@gmail.com on 3 Oct 2011 at 8:32