jakartaee / jaxb-api

BSD 3-Clause "New" or "Revised" License
58 stars 42 forks source link

Unmarshalling hexBinary with leading whitespaces yields `null` #312

Open netmikey opened 3 months ago

netmikey commented 3 months ago

We have the following situation:

Our XSD defines an element as hexBinary as follows:

<xs:element name="atr" type="xs:hexBinary">
</xs:element>

We got XML documents that contain this (line break intended!):

...
  <atr>
    6f1087c8105312e302e30820408a00005001588306312e3021000000a5049f1c611a4d02156501ff</atr>
...

Note that some XML editor probably introduced the carriage return / newline and formatting spaces within the atr element in an attempt to "pretty print" the file.

When we validate this file against the XSD, the validation passes.

However, when we unmarshal it, the atr byte array that should contain the data is set to null. While debugging, we found that the HexBinaryAdapter which is generated into the Java bindings like this:

    @XmlElement(required = true, type = String.class)
    @XmlJavaTypeAdapter(HexBinaryAdapter.class)
    @XmlSchemaType(name = "hexBinary")
    protected byte[] atr;

... obtains the full value of the element, including the line breaks and the leading spaces.

https://github.com/jakartaee/jaxb-api/blob/748c50bb8f71b4687febca5400fa9c574644aac3/api/src/main/java/jakarta/xml/bind/annotation/adapters/HexBinaryAdapter.java#L29-L32

It does not trim those non-hex characters and passes the value on to DatatypeConverter#parseHexBinary, which seems to assume only hex characters:

https://github.com/jakartaee/jaxb-api/blob/748c50bb8f71b4687febca5400fa9c574644aac3/api/src/main/java/jakarta/xml/bind/DatatypeConverterImpl.java#L437-L443

This IllegalArgumentException gets thrown and then somewhere down the line apparently dropped in favor of simply setting the field to null.

The inconvenient part here is that the JAXB unmarshalling doesn't seem to be compliant with the XSD validation: our validation passes but the application still fails afterwards because of this.

antoniosanct commented 3 months ago

@netmikey This snippet works in jaxb-api

@Test
public void testHexBinary() {
    final String originStr = "6f1087c8105312e302e30820408a00005001588306312e3021000000a5049f1c611a4d02156501ff";
        //\u006f\u0010\ufffd\ufffd\u0010\u0053\u0012\ufffd\u0002\ufffd\u0008\u0020\u0040\ufffd\u0000\u0000\u0050\u0001\u0058\ufffd\u0006\u0031\u002e\u0030\u0021\u0000\u0000\u0000\ufffd\u0004\ufffd\u001c\u0061\u001a\u004d\u0002\u0015\u0065\u0001\ufffd
    final byte[] decoded = DatatypeConverter.parseHexBinary(originStr);
    final String encoded = DatatypeConverter.printHexBinary(decoded);
    Assert.assertEquals(originStr, encoded.toLowerCase());
}

Maybe is a real bug in eclipse-ee4j/jaxb-ri ??

Regards, Antonio.

netmikey commented 3 months ago

@antoniosanct

Not sure I was able to make my point clear...

To have your test reproduce what actually happens at runtime, it should test HexBinaryAdapter (this in turn calls DataTypeConcerter) and most importantly include the newline and whitespace content of the element like this (binary content shortened for readability):

@Test
public void testHexBinary() {
    final String originStr = "\n    6f1087c8105312e302e30820408a00005001588306312e";
    // the following fails:
    final byte[] decoded = new HexBinaryAdapter().unmarshal(originStr);
}
antoniosanct commented 3 months ago

@netmikey My point is there is no problem about hex binary conversion. Although you maybe think that hex binary adapter should trim the input string before unmarshalling?

All characters included between node tags, including non-visible characters, are valid characters for hex binary conversion.

@lukasj has a better top-down vision about JAXB api. He maybe can help you about the problem.

Regards, Antonio.

netmikey commented 3 months ago

All characters included between node tags, including non-visible characters, are valid characters for hex binary conversion.

I'm afraid they're not:

@Test
public void testHexBinaryConversion() {
    // OK
    new jakarta.xml.bind.DatatypeConverterImpl().parseHexBinary("1234567890abcdef");
    // java.lang.IllegalArgumentException: hexBinary needs to be even-length
    new jakarta.xml.bind.DatatypeConverterImpl().parseHexBinary("\n1234567890abcdef");
    // java.lang.IllegalArgumentException: contains illegal character for hexBinary
    new jakarta.xml.bind.DatatypeConverterImpl().parseHexBinary("\n   1234567890abcdef");
}
antoniosanct commented 3 months ago

@netmikey You're right about non-visible characters, my mistake! However, I've my doubts about if jaxb-api have to clean up such characters. The solution is easy: set a trimmed string in HexBinaryAdapter.

Regards, Antonio.

netmikey commented 3 months ago

What do you mean by "set a trimmed string in HexBinaryAdapter"? I'm not calling HexBinaryAdapter manually. I'm just using the library and expect it to unmarshal an XML document without errors.

Something like:

JAXBContext context = JAXBContext.newInstance(Book.class);
context.createUnmarshaller().unmarshal(new FileReader("./book.xml"));
antoniosanct commented 3 months ago

@netmikey You don't! But you could maybe create a PR fixing this issue at this project!

Regards, Antonio.

netmikey commented 3 months ago

Oh, yes, of course. Where do you think trimming the hex binary string would make the most sense then? Since this particular behavior seems to be hexBinary-specific, my first impulse would be adding it in HexBinaryAdapter before passing it to DatatypeConverter.

Would you be open for such a PR?