lucmoreau / ProvToolbox

Java toolkit to create and convert W3C PROV data model representations, and build provenance-enabled applications in a variety of programming languages (java, python, typescript, javascript)
Other
75 stars 42 forks source link

Uninformative MarshalExceptions because of invalid XML characters #89

Closed moxious closed 10 years ago

moxious commented 10 years ago

I've been seeing a lot of exceptions of this form.

It would appear there's an invalid XML character somewhere, possibly >, <, or &. This exception though makes it difficult to track down exactly where it is, or what can be done about it. Because I'm using the model classes and ProvFactory, it also seems this shouldn't be possible; in the event that I put a value in that isn't valid XML (e.g. an "Other" whose value is something like "x > y") then shouldn't the process of serializing this as XML (as opposed to other formats) handle the encoding specifics of XML for me, and know to replace > with &gt;?

So it would be helpful to wrap such exceptions with contextual information about what exactly failed to marshal, and perhaps do more checking on valid values when the model is being created (prior to marshalling) to avoid this coming up.

Exception in thread "main" javax.xml.bind.MarshalException
 - with linked exception:
[com.sun.istack.SAXException2: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. 
org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. ]
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:323)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:249)
    at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:116)
    at org.openprovenance.prov.xml.ProvSerialiser.serialiseDocument(ProvSerialiser.java:121)
    at org.mitre.provenance.plusobject.prov.PROVConverter.asXMLString(PROVConverter.java:429)
    at org.mitre.provenance.test.TestPROV.checkPROV(TestPROV.java:39)
    at org.mitre.provenance.test.TestPROV.dumpPROV(TestPROV.java:33)
    at org.mitre.provenance.test.TestPROV.main(TestPROV.java:45)
lucmoreau commented 10 years ago

how did you construct your Other value?

Luc

On 04/24/2014 01:44 PM, moxious wrote:

I've been seeing a lot of exceptions of this form.

It would appear there's an invalid XML character somewhere, possibly

, <, or &. This exception though makes it difficult to track down exactly where it is, or what can be done about it. Because I'm using the model classes and ProvFactory, it also seems this shouldn't be possible; in the event that I put a value in that isn't valid XML (e.g. an "Other" whose value is something like "x > y") then shouldn't the process of serializing this as XML (as opposed to other formats) handle the encoding specifics of XML for me, and know to replace > with >?

So it would be helpful to wrap such exceptions with contextual information about what exactly failed to marshal, and perhaps do more checking on valid values when the model is being created (prior to marshalling) to avoid this coming up.

Exception in thread "main" javax.xml.bind.MarshalException

  • with linked exception: [com.sun.istack.SAXException2: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. ] at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:323) at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:249) at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:116) at org.openprovenance.prov.xml.ProvSerialiser.serialiseDocument(ProvSerialiser.java:121) at org.mitre.provenance.plusobject.prov.PROVConverter.asXMLString(PROVConverter.java:429) at org.mitre.provenance.test.TestPROV.checkPROV(TestPROV.java:39) at org.mitre.provenance.test.TestPROV.dumpPROV(TestPROV.java:33) at org.mitre.provenance.test.TestPROV.main(TestPROV.java:45)

— Reply to this email directly or view it on GitHub https://github.com/lucmoreau/ProvToolbox/issues/89.Web Bug from https://github.com/notifications/beacon/142275__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcxMzk2MjY5OCwiZGF0YSI6eyJpZCI6MzA3MDY1ODV9fQ==--c6eb507d19d4e907279678d044065e7cca4e7c66.gif

Professor Luc Moreau Electronics and Computer Science tel: +44 23 8059 4487 University of Southampton fax: +44 23 8059 2865 Southampton SO17 1BJ email: l.moreau@ecs.soton.ac.uk United Kingdom http://www.ecs.soton.ac.uk/~lavm

moxious commented 10 years ago

I have many of them done in several different ways; but often I'm using a method like this for testing:

    /**
     * Make a single object property into an "Other" statement.
     * @param name property name
     * @param value property value
     * @param prefix ns prefix
     * @return
     */
    protected Other makeObjectProperty(String name, Object value, String prefix) {
        QualifiedName nameType = this.name.XSD_STRING;      

        if(value instanceof Date) {
            nameType = this.name.XSD_DATETIME;

            // Needs to be valid XML date format.
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
            value = sdf.format((Date)value);
        } else if(value instanceof Integer) {
            nameType = this.name.XSD_INTEGER;
        } else if(value instanceof Double) {
            nameType = this.name.XSD_DOUBLE;
        } else {
            value = ""+value;  // Coerce to string
        }

        return factory.newOther(BASE_NAMESPACE, name, prefix, value, nameType);
    } // End makeObjectProperty
moxious commented 10 years ago

This makeObjectProperty method is getting called repetitively as the graph nodes I have contain metadata items in a HashMap<String,Object> - keys end up as names, values as values to this method. But to be clear: I am not even sure that this method of created Other objects is what's causing the problem - the exception doesn't really provide any insight. So part of the issue is first narrowing down where this is even happening.

lucmoreau commented 10 years ago

Have you got an example of value you pass to makeObjectProperty and that makes jaxb serialization raise an exception?

On 04/24/2014 02:12 PM, moxious wrote:

I have many of them done in several different ways; but often I'm using a method like this for testing:

 /**
  * Make a single object property into an "Other" statement.
  * @param name property name
  * @param value property value
  * @param prefix ns prefix
  * @return
  */
 protected Other makeObjectProperty(String name, Object value, String prefix) {
     QualifiedName nameType = this.name.XSD_STRING;

     if(value instanceof Date) {
         nameType = this.name.XSD_DATETIME;

         // Needs to be valid XML date format.
         SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
         value = sdf.format((Date)value);
     } else if(value instanceof Integer) {
         nameType = this.name.XSD_INTEGER;
     } else if(value instanceof Double) {
         nameType = this.name.XSD_DOUBLE;
     } else {
         value = ""+value;  // Coerce to string
     }

     return factory.newOther(BASE_NAMESPACE, name, prefix, value, nameType);
 } // End makeObjectProperty

— Reply to this email directly or view it on GitHub https://github.com/lucmoreau/ProvToolbox/issues/89#issuecomment-41277759.Web Bug from https://github.com/notifications/beacon/142275__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcxMzk2NDMzNiwiZGF0YSI6eyJpZCI6MzA3MDY1ODV9fQ==--b0b26b08244a37910263ad45c07408cf1efda8eb.gif

Professor Luc Moreau Electronics and Computer Science tel: +44 23 8059 4487 University of Southampton fax: +44 23 8059 2865 Southampton SO17 1BJ email: l.moreau@ecs.soton.ac.uk United Kingdom http://www.ecs.soton.ac.uk/~lavm

moxious commented 10 years ago

I am not sure that it is makeObjectProperty() that is doing this. That is part of the problem: the exception provides no hints at what is invalid. This method (makeObjectProperty()) is just my guess, but I have inspected it output, and attempted to replace invalid characters, to no effect.

Yes, I have examples but they are not simple source files I can provide. They are larger workflows within a repository we manage.

lucmoreau commented 10 years ago

As the stack traces shows, the problem occurs in JAXB serialization. I am unclear of what can be done in ProvToolbox: suggestions welcome!

There are many ways of failing serialization: QNames/QualifiedNames may not be syntactically correct. The Object value in TypedValue is expected to be syntactically valid for xml serialization. To satisfying a validating serializer, the value should be compative with the type ...

http://openprovenance.org/java/site/0_5_0/apidocs/org/openprovenance/prov/model/TypedValue.html

Luc

On 04/24/2014 02:37 PM, moxious wrote:

This makeObjectProperty method is getting called repetitively as the graph nodes I have contain metadata items in a HashMap - keys end up as names, values as values to this method. But to be clear: I am not even sure that this method of created Other objects is what's causing the problem - the exception doesn't really provide any insight. So part of the issue is first narrowing down where this is even happening.

— Reply to this email directly or view it on GitHub https://github.com/lucmoreau/ProvToolbox/issues/89#issuecomment-41280524.Web Bug from https://github.com/notifications/beacon/142275__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcxMzk2NTg3MywiZGF0YSI6eyJpZCI6MzA3MDY1ODV9fQ==--ab7a96e939415f835839330e78fcdd138bd3e0e9.gif

Professor Luc Moreau Electronics and Computer Science tel: +44 23 8059 4487 University of Southampton fax: +44 23 8059 2865 Southampton SO17 1BJ email: l.moreau@ecs.soton.ac.uk United Kingdom http://www.ecs.soton.ac.uk/~lavm

moxious commented 10 years ago

I'm not familiar with how the serialization proceeds, I can't offer recommendations there.

However, when you first create the model instances (such as QualifiedNames) there should be some checking at that point. If a developer creates a QualifiedName object with invalid characters in them, it should either fix those characters (if possible) or throw an exception because the QualifiedName instance itself is invalid. If exceptions were thrown when invalid data was first provided, this would greatly simplify; it pushes the error from serialization time to model creation time. Because it's at model creation time, the developer would know precisely what they did wrong.

    public void test() { 
        System.out.println("Why is it OK for me to do this?");
        QualifiedName qn = new org.openprovenance.prov.xml.QualifiedName("!!##", "**!@#!@", "(!^$(!^#)$(!#^)");
        System.out.println(qn);
    }

This code actually works -- which I think is surprising. If I then use this qn in some other model class, it won't be until serialization time that I find there's a problem. And with the exception stack above, I won't be able to find the problem.

So I think my recommendation would be to do input validity checking in the model classes, to prevent it being possible to create such errors. In the case of this QualifiedName above, there's nothing to be done, maybe an exception is appropriate. In other cases, (like an element whose text node contains <, >, &amp;, some simple processing and substitution is appropriate (for the XML serialization)

lucmoreau commented 10 years ago

Hi,

Yes, some more work is required on QualifiedNames, to support the various serializations.

On 24/04/14 19:53, moxious wrote:

I'm not familiar with how the serialization proceeds, I can't offer recommendations there.

However, when you first create the model instances (such as QualifiedNames) there should be some checking at that point. If a developer creates a QualifiedName object with invalid characters in them, it should either fix those characters (if possible) or throw an exception because the QualifiedName instance itself is invalid. If exceptions were thrown when invalid data was first provided, this would greatly simplify; it pushes the error from serialization time to model creation time. Because it's at model creation time, the developer would know precisely what they did wrong.

 public void test() {
     System.out.println("Why is it OK for me to do this?");
     QualifiedName qn = new org.openprovenance.prov.xml.QualifiedName("!!##", "**!@#!@", "(!^$(!^#)$(!#^)");
     System.out.println(qn);
 }

This code actually works -- which I think is surprising. If I then use this |qn| in some other model class, it won't be until serialization time that I find there's a problem. And with the exception stack above, I won't be able to find the problem.

So I think my recommendation would be to do input validity checking in the model classes, to prevent it being possible to create such errors. In the case of this QualifiedName above, there's nothing to be done, maybe an exception is appropriate. In other cases, (like an element whose text node contains <, >, &, some simple processing and substitution is appropriate (for the XML serialization)

— Reply to this email directly or view it on GitHub https://github.com/lucmoreau/ProvToolbox/issues/89#issuecomment-41317948.Web Bug from https://github.com/notifications/beacon/142275__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcxMzk4NDgwNCwiZGF0YSI6eyJpZCI6MzA3MDY1ODV9fQ==--1eb80f4ee7bd9847a01c38c91d9a56f60b39770d.gif

moxious commented 10 years ago

I've tracked down the source of the problem over a number of hours with a debugger. It turns out the issue is related to creating QualifiedName objects where the local part contained a space in it. That space was then causing the invalid character problem.

In the XML ProvFactory, it creates QualifiedNames via a pass-through method that just calls the QualifiedName constructor. That constructor does no input checking at all. There is precedent in other implementations of QName for constructors to throw InvalidArgumentException. The spec for QName parts includes the constraint that namespace URI is an anyURI and the local part must be an NCName. Checking those spec constraints for XML and throwing an InvalidArgumentException if the constructor is given bad data (which can never be validly serialized) would be an improvement.

Often I have complex code driving generation of documents that have dozens or hundreds of items in it, and it is very labor intensive to track down the source of the problem when the implementation is permitting behavior that isn't to spec.

lucmoreau commented 10 years ago

Hi,

Actually, the QName javadoc documentation explicitly states:

/The Namespace URI is //not//validated as a URI reference. The local part is //not//validated as a NCName as specified in Namespaces in XML./

I would be reluctant to add such tests as part of the constructor. However, I agree that it would be useful to have a function that checks whether a QualifiedName has got a valid syntax.

Luc

On 28/04/2014 14:25, moxious wrote:

I've tracked down the source of the problem over a number of hours with a debugger. It turns out the issue is related to creating QualifiedName objects where the local part contained a space in it. That space was then causing the invalid character problem.

In the XML ProvFactory, it creates QualifiedNames via a pass-through method that just calls the QualifiedName constructor. That constructor does no input checking at all. There is precedent in other implementations of QName http://docs.oracle.com/javase/7/docs/api/javax/xml/namespace/QName.html for constructors to throw |InvalidArgumentException|. The spec for QName parts http://www.w3.org/TR/xmlschema-2/#QName includes the constraint that namespace URI is an |anyURI| and the local part must be an |NCName|. Checking those spec constraints for XML and throwing an |InvalidArgumentException| if the constructor is given bad data (which can never be validly serialized) would be an improvement.

Often I have complex code driving generation of documents that have dozens or hundreds of items in it, and it is very labor intensive to track down the source of the problem when the implementation is permitting behavior that isn't to spec.

— Reply to this email directly or view it on GitHub https://github.com/lucmoreau/ProvToolbox/issues/89#issuecomment-41557275.Web Bug from https://github.com/notifications/beacon/142275__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcxNDMxMDcyNCwiZGF0YSI6eyJpZCI6MzA3MDY1ODV9fQ==--8e9e7088cdda1bb6a1f6f3261083396817f034e5.gif

Professor Luc Moreau Electronics and Computer Science tel: +44 23 8059 4487 University of Southampton fax: +44 23 8059 2865 Southampton SO17 1BJ email: l.moreau@ecs.soton.ac.uk United Kingdom http://www.ecs.soton.ac.uk/~lavm

moxious commented 10 years ago

Yes, the QName implementation doesn't do this checking either, which is strange, considering that other related standard java classes like URL & URI throw exceptions out of their constructors. Doubly strange in that QName already throws an Exception on null input. I don't know why this is but the docs also say blank prefixes are permitted to maintain backwards compatibility, so that may be part of the issue.

But I'm curious, why the reluctance? When would it be acceptable, in the XML serialization library, for a QualifiedName to contain a character that can't be serialized to XML?

moxious commented 10 years ago

I've written my own implementation, subclassing QualifiedName. It can be found here: https://github.com/plus-provenance/plus/blob/master/src/main/java/org/mitre/provenance/plusobject/prov/VerifiedXMLQualifiedName.java