pdvrieze / xmlutil

XML Serialization library for Kotlin
https://pdvrieze.github.io/xmlutil/
Apache License 2.0

Can't read Node.PROCESSING_INSTRUCTION with StAX #166

Closed StefanOltmann closed 1 year ago

StefanOltmann commented 1 year ago

There seems to be a limitation when reading XML on the JVM.

I use this code:

fun parseDocumentFromString(input: String): Document {

    val writer = DomWriter()

    val reader = XmlStreaming.newReader(input)

    do {
        val event = reader.next()
        reader.writeCurrent(writer)
    } while (event != EventType.END_DOCUMENT)

    return writer.target
}

I want to read this file: test_xmp.txt

I get this error message:

Current state PROCESSING_INSTRUCTION is not among the statesCHARACTERS, COMMENT, CDATA, SPACE, ENTITY_REFERENCE, DTD valid for getText() 
java.lang.IllegalStateException: Current state PROCESSING_INSTRUCTION is not among the statesCHARACTERS, COMMENT, CDATA, SPACE, ENTITY_REFERENCE, DTD valid for getText() 
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getText(XMLStreamReaderImpl.java:1161)
    at nl.adaptivity.xmlutil.StAXReader.getText(StAXReader.kt:246)
    at nl.adaptivity.xmlutil.EventType$PROCESSING_INSTRUCTION.writeEvent(EventType.kt:208)
    at nl.adaptivity.xmlutil.XmlReaderUtil__XmlReaderKt.writeCurrent(XmlReader.kt:466)
    at nl.adaptivity.xmlutil.XmlReaderUtil.writeCurrent(Unknown Source)

This is the original code I'm trying to replace, which works perfectly fine:

private fun parseInputSource(input: String): Document {

    val source = org.xml.sax.InputSource(StringReader(input))

    val factory = javax.xml.parsers.DocumentBuilderFactory.newInstance()

    factory.isNamespaceAware = true
    factory.isIgnoringComments = true
    factory.isExpandEntityReferences = false

    val builder = factory.newDocumentBuilder()
    builder.setErrorHandler(null)
    return builder.parse(source)
}

Maybe the underlying tech should be the SAX parser instead of StAX? Or maybe an option to switch between them?

StefanOltmann commented 1 year ago

In addition to that problem, I am missing the following things in the expect declaration:

I'm trying to translate this class to Kotlin: https://github.com/drewnoakes/adobe-xmp-core/blob/master/com/adobe/internal/xmp/impl/ParseRDF.java

pdvrieze commented 1 year ago

I found the parsing problem. It is in the way processing instructions are translated from StAX to the library. I've pushed a fix to dev, but in the meantime you could use XmlStreaming.newGenericReader instead; it creates a pure Kotlin reader (based upon the Android parser) that does not have this issue. As to the missing functions, they are only valid on elements, so the common version only offers them there (you should be able to cast to an element, or check whether the node is an element, to get them).
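To illustrate the kind of mismatch behind the first stack trace: in the JDK's StAX API, getText() is not valid while the reader is positioned on a processing instruction; the target and data are read with getPITarget() and getPIData() instead. The following is a minimal sketch using only JDK classes, not the library's actual fix. The function name listProcessingInstructions and the sample input are made up for illustration.

```kotlin
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.XMLStreamConstants

// Hypothetical helper: collects (target, data) for every processing
// instruction in the input, using the PI-specific StAX accessors.
fun listProcessingInstructions(xml: String): List<Pair<String, String>> {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(StringReader(xml))
    val result = mutableListOf<Pair<String, String>>()
    try {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.PROCESSING_INSTRUCTION) {
                // Calling reader.getText() here would throw IllegalStateException,
                // as in the stack trace above; the PI accessors are the valid ones.
                result += reader.getPITarget() to (reader.getPIData() ?: "")
            }
        }
    } finally {
        reader.close()
    }
    return result
}
```

For example, listProcessingInstructions("<?target some data?><root/>") yields a single entry whose target is "target".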

StefanOltmann commented 1 year ago

@pdvrieze Thank you for looking into that so quickly.

I tried XmlStreaming.newGenericReader instead, but with the document linked above it also throws an exception:

Caused by: nl.adaptivity.xmlutil.XmlException: The element is not text, it is: PROCESSING_INSTRUCTION
    at app//nl.adaptivity.xmlutil.core.KtXmlReader.getText(KtXmlReader.kt:769)
    at app//nl.adaptivity.xmlutil.EventType$PROCESSING_INSTRUCTION.writeEvent(EventType.kt:208)
    at app//nl.adaptivity.xmlutil.XmlReaderUtil__XmlReaderKt.writeCurrent(XmlReader.kt:466)
    at app//nl.adaptivity.xmlutil.XmlReaderUtil.writeCurrent(Unknown Source)

It occurs because EventType.PROCESSING_INSTRUCTION.isTextElement evaluates to false.

https://github.com/pdvrieze/xmlutil/blob/d8e591e914a8bb2b1e7a63e057001c41ff5995df/core/src/commonMain/kotlin/nl/adaptivity/xmlutil/core/KtXmlReader.kt#L768C4-L768C4

pdvrieze commented 1 year ago

I've pushed various fixes to dev that should now have addressed this (actually with a test). You can either use the snapshot, or skip the processing instructions.

StefanOltmann commented 1 year ago

@pdvrieze Thank you for the fix and the snapshot.

I tried skipping the processing instructions, but the resulting Document is one the subsequent code can't work with, because that information seems to be needed.
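One way to check which top-level processing instructions the downstream code might depend on is to list the ones the JDK DOM parser keeps as direct children of the Document, which is what the original DocumentBuilder-based code would hand over. This is a minimal sketch using only JDK classes. The function name topLevelProcessingInstructions is hypothetical, and since the contents of test_xmp.txt are not shown in this thread, any sample input is invented.

```kotlin
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.ProcessingInstruction
import org.xml.sax.InputSource

// Hypothetical helper: returns "target=data" for each processing instruction
// that is a direct child of the parsed Document (before or after the root element).
fun topLevelProcessingInstructions(xml: String): List<String> {
    val factory = DocumentBuilderFactory.newInstance()
    factory.isNamespaceAware = true
    val document = factory.newDocumentBuilder().parse(InputSource(StringReader(xml)))
    val children = document.childNodes
    return (0 until children.length)
        .map { children.item(it) }
        .filterIsInstance<ProcessingInstruction>()
        .map { "${it.target}=${it.data}" }
}
```

Running it on an input such as "<?target some data?><root/><?after x?>" lists both instructions, which shows that a DOM Document can hold processing instructions outside the root element.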

I used the snapshot 0.86.1-SNAPSHOT that you just created.

With my file above this results in a different error:

Caused by: nl.adaptivity.xmlutil.XmlException: Document already started
    at app//nl.adaptivity.xmlutil.DomWriter.processingInstruction(DomWriter.kt:216)
    at app//nl.adaptivity.xmlutil.EventType$PROCESSING_INSTRUCTION.writeEvent(EventType.kt:210)
    at app//nl.adaptivity.xmlutil.XmlReaderUtil__XmlReaderKt.writeCurrent(XmlReader.kt:472)
    at app//nl.adaptivity.xmlutil.XmlReaderUtil.writeCurrent(Unknown Source)
    at app//com.adobe.xmp.impl.DomParserKt.parseDocumentFromString(DomParser.kt:37)

Interestingly this also happens if I skip the EventType.START_DOCUMENT, which comes first.

I still use newGenericReader() as newReader() has the old problem:

Caused by: java.lang.IllegalStateException: Current state PROCESSING_INSTRUCTION is not among the statesCHARACTERS, COMMENT, CDATA, SPACE, ENTITY_REFERENCE, DTD valid for getText() 
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getText(XMLStreamReaderImpl.java:1161)
    at nl.adaptivity.xmlutil.StAXReader.getText(StAXReader.kt:246)
    at nl.adaptivity.xmlutil.EventType$PROCESSING_INSTRUCTION.writeEvent(EventType.kt:210)
    at nl.adaptivity.xmlutil.XmlReaderUtil__XmlReaderKt.writeCurrent(XmlReader.kt:472)
    at nl.adaptivity.xmlutil.XmlReaderUtil.writeCurrent(Unknown Source)
    at com.adobe.xmp.impl.DomParserKt.parseDocumentFromString(DomParser.kt:37)
    ... 47 more

I assume the fix you did here is not part of the snapshot.

StefanOltmann commented 1 year ago

@pdvrieze Should I close this and create a new issue for the Document already started problem?

pdvrieze commented 1 year ago

Not needed. They are all related issues.

pdvrieze commented 1 year ago

I've just pushed a new version that properly reworks processing instruction handling (rather than the previous version). It should work for your context (and retain the processing instructions).

StefanOltmann commented 1 year ago

Thanks a lot. The error is gone. :)