metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

OAI-Pmh fails with SAXParseException #334

dr0i closed 4 years ago

dr0i commented 4 years ago

Reported by @hagbeck:

"https://eldorado.tu-dortmund.de/oai/request" |
open-oaipmh(metadataPrefix="oai_dc", dateFrom="2020-09-19", dateUntil="2020-09-19") |
decode-xml |
handle-generic-xml |
encode-formeta(style="multiline") |
write("stdout");

results in: SAXParseException; lineNumber: 194; columnNumber: 90; Zeichenreferenz "&#55349" ist ein ungültiges XML-Zeichen (in English: character reference "&#55349" is an invalid XML character)
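
To see what the parser is objecting to, it can help to decode the number in the character reference. The following is a minimal sketch, not part of Metafacture: the class name `CharRefInspector` is hypothetical, and 56476 is an assumed second half chosen only to make the example concrete (it is not taken from the harvested record).

```java
// Illustrative sketch; class and method names are hypothetical.
class CharRefInspector {

    /** Describes what a decimal XML character reference such as &#55349; points at. */
    static String describe(int value) {
        String hex = "U+" + Integer.toHexString(value).toUpperCase();
        if (value >= Character.MIN_HIGH_SURROGATE && value <= Character.MAX_HIGH_SURROGATE) {
            return hex + " is a UTF-16 high surrogate, not a character on its own";
        }
        if (value >= Character.MIN_LOW_SURROGATE && value <= Character.MAX_LOW_SURROGATE) {
            return hex + " is a UTF-16 low surrogate, not a character on its own";
        }
        return hex + " is a regular code point";
    }

    /** Combines two surrogate halves into the supplementary code point they encode. */
    static int combine(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        System.out.println(describe(55349));
        // 56476 is an assumed low surrogate, used here only for illustration.
        System.out.println("U+" + Integer.toHexString(combine(55349, 56476)).toUpperCase());
    }
}
```

The value 55349 is 0xD835, which falls in the surrogate range that the XML `Char` production excludes. A character outside the Basic Multilingual Plane has to be referenced by its full code point (for example `&#x1D49C;`), not as two separate surrogate references.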

fsteeg commented 4 years ago

Here is the example provided by @hagbeck to show the XML should be valid:

https://eldorado.tu-dortmund.de/oai/request?verb=ListRecords&from=2020-09-19&until=2020-09-19&metadataPrefix=mods

In the browser this shows the result of their XSLT transformation; see 'view source' for the XML. However, I think this only works because their XML processor is lenient: the XML declares itself as <?xml version="1.0"?>, but arbitrary Unicode characters are only supported in XML 1.1, see https://www.w3.org/TR/xml11/#sec-xml11:

Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0.

(I tried to update Xalan in metafacture-biblio to 2.7.2 as suggested by @hagbeck, but that makes no difference – which I think makes sense since there is no XSLT involved in the sample Flux above).

So I see 3 possible approaches:

dr0i commented 4 years ago

I will try to update the OCLC library to use XML 1.1 instead of 1.0.

dr0i commented 4 years ago

Made the OCLC library output XML version 1.1. That version was recognized by the AbstractSAXParser where the exception is thrown. But it didn't change a thing. I stumbled upon https://stackoverflow.com/questions/15634536/java-sax-parser-mangles-attributes-for-xml-1-1 where it is recommended not to use the JDK XML parser but to switch to the Apache Xerces XML parser. This should be tried.

fsteeg commented 4 years ago

I don't understand this:

Made the OCLC library output XML version 1.1. That version was recognized by the AbstractSAXParser where the exception is thrown. But it didn't change a thing.

It throws the same exception although it's now getting XML 1.1?

dr0i commented 4 years ago

Exactly.
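
One way to check this observation in isolation is to feed the default SAX parser tiny documents that differ only in the declared XML version. This is a hedged sketch, not code from the OCLC library or Metafacture: the class name `XmlVersionProbe`, the `record` element and the sample references are made up for illustration, and it assumes whatever JAXP parser `SAXParserFactory.newInstance()` returns in your environment.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch; class name and sample documents are hypothetical.
class XmlVersionProbe {

    /** Returns true if a one-element document with the given version and content parses. */
    static boolean parses(String version, String content) {
        String xml = "<?xml version=\"" + version + "\"?><record>" + content + "</record>";
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Well-formedness errors (such as an invalid character reference) end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String surrogateRefs = "&#55349;&#56476;"; // two surrogate halves as separate references
        String codePointRef = "&#x1D49C;";         // one reference to the full code point
        for (String version : new String[] {"1.0", "1.1"}) {
            System.out.println(version + " surrogate refs:  " + parses(version, surrogateRefs));
            System.out.println(version + " code point ref:  " + parses(version, codePointRef));
        }
    }
}
```

Surrogate code points are excluded from the character range in both XML 1.0 and XML 1.1, so a probe like this is a quick way to confirm that changing only the declared version does not make surrogate references acceptable.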

dr0i commented 4 years ago

Ah, nasty nasty, dependency hell ... Getting rid of xalan-serializer, which is a dependency of xalan but not needed by the OAIPmh, fixes it, see https://stackoverflow.com/questions/11952289/serializing-supplementary-unicode-characters-into-xml-documents-with-java. Will provide a proper config tomorrow.

dr0i commented 4 years ago

It's like this: if xalan:xalan:2.7.2 is declared as a dependency in build.gradle, the IDE lists xalan, xml-apis and serializer among the libraries in use, and the run fails. If only the xalan library shipped with oaiharvester is used, xalan appears in the IDE but neither xml-apis nor serializer, and the run succeeds. That is, all data can be retrieved, but only when the encoding is set to ISO-8859-1, which breaks UTF-8 characters (like mathematical symbols). I don't know how to cope with that properly. As @fsteeg said, your OAI-PMH server serves XML version 1.0. This seems to be what the parsers of the oclc-oaipmh use (OaiPmhOpener wraps its own XML header around the output; one could set XML version 1.1 there, but the parsers don't use it, it only serves as the XML header for the output). I could manipulate the XML source to check this, but maybe it would even be possible for you, @hagbeck, to set the XML header to 1.1?

hagbeck commented 4 years ago

We will check this.

But yesterday I discovered in another context that the current version of xmllint in Ubuntu 20.04 doesn't support XML version 1.1. It seems that this solution isn't stable enough for all use cases, is it?

dr0i commented 4 years ago

ACK. Also, I just ran the OAIPmh standalone. Surprise: it works perfectly with your OAI server! So it must be some library dependency, and this should somehow be solvable.
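
When a run behaves differently depending on which jars are present, it can help to print which JAXP implementations are actually picked up at runtime, since the JAXP factories are located through a provider lookup and a library on the classpath can replace the JDK's built-in ones. The following is a small diagnostic sketch under that assumption; the class name `JaxpProviders` is hypothetical and not part of Metafacture or the OAI harvester.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

// Illustrative diagnostic; the class name is hypothetical.
class JaxpProviders {

    /** Returns the implementation class of a factory and, if known, where it was loaded from. */
    static String describe(Object factory) {
        Class<?> type = factory.getClass();
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String origin = (source == null || source.getLocation() == null)
                ? "no code source, typically a JDK class"
                : source.getLocation().toString();
        return type.getName() + " (" + origin + ")";
    }

    public static void main(String[] args) {
        System.out.println("SAX parser:  " + describe(SAXParserFactory.newInstance()));
        System.out.println("DOM builder: " + describe(DocumentBuilderFactory.newInstance()));
        System.out.println("Transformer: " + describe(TransformerFactory.newInstance()));
    }
}
```

Running this once inside the failing setup and once standalone would show whether, for example, the transformer comes from an `org.apache.xalan` package in a jar or from the JDK itself.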

dr0i commented 4 years ago

@hagbeck try branch 334-fixEncoding, this should work. It basically uses some older libraries and excludes some others explicitly.

hagbeck commented 4 years ago

I've tried it using flux (open-oaipmh(metadataPrefix="mods", dateUntil="2020-10-23")) and got the following error. Changing the date or metadataPrefix results in the same error.

Exception in thread "main" org.metafacture.commons.reflection.ReflectionException: class could not be instantiated: class org.metafacture.metamorph.Metamorph
        at org.metafacture.commons.reflection.ConfigurableClass.newInstance(ConfigurableClass.java:105)
        at org.metafacture.commons.reflection.ObjectFactory.newInstance(ObjectFactory.java:67)
        at org.metafacture.flux.parser.FluxProgramm.createElement(FluxProgramm.java:70)
        at org.metafacture.flux.parser.FluxProgramm.addElement(FluxProgramm.java:81)
        at org.metafacture.flux.parser.FlowBuilder.pipe(FlowBuilder.java:736)
        at org.metafacture.flux.parser.FlowBuilder.flowtail(FlowBuilder.java:514)
        at org.metafacture.flux.parser.FlowBuilder.flow(FlowBuilder.java:226)
        at org.metafacture.flux.parser.FlowBuilder.flux(FlowBuilder.java:122)
        at org.metafacture.flux.FluxCompiler.compileFlow(FluxCompiler.java:54)
        at org.metafacture.flux.FluxCompiler.compile(FluxCompiler.java:42)
        at org.metafacture.runner.Flux.main(Flux.java:79)
Caused by: java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.metafacture.commons.reflection.ConfigurableClass.newInstance(ConfigurableClass.java:101)
        ... 10 more
Caused by: org.metafacture.metamorph.MetamorphException: Error while building the Metamorph transformation pipeline: javax.xml.transform.TransformerException: org.xml.sax.SAXException: org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
        at org.metafacture.metamorph.Metamorph.buildPipeline(Metamorph.java:191)
        at org.metafacture.metamorph.Metamorph.<init>(Metamorph.java:179)
        at org.metafacture.metamorph.Metamorph.<init>(Metamorph.java:126)
        at org.metafacture.metamorph.Metamorph.<init>(Metamorph.java:116)
        at org.metafacture.metamorph.Metamorph.<init>(Metamorph.java:112)
        ... 15 more
Caused by: org.metafacture.framework.MetafactureException: javax.xml.transform.TransformerException: org.xml.sax.SAXException: org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
        at org.metafacture.metamorph.xml.DomLoader.process(DomLoader.java:136)
        at org.metafacture.metamorph.xml.DomLoader.parse(DomLoader.java:70)
        at org.metafacture.metamorph.AbstractMetamorphDomWalker.walk(AbstractMetamorphDomWalker.java:108)
        at org.metafacture.metamorph.AbstractMetamorphDomWalker.walk(AbstractMetamorphDomWalker.java:104)
        at org.metafacture.metamorph.Metamorph.buildPipeline(Metamorph.java:187)
        ... 19 more
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXException: org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
        at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:449)
        at org.metafacture.metamorph.xml.DomLoader.process(DomLoader.java:134)
        ... 23 more
Caused by: org.xml.sax.SAXException: org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
        at org.apache.xml.utils.DOMBuilder.startElement(DOMBuilder.java:322)
        at org.apache.xalan.transformer.TransformerIdentityImpl.startElement(TransformerIdentityImpl.java:1020)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.startElement(XMLFilterImpl.java:551)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.startElement(XMLFilterImpl.java:551)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.startElement(XMLFilterImpl.java:551)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.startElement(XMLFilterImpl.java:551)
        at org.metafacture.metamorph.xml.LocationAnnotator.startElement(LocationAnnotator.java:80)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:510)
        at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.startElement(XMLSchemaValidator.java:832)
        at java.xml/com.sun.org.apache.xerces.internal.xinclude.XIncludeHandler.startElement(XIncludeHandler.java:1001)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:374)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl$NSContentDriver.scanRootElementHook(XMLNSDocumentScannerImpl.java:613)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3063)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:836)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
        at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
        at org.metafacture.metamorph.xml.LexicalHandlerXmlFilter.parse(LexicalHandlerXmlFilter.java:51)
        at java.xml/org.xml.sax.helpers.XMLFilterImpl.parse(XMLFilterImpl.java:357)
        at org.metafacture.metamorph.xml.LexicalHandlerXmlFilter.parse(LexicalHandlerXmlFilter.java:51)
        at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:432)
        ... 24 more
Caused by: org.w3c.dom.DOMException: NAMESPACE_ERR: Es wurde versucht, ein Objekt auf eine Weise zu erstellen oder zu ändern, die falsch in Bezug auf Namespaces ist.
        at java.xml/com.sun.org.apache.xerces.internal.dom.AttrNSImpl.setName(AttrNSImpl.java:109)
        at java.xml/com.sun.org.apache.xerces.internal.dom.AttrNSImpl.<init>(AttrNSImpl.java:78)
        at java.xml/com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl.createAttributeNS(CoreDocumentImpl.java:2140)
        at java.xml/com.sun.org.apache.xerces.internal.dom.ElementImpl.setAttributeNS(ElementImpl.java:652)
        at org.apache.xml.utils.DOMBuilder.startElement(DOMBuilder.java:307)
        ... 52 more

dr0i commented 4 years ago

@hagbeck I updated the xalan-library. Can you try again please?

hagbeck commented 4 years ago

:+1: It works fine now!

dr0i commented 4 years ago

Resolved by https://github.com/metafacture/metafacture-core/pull/335. Closing.