openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

XML Module: White spaces are required between publicId and systemId.: Line = 1, Column = 50 #227

Open ross-spencer opened 7 years ago

ross-spencer commented 7 years ago

Dev Effort

1D

Description

Attached are two versions of the same XML, and the corresponding JHOVE output. The example comes from Roland at ethz.

I've recreated what he was seeing, and can't quite understand the reason for the error.

White spaces are required between publicId and systemId.: Line = 1, Column = 50

The reported line and column do not change when I add the XML declaration to the original document, so the error doesn't seem to come from the XML itself. I wonder if it is something to do with the external dependencies.

Extracting the XSD references, the document seems to refer only to:

http://www.loc.gov/standards/mets/version18/mets.xsd
http://www.loc.gov/standards/mods/v3/mods-3-5.xsd
http://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd
http://www.loc.gov/standards/mix/mix20/mix20.xsd

I can't find systemid or publicid in any of them, so I'm not sure what else to check at this point.

jhove-export_mets_2017.no-declaration.xml.txt export_mets_2017_no_declaration.xml.txt export_mets_2017_with_declaration.xml.txt jhove-export_mets_2017.declaration.xml.txt

ross-spencer commented 7 years ago

N.B. There do seem to be three instances each of hard-coded literals with these values (publicid, systemid) in the XML Module, e.g.

https://github.com/openpreserve/jhove/blob/08baeef92fff3c15551c17c71f614089f1bed4bc/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/XmlModule.java#L757

https://github.com/openpreserve/jhove/blob/08baeef92fff3c15551c17c71f614089f1bed4bc/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/XmlModule.java#L754

anjackson commented 7 years ago

The error is a SAX parser error, and yes, it is bubbling up from one of the dependencies. Specifically:

$ curl http://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd">here</a>.</p>
<hr>
<address>Apache/2.4.10 (Linux/SUSE) Server at www.danrw.de Port 80</address>
</body></html>

i.e. the danrw-contract-1.xsd schema has moved. If I update the schema declaration to use https://www.danrw.de/schemas/contract/v1/danrw-contract-1.xsd, it validates. It's surprising that the SAX parser does not follow redirects when fetching XSDs.
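The wording of the error fits this explanation: the redirect page starts with a DOCTYPE that has a public identifier but no system identifier, which HTML allows and XML does not, and the closing `>` of that first line sits at column 50. A minimal sketch to reproduce it outside JHOVE, assuming the JDK's default SAX parser (the class and method names here are made up for illustration and are not part of JHOVE):

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class: feeds the 301 page to a SAX parser as if it were the XSD. */
final class RedirectBodyDemo {

    /** Shortened copy of the body returned by the curl call above. */
    static final String REDIRECT_BODY =
            "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n"
            + "<html><head>\n"
            + "<title>301 Moved Permanently</title>\n"
            + "</head><body>\n"
            + "</body></html>\n";

    /** Returns a JHOVE-style description of the first parse error, or null if parsing succeeds. */
    static String parseError(String body) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(body)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so a malformed document ends up here.
            return e.getMessage() + ": Line = " + e.getLineNumber()
                    + ", Column = " + e.getColumnNumber();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Printing `RedirectBodyDemo.parseError(RedirectBodyDemo.REDIRECT_BODY)` should show a parse error on line 1; the exact message text depends on the parser implementation and locale.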

anjackson commented 7 years ago

Okay, here's a relevant StackOverflow Q that has a solution: http://stackoverflow.com/questions/29696638/how-to-validate-xml-with-schema-urls-that-return-http-301
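For reference, one possible shape for such a workaround is an entity resolver that follows redirects itself, since `HttpURLConnection` does not follow a redirect that switches protocol from http to https. This is only a sketch under assumptions: `RedirectFollowingResolver` and its helper methods are hypothetical names, this is not necessarily the approach from the linked answer, and whether the parser consults an `EntityResolver` for XSD lookups depends on the parser implementation.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that follows HTTP redirects by hand; not part of JHOVE. */
final class RedirectFollowingResolver implements EntityResolver {

    private static final int MAX_REDIRECTS = 5;

    static boolean isHttpUrl(String systemId) {
        return systemId != null
                && (systemId.startsWith("http://") || systemId.startsWith("https://"));
    }

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        if (!isHttpUrl(systemId)) {
            return null; // let the parser fall back to its default handling
        }
        URL url = new URL(systemId);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Handle redirects ourselves so that http -> https hops are followed too.
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                InputSource source = new InputSource(conn.getInputStream());
                source.setSystemId(url.toString());
                return source;
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header: " + url);
            }
            url = new URL(url, location); // also resolves relative Location values
        }
        throw new IOException("Too many redirects for " + systemId);
    }
}
```

It would be registered with something like `xmlReader.setEntityResolver(new RedirectFollowingResolver())` before parsing. The hop limit guards against redirect loops.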

rolandsuri commented 7 years ago

At the ETH Data Archive we have two additional files that may trigger the same issue in JHOVE as the one described above. As with the previously attached file, JHOVE considers these two files not well-formed because the URL of the XSD schema is redirected from http to https (my browser follows the redirect automatically).

The attached file Dia_002-034_10776.xml is considered by JHOVE to be not well-formed (Dia_002-034_10776.xml.txt). Again the JHOVE error message is “space required between publicId and systemID” (Dia_002-034_10776_JHOVEreport.xml.txt). The URI of the schema, http://www.e-pics.ethz.ch/index/rosetta/schema/epics_rosetta_schema.xsd, is redirected in my browser to the corresponding https location. To avoid the redirect, I replaced all five instances of http://www.e-pics.ethz.ch with https://www.e-pics.ethz.ch (Dia_002-034_10776_httpReplacedByHttpsForAllePicsPathes.xml.txt). This file is valid and well-formed.

The file 10539670.xml is considered by JHOVE to be not well-formed (10539670.xml.txt). The error message is “premature end of file” (10539670_JHOVEreport.xml.txt). The file contains an invalid URL to an XSD file: http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml . I guess even with an invalid path to a schema, the file should be well-formed? The URL is redirected in my browser from http to https. If I adapt the XML file by replacing http with https in that URL (10539670_httpReplacedByHttpsInwwwAbbyyCom.xml.txt), JHOVE reports the file (after some minutes of computing time) as well-formed but not valid, with the error message “cannot find declaration of element ‘document’” (10539670_httpReplacedByHttpsInwwwAbbyyCom_JHOVEreport.xml.txt).
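On the well-formedness question: well-formedness can be checked without touching any schema, because a non-validating parser does not fetch the location given in `xsi:schemaLocation` or `xsi:noNamespaceSchemaLocation`. A minimal sketch, assuming the JDK's default SAX parser (`WellFormedCheck` is a hypothetical helper, not how JHOVE is implemented):

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: checks well-formedness only, without schema validation. */
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false); // no DTD validation; no schema is configured either
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false; // fatal parse error: the document is not well-formed
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this check, a document whose schema URL is unreachable still counts as well-formed, while a document with a missing end tag does not. Note that a document with a DOCTYPE pointing at an external DTD may still cause a network fetch, even in non-validating mode.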

MartinSpeller commented 4 years ago

XML Module: White spaces are required between publicId and systemId.: Line = 1, Column = 50 #227 - Assigned to TBA