openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

JhoveView: Markup Parsing Error: dynPolLoginRedirect.html #116

Open carlwilson opened 8 years ago

carlwilson commented 8 years ago

Dev Effort

1D

Description

Ubuntu 10.04.4 LTS JHOVE 1.10

Running JhoveView produces an error, although it then proceeds to start as expected. Command and trace below.

Command: java -jar JhoveView.jar

[Warning] jhove.conf:6:73: schema_reference.4: Failed to read schema document 'http://hul.harvard.edu/ois/xml/xsd/jhove/jhoveConfig.xsd', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
[Error] jhove.conf:6:73: cvc-elt.1: Cannot find the declaration of element 'jhoveConfig'.
[Fatal Error] dynPolLoginRedirect.html:1:3: The markup in the document preceding the root element must be well-formed.

ross-spencer commented 4 years ago

Okay, this is an interesting one. I hadn't realized I logged the original ticket, though that was useful for context. What we were seeing when it was logged was the corporate firewall getting in the way: JHOVE tries to contact the Harvard servers to download the configuration schema, which it then uses to validate the configuration document, and the firewall's login redirect page came back in place of the schema.
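To make the failure mode concrete, here is a minimal Python sketch (not JHOVE code, which is Java) of the check that is effectively failing: whatever comes back from the schema URL has to be well-formed XML with an `xs:schema` root. The function name `looks_like_xsd` and the sample payloads are assumptions for illustration only; they are not the actual firewall or Harvard responses.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def looks_like_xsd(payload: bytes) -> bool:
    """Return True only if payload is well-formed XML whose root is xs:schema."""
    try:
        root = ET.fromstring(payload)
    except ET.ParseError:
        # Typical HTML (unquoted attributes, unclosed tags) is not well-formed XML.
        return False
    return root.tag == f"{{{XSD_NS}}}schema"


# Hypothetical stand-ins for what a server might return.
real_schema = b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'
login_page = b"<html><head><meta charset=utf-8></head><body>Please log in</body></html>"
tidy_xhtml = b'<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'

print(looks_like_xsd(real_schema))  # True
print(looks_like_xsd(login_page))   # False: not well-formed XML
print(looks_like_xsd(tidy_xhtml))   # False: well-formed, but root is not xs:schema
```

The two `False` cases roughly correspond to the two kinds of message in the traces: a fatal well-formedness error, and a complaint that the document is not a schema.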

Actually, we can now recreate it. I found a website running the same firewall and then pointed the configuration at it:

<?xml version="1.0" encoding="UTF-8"?>
<jhoveConfig version="1.0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig"
 xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig
                     https://web.archive.org/web/20190917103633/http://nsa.stuart-hall.org/dynPolLoginRedirect.html">

[Fatal Error] dynPolLoginRedirect.html:1:3: The markup in the document preceding the root element must be well-formed.
May 02, 2020 8:03:51 PM JhoveView errorAlert
WARNING: Error parsing configuration file: The markup in the document preceding the root element must be well-formed.

We can also point the configuration at other non-XSD sources:

Pointing it at GitHub:

<?xml version="1.0" encoding="UTF-8"?>
<jhoveConfig version="1.0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig"
 xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig
                     https://github.com/">

[Error] :32:68: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'The world’s leading software development platform · GitHub'.
[Fatal Error] :79:59: Attribute name "data-pjax-transient" associated with an element type "meta" must be followed by the ' = ' character.
May 02, 2020 8:04:53 PM JhoveView errorAlert
WARNING: Error parsing configuration file: Attribute name "data-pjax-transient" associated with an element type "meta" must be followed by the ' = ' character.

Pointing it at the original SourceForge issue:

<?xml version="1.0" encoding="UTF-8"?>
<jhoveConfig version="1.0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig"
 xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig
                     https://sourceforge.net/p/jhove/bugs/51/">

[Error] :35:88: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'JHOVE / Bugs / #51 JhoveView: Markup Parsing Error: dynPolLoginRedirect.html'.
[Error] :52:36: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'if (!window.SF) { window.SF = {}; }'.
[Error] :53:21: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'SF.sandiego = false;'.
[Error] :54:27: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'SF.sandiego_chrome = true;'.
[Error] :55:35: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'SF.cdn = "https://a.fsdn.com/con";'.
[Error] :87:23: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'div.moderate {'.
[Error] :88:24: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'color:grey;'.
[Error] :89:10: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw '}'.
[Error] :114:25: s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw '/* make URL '.
[Fatal Error] :114:26: The entity name must immediately follow the '&' in the entity reference.
May 02, 2020 8:05:43 PM JhoveView errorAlert
WARNING: Error parsing configuration file: The entity name must immediately follow the '&' in the entity reference.

Impact

So, after all that playing about, the impact is as one might imagine: jhove-view opens correctly, but we can't do much with the window. None of the modules are loaded, so if we drag and drop a file into the window a processing pop-up appears, but no processing happens as far as I can see. If you try selecting a module, there are no entries.

What do we do?

Rightly, this is marked as a low-priority task, but are there some things we might consider doing, @carlwilson, e.g. to make this more robust?

A couple of ideas:

  1. XML validation for a config document seems like quite a high bar. Can we skip validation of the config entirely? Is XML still the right choice for a config document?

  2. Do we ask JHOVE to exit entirely and more cleanly, maybe with a clearer message? I.e. once we know we haven't got the schema to validate against, we can let the user know that. Right now we're asking the user to react to an unfiltered message from the XML validation; we could parse and translate it to say that config validation didn't work because the schema was invalid.

  3. Something else? Maybe load a default configuration? (A downside of that is that certain modules may not be installed to be accessed.)

I feel like there is room to do something here, but I'm not sure of the project's appetite.
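As a rough illustration of idea 2, the Python sketch below separates "the config itself is broken" from "the schema location did not return a schema" and reports each in plain language. It is only a sketch of the approach, not a proposal for the actual Java change: `check_config` and the `fetch` callable are hypothetical names, and the fetch is stubbed so nothing touches the network.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def check_config(config: bytes, fetch) -> str:
    """Return a user-facing status message for a config document.

    `fetch` is a hypothetical callable mapping a URL to the bytes served there.
    """
    try:
        root = ET.fromstring(config)
    except ET.ParseError as err:
        return f"Configuration file is not well-formed XML: {err}"

    # xsi:schemaLocation is a whitespace-separated list of namespace/URL pairs.
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    for _namespace, url in zip(tokens[0::2], tokens[1::2]):
        try:
            schema_root = ET.fromstring(fetch(url))
        except ET.ParseError:
            return f"Could not validate the configuration: {url} did not return an XML schema."
        if schema_root.tag != f"{{{XSD_NS}}}schema":
            return f"Could not validate the configuration: {url} did not return an XML schema."
    return "Configuration is well-formed and its schema location returned an XML schema."


# The config header from the GitHub reproduction, closed so that it parses.
CONFIG = b"""<jhoveConfig version="1.0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig"
 xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove/jhoveConfig
                     https://github.com/">
</jhoveConfig>"""


def html_fetch(url):
    # Stub standing in for a server that answers with an HTML page.
    return b"<html><body><p>Sign in<br></body></html>"


def xsd_fetch(url):
    # Stub standing in for a server that answers with a (trivial) schema.
    return b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'


print(check_config(CONFIG, html_fetch))
print(check_config(CONFIG, xsd_fetch))
```

With the HTML stub, the user sees one sentence naming the offending URL rather than the raw parser output.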

carlwilson commented 4 years ago

Nice work @ross-spencer. JHOVE config is an area that requires a little more work, as the code's quite old. I do agree about the validation, but am happy to take a look a little further down the line once the final stream is underway in a week.

And add me to the assigned list.

MartinSpeller commented 4 years ago

JhoveView: Markup Parsing Error: dynPolLoginRedirect.html #116 - Assigned to ross-spencer

ross-spencer commented 4 years ago

@carlwilson Nice. Shall we update the ticket name now too, do you reckon? Maybe, to begin with, something like: JhoveView: Markup Parsing Error when the jhoveConfig.xsd schema location is improperly redirected?