openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

XML Extraction failing with SaxParseException #745

Open rgalv opened 2 years ago

rgalv commented 2 years ago

Hi,

I have an XML file that is failing with the error below.

Error/s returned during metadata extraction (SaxParseException: java.lang.ClassCastException: class sun.net.www.protocol.file.FileURLConnection cannot be cast to class java.net.HttpURLConnection (sun.net.www.protocol.file.FileURLConnection and java.net.HttpURLConnection are in module java.base of loader 'bootstrap'),Failed to retrieve extractor properties) Agent: JHOVE 1.24.2, XML-hul 1.5.1 , Plugin Version 6.0

Can you please advise on the cause of this error and how we can fix it? Thanks.
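For readers landing here: the ClassCastException in the message above is what Java raises when a `URLConnection` opened for a `file:` URL is cast to `HttpURLConnection`. The sketch below reproduces that behaviour in isolation. It is not JHOVE's code; the class name `ConnectionKindDemo`, its method names, and any URLs passed to it are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical demo class, not part of JHOVE.
final class ConnectionKindDemo {

    // Opens (but does not connect) a URLConnection for the given URL string.
    private static URLConnection open(String url) {
        try {
            return URI.create(url).toURL().openConnection();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Guarded check: true only when the connection really is an HTTP(S) one.
    static boolean opensHttpConnection(String url) {
        return open(url) instanceof HttpURLConnection;
    }

    // Unguarded cast: returns true when the cast throws ClassCastException,
    // which is what happens for a file: URL.
    static boolean unguardedCastFails(String url) {
        try {
            HttpURLConnection http = (HttpURLConnection) open(url);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

Calling `unguardedCastFails` with a `file:` URL should return `true`, whereas an `instanceof` check like the one in `opensHttpConnection` lets the caller handle non-HTTP locations without an exception.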

david-russo commented 2 years ago

I believe this may be fixed in the next release of JHOVE (v1.26). A release candidate is currently available if you'd like to test it before the final release.

carlwilson commented 2 years ago

We believe that this is fixed in the recent v1.26 release. Would it be possible to test this and let us know if it's fixed, please, @rgalv? Even better, would it be possible to post the test file here so we can test it and add it to our regression test suite?

leninoc commented 1 year ago

Hi, we are seeing strange behaviour that might be related to this. We have an ALTO XML file that links to a local XSD file. When I run standalone JHOVE 1.24 (XML-hul 1.5.1), I get XML-HUL-3 SaxException cause: java.lang.ClassCastException. When I run the same file in JHOVE 1.26 (XML-hul 1.5.2), I get an XML-HUL-1 SAXParseException error, plus 67748 (!!!) InfoMessages with this SubMessage: schema_reference.4: Failed to read schema document '//docstorage2/impdata1/docWORKS_KBNL/schema/alto-1-2.xsd', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.

The SubMessage is correct; it's OK that standalone JHOVE cannot read the local XSD. In Rosetta we have a mechanism that makes this possible for JHOVE plugins (via jhove.conf etc.).
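As a rough illustration of what redirecting a schema location to a local copy can look like in plain SAX, here is a minimal sketch. It is not the Rosetta mechanism and not JHOVE's implementation; `LocalSchemaResolver` and its map of locations are hypothetical, and the keys would need to match the system identifiers exactly as the parser reports them.

```java
import java.nio.file.Path;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JHOVE or Rosetta.
final class LocalSchemaResolver implements EntityResolver {

    // Maps a system identifier (as reported by the parser) to a local copy.
    private final Map<String, Path> localCopies;

    LocalSchemaResolver(Map<String, Path> localCopies) {
        this.localCopies = Map.copyOf(localCopies);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        Path local = (systemId == null) ? null : localCopies.get(systemId);
        if (local == null) {
            // No mapping: returning null asks the parser to use its default lookup.
            return null;
        }
        // Point the parser at the local copy instead of the original location.
        return new InputSource(local.toUri().toString());
    }
}
```

Such a resolver would be registered with `XMLReader.setEntityResolver(...)`. Whether schema documents are looked up through it depends on the parser and its configuration, so treat this as a starting point and not a confirmed fix.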

A conclusion of sorts: it looks like the original #745 issue was fixed, but JHOVE 1.26 now complains about a different thing, with a huge number of repeated InfoMessages. That number of InfoMessages uses a lot of resources in our system. In Rosetta, the JHOVE 1.26 plugin seems unable to read the local XSD, unlike the previous plugins based on JHOVE 1.24 or 1.17. Just thought I would share this. The XML file in question is attached as a zip: 0003.zip.

Best, Jan