openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org
Other
171 stars 79 forks source link

FIX: XML Entity file handling #701

Open carlwilson opened 2 years ago

carlwilson commented 2 years ago

Fixes #700 output is now a more informative:

  Status: Not well-formed
  ErrorMessage: File not found: /home/cfw/proj/opf/jhove/article.dtd (No such file or directory)
   ID: XML-HUL-10
  MIMEtype: text/xml
pwinckles commented 2 years ago

Observation, if you're going to use an automated code formatter, it would make sense to run it on the entire project to create a baseline. This would allow for cleaner PRs and ease the burden of resolving spurious merge conflicts.

pwinckles commented 2 years ago

I've spent a fair bit of time looking at this function the past couple of days. Here is the modified implementation that I arrived at: https://github.com/UW-Madison-Library/jhove/blob/48d083b2394fc8b9426d9567744eff519284b830/jhove-modules/xml-hul/src/main/java/edu/harvard/hul/ois/jhove/module/xml/XmlModuleHandler.java#L332

Things to note:

  1. The code in this PR removes the protocol redirect handling. This means that it will no longer be able to resolve schemas like https://www.openarchives.org/OAI/2.0/oai_dc.xsd, which imports http://dublincore.org/schemas/xmls/simpledc20021212.xsd and is in turn redirected to https://dublincore.org/schemas/xmls/simpledc20021212.xsd
  2. The original code had what I thought was a bug in that after making the initial HTTP request it did not set ent, instead it only set ent if the request encountered a redirect. However, this turned out not to be a bug. The sole purpose of the code was actually to resolve the redirect. If resolveEntity() returns null, the calling code will attempt to resolve the systemId itself, which it is able to do so long as there are no redirects.
  3. Finally, resolveEntity() should set the public id and system id on the returned InputSource as these values are used to resolve relative child entities.
david-russo commented 2 years ago

Hi Peter, thanks for your super helpful review. This all sounds a little familiar, since I did a review of the XML module code a year or two ago, but can barely remember the details now. I fixed a few bugs, and did a lot of house keeping. The PR (#634) is still awaiting integration, but you might find it of interest, if only to prevent duplicating efforts in places.

A fix for this issue which still retains the old redirection logic was included in commit d223a28 of that PR. I'm not sure if it addresses all of your concerns, since it's been too long for me to remember, but let me know if it doesn't, and maybe I can amend that older PR.

pwinckles commented 2 years ago

@david-russo Yeah, it looks like your old PR does address the same issue as is addressed in this PR. It does still have the bug when resolving relative entities but that's a different problem.

I haven't looked at your PR in detail, but it looks like there's some good stuff in there, including a few changes that I have also made. It would be really nice if PRs didn't languish here for years, forcing us to duplicate each other's work and creating an ever growing merge problem for PR authors if/when PRs ever start to be merged again.

david-russo commented 2 years ago

Cheers. Does this comment relate to the relative entity resolution issue you're describing? If so, from what I remember/wrote, it sounds like there's a hacky solution in the module at the moment, but a proper fix may require a change to what information JHOVE passes its modules.

pwinckles commented 2 years ago

Yes, but it can be addressed in some (perhaps not all) cases, simply by setting the system id on the returned InputSource. See here

I'll create a ticket for it with a demonstration of the problem later today.

[Edit] That is for entities that are relative a parent. This fix has no affect on relative entities without parents.