Closed grosscol closed 9 years ago
Can you squash your commits, @grosscol?
I need height and width for images and videos. Does this discard that?
@jcoyne should height and width be included in the default config of metadata to extract?
Currently this just works by a simple mapping which is barely more complex than a hash of property values => element names. The config works well enough for parsing metadata from fits xml output.
So long as the class has a height and width property and the fits xml contains the data of interest, it should be straightforward to amend a class or instance's config so that it's picked up with a call to characterize.
Fits doesn't get installed in travis. Skip the failing test in Travis?
Please do a brief write-up on why you're doing this work. Is this an improvement over what we already have?
This PR was in response to #89. As @jcoyne points out, the methodology of this PR warrants some explanation. Specifically, what is being done here and why.
The two most glaring changes are:
OM is a very robust gem that can parse any XML, and has extensive configuration that would permit extraction of any element or attribute value. It uses a DOM parser, is schema aware, and is namespace aware. In short, OM is an all terrain industrial loader.
It's a piece of heavy construction equipment that appears to have been primarily used to drive down to the same fits.xml store. If the most that is going to be done with characterization is parsing a specific version of fits.sh output, it seems like a lighter solution might be more applicable. The non-OM solution was a SAX parser with a simple configuration.
Is it better? nope. I was under the impression that we wanted to get away from fits datastreams in favor of defining properties directly on the object, and this seemed like a reasonable approach for that.
@escowles pointed me to a document produced by the hydra technical metadata group. It detailed some core technical metadata. It is a small default set that could be added to as needed by classes further downstream.
Having height and width methods on a GenericFile that contains an audio file was unsettling. Interrogating an object for the height and width I was not sure if characterization could not determine those, or if they were not supposed to be there in the first place. So the implementation in this PR punts that decision to the downstream projects and sticks to a minimal set of technical metadata.
:+1: to using SAX and having a smaller set of default properties. I think having a small default set and an example of how to extend it is the right approach, and a good approach for implementors to emulate.
@grosscol Now we are adding another piece of code and not really removing anything (OM is a dependency of AF). And the SAX parser isn't a big benefit on a tiny xml file such as that omitted by Fits. Most importantly, we are already using several fields defined by the om definition that are not property being captured by this new implementation (image height and width, video height, width and duration, audio duration). Finally, I'm not seeing any direction on how to extend this approach provided with this PR.
@jcoyne I was removing the cognitive overhead of sorting out OM. The terminology configuration was difficult to understand, and is designed around using and keeping xml datastreams.
I agree that DOM vs FITS for small files is half of one six of the other. I was also working on an approach using OM. It ends up building an entire OM XML Document only to discard it. That seemed like building an entire scaffolding when we would only need a ladder.
Making a sub class of Generic file with additional properties and mappings is how I would expect downstream applications to add the properties they require. I don't see a benefit for having video height on a pdf or audio file. At best, that's not harmful.
IMG_FITS_TO_RDF = { image_width: ["imageWidth", "toolname=Jhove"],
image_height: ["imageHeight", "toolname=Jhove"],
color_space: ["colorSpace", "toolname=Jhove"] }
class Demo < Hydra::Works::GenericFile::Base
include Hydra::Works::GenericFile::Characterization
property :image_width, predicate: ::RDF::URI.new('http://NeedAPredicate.org/ns/hydra/imageW'), multiple: false
property :image_height, predicate: ::RDF::URI.new('http://NeedAPredicate.org/ns/hydra/imageH'), multiple: false
property :jhove_time, predicate: ::RDF::URI.new('http://NeedAPredicate.org/ns/hydra/executionTime'), multiple: false
parse_config.merge!(IMG_FITS_TO_RDF)
end
I have a branch where I was using OM, and then trying to get the values from the resulting OM XML Document. I can go back to that branch and get it to an equivalent state.
I'm absolutely not attached to OM. I do think we should support a base list of predicates that can support all types of media because this is commonly for a self deposit system. We can't say only use image predicates if its an image, because it is FITS itself that tells us so.
Still a work in progress.