samvera / hydra-works

A ruby gem implementation of the PCDM Works domain model based on the Samvera software stack
Other
24 stars 14 forks source link

Characterization sans om #199

Closed grosscol closed 9 years ago

grosscol commented 9 years ago

Still a work in progress.

mjgiarlo commented 9 years ago

Can you squash your commits, @grosscol?

jcoyne commented 9 years ago

I need height and width for images and videos. Does this discard that?

grosscol commented 9 years ago

@jcoyne should height and width be included in the default config of metadata to extract?

Currently this just works by a simple mapping which is barely more complex than a hash of property values => element names. The config works well enough for parsing metadata from fits xml output.

So long as the class has a height and width property and the fits xml contains the data of interest, it should be straightforward to amend a class or instance's config so that it's picked up with a call to characterize.

tampakis commented 9 years ago

Fits doesn't get installed in travis. Skip the failing test in Travis?

jcoyne commented 9 years ago

Please do a brief write-up on why you're doing this work. Is this an improvement over what we already have?

grosscol commented 9 years ago

This PR was in response to #89. As @jcoyne points out, the methodology of this PR warrants some explanation. Specifically, what is being done here and why.

The two most glaring changes are:

  1. Using simple SAX parser instead of OM to parse fits output to datastream.
  2. Using smaller set of properties as the base set instead of more comprehensive list.
SAX instead of OM
TL;DR It seemed simpler to write a SAX parser for fits.xml than to figure out how to use OM to set the properties directly on the GenericFile.

OM is a very robust gem that can parse any XML, and has extensive configuration that would permit extraction of any element or attribute value. It uses a DOM parser, is schema aware, and is namespace aware. In short, OM is an all terrain industrial loader.

It's a piece of heavy construction equipment that appears to have been primarily used to drive down to the same fits.xml store. If the most that is going to be done with characterization is parsing a specific version of fits.sh output, it seems like a lighter solution might be more applicable. The non-OM solution was a SAX parser with a simple configuration.

Is it better? nope. I was under the impression that we wanted to get away from fits datastreams in favor of defining properties directly on the object, and this seemed like a reasonable approach for that.

Smaller default properties set
TL;DR having extra properties that aren't applicable to the content is confusing.

@escowles pointed me to a document produced by the hydra technical metadata group. It detailed some core technical metadata. It is a small default set that could be added to as needed by classes further downstream.

Having height and width methods on a GenericFile that contains an audio file was unsettling. Interrogating an object for the height and width I was not sure if characterization could not determine those, or if they were not supposed to be there in the first place. So the implementation in this PR punts that decision to the downstream projects and sticks to a minimal set of technical metadata.

escowles commented 9 years ago

:+1: to using SAX and having a smaller set of default properties. I think having a small default set and an example of how to extend it is the right approach, and a good approach for implementors to emulate.

jcoyne commented 9 years ago

@grosscol Now we are adding another piece of code and not really removing anything (OM is a dependency of AF). And the SAX parser isn't a big benefit on a tiny xml file such as that omitted by Fits. Most importantly, we are already using several fields defined by the om definition that are not property being captured by this new implementation (image height and width, video height, width and duration, audio duration). Finally, I'm not seeing any direction on how to extend this approach provided with this PR.

grosscol commented 9 years ago

@jcoyne I was removing the cognitive overhead of sorting out OM. The terminology configuration was difficult to understand, and is designed around using and keeping xml datastreams.

I agree that DOM vs FITS for small files is half of one six of the other. I was also working on an approach using OM. It ends up building an entire OM XML Document only to discard it. That seemed like building an entire scaffolding when we would only need a ladder.

Making a sub class of Generic file with additional properties and mappings is how I would expect downstream applications to add the properties they require. I don't see a benefit for having video height on a pdf or audio file. At best, that's not harmful.

IMG_FITS_TO_RDF = { image_width:  ["imageWidth",  "toolname=Jhove"],
                    image_height: ["imageHeight", "toolname=Jhove"],
                    color_space:  ["colorSpace",  "toolname=Jhove"] }

class Demo < Hydra::Works::GenericFile::Base
  include Hydra::Works::GenericFile::Characterization
  property :image_width, predicate: ::RDF::URI.new('http://NeedAPredicate.org/ns/hydra/imageW'), multiple: false
  property :image_height, predicate: ::RDF::URI.new('http://NeedAPredicate.org/ns/hydra/imageH'), multiple: false
  property :jhove_time, predicate: ::RDF::URI.new('http://NeedAPredicate.org/ns/hydra/executionTime'), multiple: false
  parse_config.merge!(IMG_FITS_TO_RDF)
end

I have a branch where I was using OM, and then trying to get the values from the resulting OM XML Document. I can go back to that branch and get it to an equivalent state.

jcoyne commented 9 years ago

I'm absolutely not attached to OM. I do think we should support a base list of predicates that can support all types of media because this is commonly for a self deposit system. We can't say only use image predicates if its an image, because it is FITS itself that tells us so.