“This ain’t my first rodeo.” (American slang for “I’m prepared for what comes next.”)
The DerivativeRodeo "moves" files from one storage location (e.g. input) to one or more storage locations (e.g. output) via a generator.
In the case of an input storage location (e.g. input_location), we expect that the underlying file pointed at by the input storage location exists. After all, we can't move what we don't have.
In the case of an output storage location (e.g. output_location), we expect that the underlying file will exist after the generator has completed. The file at the output storage location could already exist, or we might need to generate it.
There is also the concept of the pre_processed storage location: when the pre_processed storage location exists for the given input, we copy that pre_processed file to the output location and skip running the derivative generator on the input storage location. In other words, if we've already done the derivation elsewhere, use that.
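The decision described above can be sketched in plain Ruby. This is a minimal illustration that assumes local file paths; the `ensure_output` helper and its keyword arguments are hypothetical names, not part of the gem's API.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not the gem's API): make sure output_path exists,
# preferring work that has already been done over re-running the generator.
# The block stands in for a derivative generator; it receives the input path
# and returns the derivative's content. Returns a symbol naming the path taken.
def ensure_output(input_path:, output_path:, preprocessed_path:)
  # The output may already exist; nothing to do.
  return :already_exists if File.exist?(output_path)

  FileUtils.mkdir_p(File.dirname(output_path))

  # The derivation was done elsewhere; copy it instead of regenerating.
  if File.exist?(preprocessed_path)
    FileUtils.cp(preprocessed_path, output_path)
    return :copied_from_preprocessed
  end

  # Otherwise run the (stand-in) generator against the input.
  File.write(output_path, yield(input_path))
  :generated
end

Dir.mktmpdir do |dir|
  input = File.join(dir, 'page.txt')
  File.write(input, 'howdy')
  result = ensure_output(
    input_path: input,
    output_path: File.join(dir, 'out', 'page.up'),
    preprocessed_path: File.join(dir, 'pre', 'page.up')
  ) { |path| File.read(path).upcase }
  puts result
end
```

The order of the checks is the point of the sketch: the generator only runs when neither the output nor a pre-processed copy is available.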
During the generator's process, we need to have working copies of both the input and output files. This is done by creating temporary files.
In the case of the input, the creation of that temporary file involves getting the file from the input storage location. In the case of the output, we create a temporary file that the output storage location then knows how to move to the resulting place.
The Storage Lifecycle is as follows: input location to input tmp file to generator to output tmp file to output location.
Note: We've designed and implemented the data life cycle to automatically clean up the temporary files as the generator completes. In this way we can use the smallest working space possible, a design decision that helps run DerivativeRodeo within distributed clusters (e.g. AWS Serverless).
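As a rough sketch of that life cycle, the following plain Ruby uses `Tempfile.create` with blocks so that both working copies are removed as soon as the work completes. The `run_lifecycle` method is a hypothetical name, the locations are assumed to be local paths, and the caller's block stands in for the generator; none of this is the gem's own implementation.

```ruby
require 'fileutils'
require 'tempfile'
require 'tmpdir'

# Hypothetical sketch (not the gem's API) of:
# input location -> input tmp file -> generator -> output tmp file -> output location
# Returns the paths of the temporary files that were used (and already removed).
def run_lifecycle(input_location, output_location)
  tmp_paths = []
  Tempfile.create('input') do |input_tmp|
    Tempfile.create('output') do |output_tmp|
      tmp_paths << input_tmp.path << output_tmp.path

      # Get the file from the input location into a working copy.
      input_tmp.write(File.read(input_location))
      input_tmp.flush

      # The caller's block plays the role of the generator.
      output_tmp.write(yield(File.read(input_tmp.path)))
      output_tmp.flush

      # Move the working copy to its resulting place.
      FileUtils.mkdir_p(File.dirname(output_location))
      FileUtils.cp(output_tmp.path, output_location)
    end
  end
  # Tempfile.create with a block closes and removes each file when the block exits.
  tmp_paths
end

Dir.mktmpdir do |dir|
  input = File.join(dir, 'in.txt')
  File.write(input, 'rodeo')
  run_lifecycle(input, File.join(dir, 'out', 'out.txt')) { |content| content.upcase }
end
```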
In this case, common storage could mean the storage where we're writing all pre-processing of files. Or it could mean the storage where we're writing for application access (e.g. Fedora Commons for a Hyrax application).
In other words, the DerivativeRodeo is part of moving files from one location to another, and ensuring that at each step we have all of the expected files we want.
This is not strictly related to Hyrax's FileSet, which is a set of files in which one is considered the original and all others are derivatives of the original.
However, it is helpful to think in those terms: files that have a significant relation to each other, one derived from the other. For example, an original PDF and its extracted text would be two significantly related files.
Given a single original file in a previous home, we are copying that original file (and derivatives) to various locations:
Add this line to your application's Gemfile:
gem 'derivative-rodeo'
For historical reasons, the gem name is derivative-rodeo even though the repository is derivative_rodeo. The following "require" statements will all work:
require 'derivative_rodeo'
require 'derivative-rodeo'
require 'derivative/rodeo'
And then execute: $ bundle install
Be aware that you need the pdfinfo command line tool installed for this gem to run specs or when using PDF functionality.
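One way to check for that dependency up front is to look for the executable on the PATH. This is a small sketch in plain Ruby; `command_available?` is a hypothetical helper, not something the gem provides.

```ruby
# Hypothetical helper: true when an executable named `name` is found in one of
# the directories listed in the PATH environment variable.
def command_available?(name)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

unless command_available?('pdfinfo')
  warn 'pdfinfo not found on PATH; install it before running specs or using PDF functionality'
end
```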
TODO
Generators are responsible for ensuring that we have the file associated with the generator. For example, the HocrGenerator is responsible for ensuring that we have the .hocr file in the expected desired storage location.
Generators must have an initializer and build command:
.new(array_of_file_urls, output_location_template, preprocessed_location_template)
#generated_files (executes the generator's actions) and returns an array of files
#generated_uris (executes the generator's actions) and returns an array of output URIs
Below is the current list of generators.
.hocr file).
TODO: We want to expose a list of registered generators
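To make the generator contract above concrete, here is a toy generator in plain Ruby. It is not one of the gem's generators: the `UpcaseGenerator` class, the `{{basename}}` placeholder, the restriction to file:// URIs, and the optional third argument are all assumptions made for illustration.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical generator, not shipped with the gem. It upcases text files and
# assumes file:// URIs plus a made-up {{basename}} template placeholder.
class UpcaseGenerator
  def initialize(array_of_file_urls, output_location_template, preprocessed_location_template = nil)
    @input_uris = array_of_file_urls
    @output_location_template = output_location_template
    @preprocessed_location_template = preprocessed_location_template
  end

  # Executes the generator's actions (once) and returns an array of local file paths.
  def generated_files
    @generated_files ||= @input_uris.map { |uri| generate(uri) }
  end

  # Executes the generator's actions (if not already run) and returns output URIs.
  def generated_uris
    generated_files.map { |path| "file://#{path}" }
  end

  private

  # Fill in the template for the given input and strip the file:// scheme.
  def path_from(template, input_path)
    return nil if template.nil?

    template.gsub('{{basename}}', File.basename(input_path, '.*')).delete_prefix('file://')
  end

  def generate(input_uri)
    input_path = input_uri.delete_prefix('file://')
    output_path = path_from(@output_location_template, input_path)
    preprocessed_path = path_from(@preprocessed_location_template, input_path)
    FileUtils.mkdir_p(File.dirname(output_path))

    if preprocessed_path && File.exist?(preprocessed_path)
      FileUtils.cp(preprocessed_path, output_path) # reuse earlier work
    else
      File.write(output_path, File.read(input_path).upcase)
    end
    output_path
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'page.txt'), 'howdy')
  generator = UpcaseGenerator.new(["file://#{dir}/page.txt"], "file://#{dir}/out/{{basename}}.up")
  puts generator.generated_uris
end
```

Memoizing `generated_files` means calling `generated_uris` afterwards does not redo the work, which is one reasonable reading of "executes the generator's actions" for both methods.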
Storage locations are where we put things. Each location has a specific implementation but is expected to inherit from the DerivativeRodeo::StorageLocation::BaseLocation.
The DerivativeRodeo::StorageLocation::BaseLocation.locations method tracks the registered locations.
The location represents where the file should be.
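One common Ruby pattern for tracking subclasses like this is the `inherited` hook. The sketch below lives in its own `Sketch` module and is only an assumption about how such a registry could work; the source does not say how the gem's `locations` method is implemented or what it returns.

```ruby
module Sketch
  # Hypothetical stand-in for a base location class.
  class BaseLocation
    # The registry of subclasses, kept on the base class.
    def self.locations
      @locations ||= []
    end

    # Ruby calls this hook whenever a class inherits from BaseLocation.
    def self.inherited(subclass)
      super
      BaseLocation.locations << subclass
    end

    # Derive a short name from the class name, e.g. FileLocation -> "file".
    def self.location_name
      name.split('::').last.sub(/Location\z/, '').downcase
    end
  end

  class FileLocation < BaseLocation; end
  class S3Location < BaseLocation; end
end

puts Sketch::BaseLocation.locations.map(&:location_name).inspect
```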
Storage locations follow a URI pattern:
file:// :: “local” file system storage
s3:// :: AWS’s S3 storage system
sqs:// :: AWS’s SQS
Throughout the code you'll see references to the following concepts:
input_location_template
output_location_template
preprocessed_location_template
In Process Life Cycle we discussed the input_location, output_location, and preprocessed_location. The concept of the template provides flexibility in mapping one location to another.
Examples of mapping one file path to another are:
https://hello.com/world/GUID/file.jpg to file:///tmp/GUID/file.jpg.
file:///tmp/GUID/file.jpg to file:///tmp/GUID/file.hocr; that is, run OCR on an image and write a .hocr file.
file:///tmp/GUID/file.hocr to file:///tmp/GUID/file.coordinates.json; that is, convert the HOCR file to a coordinates.json file.
See DerivativeRodeo::Service::ConvertUriViaTemplateService for more details.
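The three mappings above can be reproduced with a small template function. This is a sketch only: the `convert_uri_via_template` method and the `{{parent}}`, `{{basename}}`, and `{{extension}}` placeholders are invented for illustration and are not necessarily the syntax that ConvertUriViaTemplateService accepts.

```ruby
# Hypothetical template conversion. The placeholder names are assumptions,
# not the gem's documented template syntax.
def convert_uri_via_template(from_uri, template)
  scheme, path = from_uri.split('://', 2)
  extension = File.extname(path)
  parts = {
    'scheme' => scheme,
    'parent' => File.basename(File.dirname(path)), # e.g. "GUID"
    'basename' => File.basename(path, extension),  # e.g. "file"
    'extension' => extension                       # e.g. ".jpg"
  }
  # Replace each {{name}} with its value; unknown names raise KeyError.
  template.gsub(/\{\{(\w+)\}\}/) { parts.fetch(Regexp.last_match(1)) }
end

puts convert_uri_via_template('https://hello.com/world/GUID/file.jpg', 'file:///tmp/{{parent}}/{{basename}}{{extension}}')
puts convert_uri_via_template('file:///tmp/GUID/file.jpg', 'file:///tmp/{{parent}}/{{basename}}.hocr')
puts convert_uri_via_template('file:///tmp/GUID/file.hocr', 'file:///tmp/{{parent}}/{{basename}}.coordinates.json')
```

Keeping the mapping in a template string, rather than in code, is what lets the same generator write to a different location without changing the generator itself.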
git clone https://github.com/scientist-softserv/derivative_rodeo
cd derivative_rodeo; bundle install
rake install_hooks
pdfinfo: provided by poppler (e.g. brew install poppler)
gs (Ghostscript): run brew install gs
Then go about writing your code and documentation.
The git hooks call rake default, which will run:
rubocop
rspec with simplecov
Throughout the DerivativeRodeo we log some activity. In a typical test run, that logging would be overly chatty. If you want the more chatty logs, run the following: DEBUG=t rspec
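A logger that responds to a DEBUG environment variable can be sketched with Ruby's standard Logger. The `build_logger` method and the choice of WARN as the quiet default are assumptions; the gem's actual logging setup may differ.

```ruby
require 'logger'
require 'stringio'

# Hypothetical sketch: quiet by default, chatty when DEBUG is set.
# `env` only needs to respond to #[], so ENV or a plain Hash both work.
def build_logger(io = $stderr, env = ENV)
  logger = Logger.new(io)
  logger.level = env['DEBUG'] ? Logger::DEBUG : Logger::WARN
  logger
end

logger = build_logger
logger.debug('only shown when DEBUG is set, e.g. DEBUG=t rspec')
```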
Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-softserv/derivative_rodeo.