mjordan / islandora_bagger

Tool for generating Bags for Islandora 8 content.
MIT License

Islandora Bagger

Utility to generate Bags for objects using Islandora's REST interface, either from a command-line tool or via a batch-oriented queue. In addition, Islandora Bagger provides its own REST interface that allows population of the queue. Specific content is added to the Bag's data directory and bag-info.txt file using plugins. Bags are compliant with version 1.0 of the BagIt specification. If you want to allow your Islandora users to initiate the creation of Bags, install the Islandora Bagger Integration module.

This utility is for Islandora 8.x-1.x. For creating Bags for Islandora 7.x, use Islandora Fetch Bags.

Requirements

Installation

  1. Clone this git repository.
  2. cd islandora_bagger
  3. php composer.phar install (or equivalent on your system, e.g., ./composer install)
  4. If you want to try Islandora Bagger's REST interface, you can install the Symfony Local Web Server. Note that this server is part of the Symfony binary, which is not required by Islandora Bagger otherwise.

Configuration

Even though each Bag is created using options defined in its own configuration file (see next section), Islandora Bagger uses several application-wide configuration options defined in the parameters section of config/services.yaml.

You probably don't need to change app.queue.path and app.location.log.path since these specify default locations for some data files. However, if you are providing the ability for users to download serialized Bags, you will need to change the app.bag.download.prefix parameter to the hostname/path to prepend to each Bag's filename as described in the "Making Bags downloadable" section below.

Command-line usage

The command to generate a Bag takes two required parameters, --settings and --node. Assuming the configuration file is named sample_config.yml, and the Drupal node ID you want to generate a Bag from is 112, the command would look like this:

./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112

A third parameter, --extra, is explained in the "Passing settings via the command line" section below.
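Because the same configuration file can be reused for many nodes, it can be convenient to assemble the command programmatically. The following Python sketch builds the argument list for the command shown above. The helper name build_create_bag_command is hypothetical, and the sketch uses only the --settings, --node, and --extra options described in this document.

```python
import json


def build_create_bag_command(settings_path, node_id, extra=None):
    """Return the argument list for app:islandora_bagger:create_bag."""
    command = [
        "./bin/console",
        "app:islandora_bagger:create_bag",
        f"--settings={settings_path}",
        f"--node={node_id}",
    ]
    if extra:
        # --extra takes a serialized JSON object of settings.
        command.append(f"--extra={json.dumps(extra)}")
    return command


# Print the command for a couple of example node IDs.
for node_id in (112, 113):
    print(" ".join(build_create_bag_command("sample_config.yml", node_id)))
```

The returned list can be passed to subprocess.run() from within the islandora_bagger directory.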

The per-Bag configuration file

For each Bag it creates, Islandora Bagger requires a configuration file in YAML format:

####################
# General settings #
####################

# Required.
drupal_base_url: 'http://localhost:8000'
drupal_basic_auth: ['admin', 'islandora']

# Register creation of this Bag with Islandora Bagger Integration. Default is false.
register_bags_with_islandora: true

# Required. How to name the Bag directory (or file if serialized). One of 'nid' or 'uuid'.
bag_name: nid

# Optional. Template for the Bag name. The % is replaced by the nid or uuid (depending on
# the value of "bag_name") in the name of the Bag directory (or file if serialized). If absent,
# the bare value of the nid or uuid is used.
# bag_name_template: sfu_aip_%

# Both temp_dir and output_dir are required.
temp_dir: /tmp/islandora_bagger_temp
output_dir: /tmp

# Required. Whether or not to zip up the Bag. One of 'false', 'zip', or 'tgz'.
serialize: zip

# Required. Whether or not to log Bag creation. Set log output path in config/packages/{environment}/monolog.yaml.
log_bag_creation: true

# Optional. Static bag-info.txt tags. No plugin needed. You can use any combination
# of tag name / value here, as long as you separate tags from values using a colon (:).
bag-info:
    Contact-Name: Mark Jordan
    Contact-Email: bags@sfu.ca
    Source-Organization: Simon Fraser University
    Foo: Bar

# Optional. Whether or not to include the Payload-Oxum tag in bag-info.txt. Defaults to true.
# include_payload_oxum: false

# Optional. Which hash algorithm(s) to use.
# One of md5, sha1, sha224, sha256, sha384, sha512, sha3224, sha3256, sha3384, sha3512,
# or a list of values. Default is sha512.
# hash_algorithm: md5
# hash_algorithm: [md5, sha1, sha256]

# Optional. Timeout to use for Guzzle requests, in seconds. Default is 60.
# http_timeout: 120

# Optional. Whether or not to verify the Certificate Authority in Guzzle requests
# against websites that implement HTTPS. Used on Mac OSX if Islandora Bagger is
# interacting with websites running HTTPS. Default is true. Note that if you set
# verify_ca to false, the remote website's certificate is not verified, so the
# HTTPS connection no longer protects against impersonation of that website.
# Use at your own risk.
# verify_ca: false

# Optional. Whether or not to delete the settings file upon successful creation
# of the Bag. Default is false.
# delete_settings_file: true

# Optional. Whether or not to log the serialized Bag's location so Islandora can
# retrieve the Bag's download URL. Default is false.
# log_bag_location: true

############################
# Plugin-specific settings #
############################

# Required. Register plugins to populate bag-info.txt and the data directory.
# Plugins are executed in the order they are listed here.
plugins: ['AddBasicTags', 'AddMedia', 'AddNodeJson', 'AddNodeJsonld', 'AddMediaJson', 'AddMediaJsonld', 'AddFileFromTemplate', 'AddFedoraTurtle', 'AddNodeCsv']

# Used by the 'AddFedoraTurtle' plugin.
fedora_base_url: 'http://localhost:8080/fcrepo/rest/'

# Used by the 'AddMedia' plugin. These are the Drupal taxonomy term IDs
# from the "Islandora Media Use" vocabulary. Use an empty list (e.g., [])
# to include all media.
drupal_media_tags: ['/taxonomy/term/16']

# Used by the 'AddMedia' plugin. Indicates whether the Bag should contain a file
# named 'media_use_summary.tsv' that lists all the media files plus the taxonomy
# name corresponding to the 'drupal_media_tags' list. Default is false.
include_media_use_list: true

# Used by the 'AddMedia' plugin. Include this option to save media files in the
# specified subdirectories within the Bag's data directory. Include the trailing /.
# media_file_directories: 'foo/bar/baz/'

# Used by the 'AddFileFromTemplate' plugin.
# template_path can be absolute or relative to the Islandora Bagger directory.
template_path: 'templates/mods.twig'
# templated_output_filename will be assigned to the file generated from the template,
# which will be added to the Bag's data directory. You may include a subdirectory
# or subdirectories as part of the filename.
templated_output_filename: 'metadata/MODS.xml'

# Used by the 'AddNodeCsv' plugin.
# csv_output_filename will be assigned to the CSV file, which will be added to
# the Bag's data directory. You may include a subdirectory or subdirectories
# as part of the filename.
csv_output_filename: 'metadata.csv'

####################
# Post-Bag scripts #
####################

# post_bag_scripts: ["php /tmp/test.php", "python /path/to/script.py"]

The resulting Bag would look like this:

/tmp/112
├── bag-info.txt
├── bagit.txt
├── data
│   ├── IMG_1410.JPG
│   ├── media.json
│   ├── media.jsonld
│   ├── node.json
│   ├── node.jsonld
│   ├── metadata
│   │   └── MODS.xml
│   ├── metadata.csv
│   ├── media_use_summary.tsv
│   └── node.turtle.rdf
├── manifest-sha1.txt
└── tagmanifest-sha1.txt

Since the Drupal node's ID is not included in the configuration file, the same file can be used for multiple Bags. It is called a 'per-Bag' configuration file because it is used each time Islandora Bagger creates a Bag.
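To illustrate the bag_name and bag_name_template options from the configuration file above, the following Python sketch shows how a Bag's name could be derived from a node ID or UUID. It is an illustration of the documented behavior, not Islandora Bagger's own code, and the function name derive_bag_name is hypothetical.

```python
def derive_bag_name(identifier, template=None):
    """Apply a bag_name_template such as 'sfu_aip_%' to a nid or uuid."""
    if template is None:
        # No template configured: the bare nid or uuid is used.
        return str(identifier)
    # The % placeholder is replaced by the nid or uuid.
    return template.replace("%", str(identifier))


print(derive_bag_name(112))
print(derive_bag_name(112, "sfu_aip_%"))
```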

Placing per-Bag configuration options in services.yaml

In some cases, you may want to define configuration options in config/services.yaml that are normally defined in the per-Bag configuration file. The most common reasons to do this are 1) to keep sensitive data such as login credentials out of the per-Bag configuration files and 2) to centralize commonly used options in one place rather than repeat them in each per-Bag configuration file.

To do this, define the options from the per-Bag configuration file in config/services.yaml and prefix each key with app. (note the trailing period). For example, to define drupal_base_url and drupal_basic_auth in config/services.yaml, do the following:

1) Comment them out or remove them from the per-Bag file:

# Required.
# drupal_base_url: 'http://localhost:8000'
# drupal_basic_auth: ['admin', 'islandora']

2) Define them in the parameters section of config/services.yaml and prefix each option key with app. (note the trailing period):

parameters:
    app.queue.path: '%kernel.project_dir%/var/islandora_bagger.queue'
    app.location.log.path: '%kernel.project_dir%/var/islandora_bagger.locations'
    # The hostname/path to where users can download serialized bags. This string
    # will be prepended to the Bag's filename.
    app.bag.download.prefix: 'http://example.com/bags/'

    # These options are usually defined in the per-Bag config file.
    app.drupal_base_url: 'http://localhost:8000'
    app.drupal_basic_auth: ['admin', 'islandora']

A couple of things to note about this:

Passing settings via the command line

You can pass settings to Islandora Bagger on the command line using the optional --extra parameter:

./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112 --extra='{"serialize": "tgz", "hash_algorithm": "md5"}'

The value of this parameter is a serialized JSON object containing key:value pairs of settings. Key:value pairs passed in this way are added to the config settings and override settings of the same name in the config file and in config/services.yaml.
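The override behavior of --extra can be pictured as a dictionary merge. The following Python sketch is an illustration only, not Islandora Bagger's implementation. The function name apply_extra is hypothetical, and the starting settings stand in for values already loaded from the config file.

```python
import json


def apply_extra(settings, extra_json):
    """Return a copy of settings with the --extra key:value pairs applied."""
    merged = dict(settings)
    # Keys passed in --extra are added, and replace existing keys of the same name.
    merged.update(json.loads(extra_json))
    return merged


config_file_settings = {"serialize": "zip", "bag_name": "nid"}
print(apply_extra(config_file_settings, '{"serialize": "tgz", "hash_algorithm": "md5"}'))
```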

REST interface usage

Islandora Bagger can also initiate the creation of Bags via a simple REST interface. It does this by 1) receiving a POST request containing the node ID of the Islandora object to be bagged in an "Islandora-Node-ID" header and 2) receiving a YAML configuration file as the body of the request. Using this data, it adds the request to a queue (see below), which is then processed at a later time. The REST interface also provides the ability to GET a Bag's download URL.

Note that requests to the REST interface do not generate Bags directly; they only populate a queue as described below.

To use the REST API to add a Bag-creation job to the queue:

  1. Run symfony server:start
  2. Prepare a YAML configuration file for posting to the REST API.
  3. Run curl -v -X POST -H "Islandora-Node-ID: 4" --data-binary "@sample_config.yml" http://127.0.0.1:8001/api/createbag
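The same request can be made from a script. The following Python sketch mirrors the curl command above using only the standard library. The function name build_createbag_request is hypothetical, the base URL is the one used in the curl example, and the one-line YAML body is a placeholder for a real configuration file. The request is built but not sent.

```python
import urllib.request


def build_createbag_request(node_id, config_bytes, base_url="http://127.0.0.1:8001"):
    """Build (but do not send) the request that queues a Bag-creation job."""
    return urllib.request.Request(
        f"{base_url}/api/createbag",
        data=config_bytes,  # the YAML configuration file is the request body
        headers={"Islandora-Node-ID": str(node_id)},
        method="POST",
    )


request = build_createbag_request(4, b"serialize: zip\n")
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen() while the web server is running.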

To use the REST API to get a serialized Bag's location for download:

  1. Make sure your configuration file's serialize setting is either "zip" or "tgz", and the log_bag_location setting is true.
  2. Create a Bag using the command-line or via a REST POST request.
  3. Start the web server, as above, if not already started.
  4. Run curl -v -H "Islandora-Node-ID: 4" http://127.0.0.1:8001/api/createbag. Your response will be a JSON string containing the node ID, the Bag's location, and an ISO 8601 timestamp of when the Bag was created, e.g.:

    {"nid":"4","location":"http:\/\/example.com\/bags\/4.zip","created":"2019-05-06T19:31:33-0700"}
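A client can decode this response with any JSON library. The following Python sketch parses the example response above and converts the timestamp into a timezone-aware datetime. The function name parse_bag_location is hypothetical.

```python
import json
from datetime import datetime

# The example response from above; the raw string keeps the \/ escapes intact.
response_body = r'{"nid":"4","location":"http:\/\/example.com\/bags\/4.zip","created":"2019-05-06T19:31:33-0700"}'


def parse_bag_location(body):
    """Decode the JSON response and convert 'created' to a datetime."""
    record = json.loads(body)  # JSON decoding turns \/ back into /
    record["created"] = datetime.strptime(record["created"], "%Y-%m-%dT%H:%M:%S%z")
    return record


record = parse_bag_location(response_body)
print(record["location"])
```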

A couple of things to note about this REST API:

Making Bags downloadable

As described in the previous section, the location of each Bag is available via a GET request to Islandora Bagger's REST interface. If you want to use this information to provide a way to download Bags from Islandora Bagger, follow these steps:

GET requests to the REST API will now return location values that contain URLs that combine the path specified in app.bag.download.prefix with the serialized Bag's filename.

This is insecure, since anyone who can guess the path to a Bag will have access to it. Please join the discussion at this issue if you have a suggestion on implementing more robust security on Bag downloads.

Another approach is to use a post-Bag script (see below) to copy the Bag to a location from where it can be downloaded, and to email the user with the location.

The queue

Islandora Bagger implements a simple processing queue, which is populated mainly by REST requests to generate Bags. However, the queue can be populated by any process (manually, scripted, etc.). Islandora Bagger processes the queue by inspecting each entry in first-in, first-out order. For each entry, it runs the app:islandora_bagger:create_bag command, which creates the Bag by fetching the files and other data from the Islandora instance as defined in that entry's configuration file.

The queue is a simple tab-delimited text file that contains one entry per line. The three fields in each entry are 1) the node ID, 2) the full path to the YAML configuration file, and 3) an ISO 8601 timestamp, e.g.:

2073 /home/mark/Documents/hacking/islandora_bagger/var/islandora_bagger.2073.yaml 2020-09-14T19:01:46-0700
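Since any process can populate or read the queue, a script only needs to split each line on tabs. The following Python sketch parses the example entry above into its three fields. The QueueEntry type and parse_queue_line function are hypothetical names used for illustration.

```python
from typing import NamedTuple


class QueueEntry(NamedTuple):
    node_id: str
    config_path: str
    timestamp: str


def parse_queue_line(line):
    """Split one tab-delimited queue line into its three fields."""
    node_id, config_path, timestamp = line.rstrip("\n").split("\t")
    return QueueEntry(node_id, config_path, timestamp)


entry = parse_queue_line(
    "2073\t/home/mark/Documents/hacking/islandora_bagger/var/islandora_bagger.2073.yaml\t2020-09-14T19:01:46-0700\n"
)
print(entry.node_id, entry.config_path, entry.timestamp)
```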

To process the queue, run the following command:

./bin/console app:islandora_bagger:process_queue --queue=var/islandora_bagger.queue

where the value of the --queue option is the path to the queue file. Run this command as needed, or from a scheduled job managed by cron. The command iterates through the queue in first-in, first-out order. Once an entry is processed, it is removed from the queue. You can also optionally specify how many queue entries to process by including the --entries option, e.g., ./bin/console app:islandora_bagger:process_queue --queue=var/islandora_bagger.queue --entries=100
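The effect of the --entries option on a first-in, first-out queue can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the command's implementation. It assumes new entries are added at the end of the queue file, so the oldest entries come first. The function name split_queue and the /path/to/ configuration paths are hypothetical.

```python
def split_queue(entries, limit=None):
    """Return (entries to process now, entries that stay in the queue)."""
    if limit is None:
        # No --entries value: every queued entry is processed.
        return list(entries), []
    # The oldest entries come first, so they are taken first.
    return list(entries[:limit]), list(entries[limit:])


queue = [
    "2073\t/path/to/islandora_bagger.2073.yaml\t2020-09-14T19:01:46-0700",
    "2074\t/path/to/islandora_bagger.2074.yaml\t2020-09-14T19:05:10-0700",
    "2075\t/path/to/islandora_bagger.2075.yaml\t2020-09-14T19:09:02-0700",
]
to_process, remaining = split_queue(queue, limit=2)
print([line.split("\t")[0] for line in to_process])
print([line.split("\t")[0] for line in remaining])
```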

Inspecting the queue

Since the queue file is just a plain tab-separated value file, looking at its contents can be done in a variety of ways (opening it in a text editor, using cat, etc.). Islandora Bagger offers two other ways of inspecting the queue:

  1. via the console command app:islandora_bagger:get_queue (e.g. ./bin/console app:islandora_bagger:get_queue --queue=var/islandora_bagger.queue --output_format=json)
  2. via the REST interface (e.g. curl -v http://127.0.0.1:8000/api/queue)

In both cases, the output is a serialized JSON object containing each item in the queue. The console command can also print the raw queue if the --output_format option has a value of "csv".

Customizing the Bags

Customizing the generated Bags is done via values in the configuration file and via plugins.

Configuration file

Items in the "General settings" section provide some simple options for customizing Bags, e.g.:

Plugins

Apart from the static tags mentioned in the previous section, all file content and additional tags are added to the Bag using plugins. Plugins are registered in the plugins section of the configuration file.

Included plugins

The following plugins are bundled with Islandora Bagger:

Writing custom plugins

Each plugin is a PHP class that extends the base AbstractIbPlugin class. The Sample.php plugin illustrates what you can (and must) do within a plugin. Plugins are located in the islandora_bagger/src/Plugin directory, and must implement an execute() method. Within that method, you have access to the Bag object, the Bag temporary directory, the node's ID, and the node's JSON representation from Drupal. You also have access to all values in the configuration file via the $this->settings associative array.

To use a custom plugin, simply register its class name in the plugins list in your configuration file.

Post-Bag scripts

The post_bag_scripts option in the configuration file allows you to specify a list of scripts to run after the Bag has been successfully created. These scripts can send email messages, copy Bag files to alternate locations, and other tasks. You can include any script, in any language, with the following constraints:

In the YAML configuration file, you can define any options needed by your scripts. For example, if your script /opt/utils/send_bag_notice.py requires an email address to send its notice to, you can include that option's value in your configuration file, as long as the script can parse YAML files:

####################
# Post-Bag scripts #
####################

post_bag_scripts: ["python /opt/utils/send_bag_notice.py"]
recipient_email: preservation@example.ca

Then within your script, you would have access to the value of recipient_email. Within your scripts, you have access to all options used by Islandora Bagger's app:islandora_bagger:create_bag command, and you can define any additional options you need as long as they don't have the same key names as existing values.
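As a rough sketch of what such a script might contain, the following Python function pulls a simple top-level option such as recipient_email out of the configuration text. It makes several assumptions. The function name read_top_level_option is hypothetical, and it handles only one-line scalar values, so a real script would more likely use a full YAML library. How the script obtains the configuration file's contents is not shown here.

```python
def read_top_level_option(yaml_text, key):
    """Return the value of a simple top-level 'key: value' line, or None.

    Handles only one-line scalar values; this is not a YAML parser.
    """
    for line in yaml_text.splitlines():
        if line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        # Indented (nested) keys keep their leading spaces, so they never match.
        if name == key:
            return value.strip().strip("'\"")
    return None


config_text = """\
post_bag_scripts: ["python /opt/utils/send_bag_notice.py"]
recipient_email: preservation@example.ca
"""
print(read_top_level_option(config_text, "recipient_email"))
```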

To do

Current maintainer

Contributing

See CONTRIBUTING.md.

License

MIT