pulibrary / dataspace_preservation

Scripts and workflow documentation for backing up DataSpace objects

DataSpace Preservation Workflows

This repository contains scripts and documentation for backing up objects from DataSpace into separate storage, utilizing DSpace export and Python-based BagIt packaging workflows.

Resources

Requirements

Setup

  1. Install requirements.
    1. Clone this repo: git clone git@github.com:pulibrary/dataspace_preservation.git
    2. Also clone dspace-python: git clone git@github.com:pulibrary/dspace-python.git

Instructions

  1. cd into the dataspace_preservation directory

  2. Install gems via Bundler: bundle install

  3. If you are backing up a collection that is not under the Senior Theses community, run list_arks.rb to output a list of ARKs that can be used to create a manifest (the arguments shown are examples):

    bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x

    You should see output something like the following:

    88435/dsp01d791sj97j
    88435/dsp01v405sc863
  4. If you are backing up Senior Theses, run list_arks.rb with a class year (-c) as an additional command line argument to output a list of ARKs that can be used to create a manifest (the arguments shown are examples):

    bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x -c 2019

    You should see output something like the following:

    88435/dsp014m90dz32m
    88435/dsp01mp48sg59q
    88435/dsp01gx41mm67g
    88435/dsp017d278w83s
    88435/dsp01w66346442
    ...
  5. Redirect the list_arks.rb output to a manifest file. Examples:

    bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x > manifest

    or

    bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x -c 2019 > manifest
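
Before transferring the manifest, it can help to confirm that every line looks like an ARK. The following is a minimal Python sketch, not part of this repository: the function name find_invalid_arks and the pattern are assumptions, with the pattern inferred only from the sample output above (a numeric prefix, a slash, then dsp followed by lowercase letters and digits).

```python
import re

# Assumed ARK shape, inferred only from the sample list_arks.rb output:
# digits, a slash, then "dsp" followed by lowercase letters and digits.
ARK_PATTERN = re.compile(r"^\d+/dsp[0-9a-z]+$")


def find_invalid_arks(lines):
    """Return (line_number, text) pairs for non-blank lines that do not look like ARKs."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        # Blank lines are skipped; anything else must match the assumed pattern.
        if text and not ARK_PATTERN.match(text):
            problems.append((number, text))
    return problems
```

Passing an open manifest file to find_invalid_arks works because iterating a file yields its lines; an empty result means every non-blank line matched the assumed pattern.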
  6. Transfer the manifest to the DataSpace server. Consult the RDSS team if you need help with the SSH configuration.

  7. SSH to the appropriate DataSpace server (staging or production).

    1. Become the dspace user
    2. Create a directory on the server where your exports will be stored temporarily.
    3. Download the export_from_dspace.sh script:
      wget https://raw.githubusercontent.com/pulibrary/dataspace_preservation/main/export_from_dspace.sh
    4. Run it with the manifest and exports directory as command line arguments. wget does not set the executable bit, so run chmod +x export_from_dspace.sh first if needed. Example:
      ./export_from_dspace.sh manifest exports_directory
    5. Tar and gzip the data. gzip takes the .tar file as input and replaces it with exports_directory.tar.gz:
      tar -cvf ~pulsys/exports_directory.tar exports_directory
      gzip ~pulsys/exports_directory.tar
  8. Locally, copy the archive down from the server to local storage using scp, rsync, or something similar, then extract it. For example:

    scp pulsys@gcp_dataspace_prod1:exports_directory.tar.gz .
    tar -xzf exports_directory.tar.gz
  9. cd to the dspace-python project directory and run the commands from the README

  10. Run the BagIt code from dspace-python as follows:

    python bagit-python/bagit.py ../dataspace_preservation/exports_directory/2019_theses

    Here, exports_directory/2019_theses is an example path to the local copy of the DSpace exports directory that you copied down from the server in step 8.

  11. Inspect the exports directory. It should look something like the following:

    ls ../dataspace_preservation/exports_directory/2019_theses
    bag-info.txt
    bagit.txt
    data/
    manifest-sha256.txt
    manifest-sha512.txt
    tagmanifest-sha256.txt
    tagmanifest-sha512.txt

    Note that the data/ directory should contain all of the exported DSpace object directories.
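
The inspection in step 11 can also be done programmatically. This is a minimal sketch, assuming the bag should contain the entries shown in the listing above; missing_bag_entries and check_bag_directory are hypothetical helper names, not part of dspace-python or bagit-python.

```python
import os

# Entries taken from the sample directory listing in step 11.
EXPECTED_BAG_ENTRIES = [
    "bag-info.txt",
    "bagit.txt",
    "data",
    "manifest-sha256.txt",
    "manifest-sha512.txt",
    "tagmanifest-sha256.txt",
    "tagmanifest-sha512.txt",
]


def missing_bag_entries(names):
    """Return the expected bag entries absent from `names`, in listing order."""
    present = set(names)
    return [entry for entry in EXPECTED_BAG_ENTRIES if entry not in present]


def check_bag_directory(path):
    """List `path` on disk and report which expected entries are missing."""
    return missing_bag_entries(os.listdir(path))
```

An empty list from check_bag_directory means every expected entry is present. It does not verify checksums or the contents of data/.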

  12. Compress the bag directory as follows:

    tar -czf 2019_theses.tgz 2019_theses
  13. Transfer the compressed backups to remote storage.
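
Before transferring, one option is to record a SHA-256 checksum of the compressed backup so the copy on remote storage can be compared against it later. This is a suggestion rather than part of the documented workflow, and the function names below are hypothetical.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Return the hex SHA-256 digest of an iterable of byte strings."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large archive is never fully in memory."""
    with open(path, "rb") as handle:
        # iter(callable, sentinel) keeps calling read() until it returns b"" at end of file.
        return sha256_of_chunks(iter(lambda: handle.read(chunk_size), b""))
```

Calling sha256_of_file("2019_theses.tgz") returns a hex digest that can be stored alongside the archive and recomputed after the transfer.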