This repository contains scripts and documentation for backing up objects from DataSpace into separate storage, utilizing DSpace export and Python-based BagIt packaging workflows.
git clone git@github.com:pulibrary/dataspace_preservation.git
git clone git@github.com:pulibrary/dspace-python.git
cd into the dataspace_preservation directory
Install gems via Bundler:
bundle install
If you are backing up a collection that is not under the Senior Theses community, run list_arks.rb
to output a list of ARKs that can be used to create a manifest (the arguments supplied are examples):
bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x
You should see output something like the following:
88435/dsp01d791sj97j
88435/dsp01v405sc863
If you are backing up Senior Theses, run list_arks.rb
and supply a class year with the -c argument to output a list of ARKs that can be used to create a manifest (the arguments supplied are examples):
bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x -c 2019
You should see output something like the following:
88435/dsp014m90dz32m
88435/dsp01mp48sg59q
88435/dsp01gx41mm67g
88435/dsp017d278w83s
88435/dsp01w66346442
...
Redirect the list_arks.rb output to a manifest file, for example:
bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x > manifest
or
bundle exec ruby list_arks.rb -h https://dataspace-staging.princeton.edu -a ark:/88435/dsp0100000007x -c 2019 > manifest
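Because the manifest is captured by redirecting standard output, any stray warning or error text would end up in the file alongside the ARKs. The following is a minimal sketch of a sanity check, not part of this repository: `check_manifest` and `ARK_PATTERN` are hypothetical names, and the pattern is an assumption inferred only from the sample output shown above.

```python
import re

# Assumed ARK shape, inferred only from the sample output in this README
# (for example 88435/dsp01d791sj97j). Adjust it if real identifiers differ.
ARK_PATTERN = re.compile(r"^\d+/dsp[0-9a-z]+$")


def check_manifest(lines):
    """Split manifest lines into (valid, invalid) lists, skipping blank lines."""
    valid, invalid = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        if ARK_PATTERN.match(entry):
            valid.append(entry)
        else:
            invalid.append(entry)
    return valid, invalid
```

To use it, open the manifest file and pass the file object to `check_manifest`; anything returned in the second list is worth inspecting before the manifest is transferred to the server.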
Transfer the manifest to the DataSpace server (consult the RDSS team if you need help with the SSH configuration).
SSH to the appropriate DataSpace server (staging or production).
wget https://raw.githubusercontent.com/pulibrary/dataspace_preservation/main/export_from_dspace.sh
chmod +x export_from_dspace.sh
./export_from_dspace.sh manifest exports_directory
tar -cvf ~pulsys/exports_directory.tar exports_directory
gzip ~pulsys/exports_directory.tar
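Locally, copy the archive down from the server to local storage using scp, rsync, or something similar, for example:
scp pulsys@gcp_dataspace_prod1:exports_directory.tar.gz .
Extract the archive so that the exports directory is available locally:
tar -xzf exports_directory.tar.gz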
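Large archives can be truncated or corrupted in transit, so it may be worth comparing a checksum of the local copy against one computed on the server. The sketch below is illustrative only; `sha256_of` is a hypothetical helper and not something this repository provides.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of("exports_directory.tar.gz")` locally and comparing the result with the output of `sha256sum exports_directory.tar.gz` on the server should give matching digests if the copy is intact.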
cd to the dspace-python project directory and run the commands from the README
Run the BagIt code from dspace-python as follows:
python bagit-python/bagit.py ../dataspace_preservation/exports_directory/2019_theses
Here, ../dataspace_preservation/exports_directory/2019_theses
is an example path to the local copy of the DSpace exports directory that you populated with the copy command above.
Inspect the resulting bag directory. It should contain something like the following:
ls -la ../dataspace_preservation/exports_directory/2019_theses
bag-info.txt
bagit.txt
data/
manifest-sha256.txt
manifest-sha512.txt
tagmanifest-sha256.txt
tagmanifest-sha512.txt
Note that the data/ directory should contain all of the exported DSpace object directories.
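Beyond listing the directory, the payload can be checked against manifest-sha256.txt. The following is a rough, standard-library-only sketch and not the validation logic of bagit.py itself: `verify_bag` is a hypothetical name, and the sketch assumes each manifest line is a checksum followed by whitespace and a relative path, without handling percent-encoded file names.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Return a list of problems found in the bag; an empty list means the checks passed."""
    bag = Path(bag_dir)
    problems = []
    # Tag files and payload directory shown in the listing above.
    for name in ("bagit.txt", "bag-info.txt", "manifest-sha256.txt"):
        if not (bag / name).is_file():
            problems.append(f"missing {name}")
    if not (bag / "data").is_dir():
        problems.append("missing data/")
    manifest = bag / "manifest-sha256.txt"
    if not manifest.is_file():
        return problems
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            problems.append(f"malformed manifest line: {line}")
            continue
        expected, rel_path = parts
        payload = bag / rel_path
        if not payload.is_file():
            problems.append(f"missing payload file {rel_path}")
        # read_bytes() loads the whole file; acceptable for a sketch, not for huge files.
        elif hashlib.sha256(payload.read_bytes()).hexdigest() != expected.lower():
            problems.append(f"checksum mismatch for {rel_path}")
    return problems
```

An empty list from `verify_bag("../dataspace_preservation/exports_directory/2019_theses")` would suggest that the payload matches the manifest. If the bundled bagit.py offers its own validation, that is likely the better tool for a full check.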
From the exports directory that contains the bag, compress the bag directory as follows:
tar -czf 2019_theses.tgz 2019_theses
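Before the archive leaves local storage, it can be reassuring to confirm that it reads back cleanly and contains the expected top-level directory. This is a small sketch using Python's standard tarfile module; `archive_top_level` is a hypothetical helper name.

```python
import tarfile


def archive_top_level(archive_path):
    """Return the sorted top-level names inside a gzip-compressed tar archive."""
    # Listing every member forces the whole archive to be read, so a truncated
    # or corrupt file is expected to raise an exception here.
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted({name.split("/", 1)[0] for name in archive.getnames()})
```

For the example above, `archive_top_level("2019_theses.tgz")` would be expected to return a single entry, the bag directory name.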
Transfer the compressed backups to remote storage.