matsengrp / cft

Clonal family tree

Archiving data for manuscripts #287

Closed: lauradoepker closed this issue 4 years ago

lauradoepker commented 5 years ago

For (Overbaugh) manuscripts, we'd like to compile all the Matsen-group-generated data used into one directory and zip it up (zip or tgz?) for safekeeping, where it will join all the other Overbaugh-generated documents used for said manuscript. I envision the steps of this process to be:

1) The project lead would COPY the data from all relevant Matsen file locations into a new directory, also on the Matsen group servers (e.g., I would copy partis output, CFT output, ecgtheow output, and linearham output all into a new directory named "QA255_lineages_data" or something); see the sketch after this list.

2) When the manuscript is first submitted, we would zip up this data and move the archive to join the Overbaugh documents (in Laura's possession on the Overbaugh/general FH servers)

3) When the manuscript is resubmitted and accepted, all datasets would be edited if needed, zipped up together, and stored on the FH long-term storage servers (?)
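To make step 1 concrete, here is a minimal Python sketch; every path is a hypothetical stand-in for the real partis/CFT/ecgtheow/linearham output locations:

    import shutil
    from pathlib import Path

    # Hypothetical source locations on the Matsen group servers; substitute
    # the real output paths for the manuscript in question.
    sources = {
        "partis": Path("/fh/fast/matsen_e/project/partis-output"),
        "cft": Path("/fh/fast/matsen_e/project/cft-output"),
        "ecgtheow": Path("/fh/fast/matsen_e/project/ecgtheow-output"),
        "linearham": Path("/fh/fast/matsen_e/project/linearham-output"),
    }

    # The new directory that will hold one copy of everything.
    archive_dir = Path("/fh/fast/matsen_e/QA255_lineages_data")
    archive_dir.mkdir(parents=True, exist_ok=True)

    for name, src in sources.items():
        # COPY, not move: the originals stay where they are.
        shutil.copytree(src, archive_dir / name)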

psathyrella commented 5 years ago

We've been pretty happy using Zenodo (https://zenodo.org/) for this sort of thing. It seems to me it would be nicer if, when people wanted something, they didn't have to email us to copy it from the FH servers (if that was what you meant).

eharkins commented 4 years ago

We are using Cyberduck (a desktop app for connecting to Hutch economy storage: https://sciwiki.fredhutch.org/compdemos/Mountain-CyberDuck/).

We are zipping directories containing data using

tar -czvf name-of-archive.tar.gz /path/to/directory-or-file

and then copying these .tar.gz files to matsen_e economy storage (set up according to the link above) with the following structure: I created a top-level overbaugh_data_archive directory and, inside it, an example study directory with a README.md, i.e. (following the target path convention used below):

    overbaugh_data_archive/
      YYYY-MM-DD_study/
        README.md
        name-of-archive.tar.gz
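If the tar step above ever needs scripting, a minimal Python equivalent (same placeholder names as in the command):

    import tarfile

    # Equivalent of: tar -czvf name-of-archive.tar.gz /path/to/directory-or-file
    with tarfile.open("name-of-archive.tar.gz", "w:gz") as tar:
        tar.add("/path/to/directory-or-file")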

@lauradoepker please close this issue if and when you have success with this, thanks!

eharkins commented 4 years ago

Reopening as there was an issue with the following file:

https://tin.fhcrc.org/v1/AUTH_Swift_matsen_e/overbaugh_data_archive/2019-10-11_MiSeq_data_lnoges/NGS_raw_data_161207.gz

It was corrupted somehow in the process of compression or upload to economy file storage. Laura has the original data, so we need to recompress and re-upload it.

Furthermore, she had to get this original data from the Sather group, so it would be worth keeping a copy somewhere other than economy file storage so that we have a backup.
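One cheap way to catch this kind of corruption before relying on an archive is to read the gzip stream end to end (gzip -t does the same thing from the shell); a minimal sketch, using the affected filename as an example:

    import gzip

    # Reading the whole stream forces gzip's CRC check; a truncated or
    # corrupted file raises an error before the final print is reached.
    with gzip.open("NGS_raw_data_161207.gz", "rb") as f:
        while f.read(1 << 20):  # 1 MiB at a time
            pass
    print("gzip stream is intact")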

eharkins commented 4 years ago

@matsen and I discussed having a procedure/script which does something like: upload the archive, then download it back and check that the download succeeded and the file is still valid gzip.

My first attempt is in a gist (it includes the Swift server name for economy storage, but not the credentials necessary to access it; if that's bad, I will delete it).

It assumes the file is already zipped.

We should discuss in person how this should work, including any risks of relying on something like this to validate a successful upload.
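For discussion purposes, here is a rough sketch of the shape such a script could take; it is not the gist itself, and it assumes the swc upload and swc download subcommands described in the sciwiki docs:

    import gzip
    import os
    import subprocess
    import sys
    import tempfile

    def upload_and_verify(src, targ):
        # Push the archive to economy storage via Swift Commander (swc).
        subprocess.run(["swc", "upload", src, targ], check=True)

        # Round-trip it: download into a temp dir and check the result.
        with tempfile.TemporaryDirectory() as tmp:
            local_copy = os.path.join(tmp, os.path.basename(src))
            subprocess.run(["swc", "download", targ, local_copy], check=True)

            # A failed upload shows up as a size mismatch or a bad gzip stream.
            if os.path.getsize(local_copy) != os.path.getsize(src):
                sys.exit("size mismatch after round trip")
            with gzip.open(local_copy, "rb") as f:
                while f.read(1 << 20):
                    pass

        print("upload verified:", targ)

    if __name__ == "__main__":
        upload_and_verify(sys.argv[1], sys.argv[2])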

eharkins commented 4 years ago

A script to upload a file and verify successful download and gzip format lives at /fh/fast/matsen_e/eharkins/economy_storage/upload_and_verify.py

To use, run the following on any of the Hutch servers:

  1. sw2account --save matsen_e, as per sciwiki
  2. /fh/fast/matsen_e/eharkins/economy_storage/upload_and_verify.py <src> <targ>, where <src> is the local gzip-format file you would like to upload (see above for how to create this from the data you want to zip) and <targ> is the path where you would like to store it on economy file storage (e.g. /overbaugh_data_archive/YYYY-MM-DD_study/file.gz)

See the sciwiki docs for more info on accessing economy file storage from the command line via the swc command.

eharkins commented 4 years ago

See https://github.com/matsengrp/wiki/wiki/Economy-File-Storage-(long-term-data-archive) for documentation of this process from here on out.