neuropoly / data-management

Repo that deals with datalad aspects for internal use

Migration #33

Open kousu opened 3 years ago

kousu commented 3 years ago

We need to migrate datasets off smb://duke.neuro.polymtl.ca and onto git+ssh://data.neuro.polymtl.ca.

I imagine both will live on for a while, but we want to prefer the git server to:

a. save space by using branching instead of duplicating entire datasets (a toy demonstration is below)
b. have provenance tracking
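For (a): git-annex addresses file content by checksum, so two identical files share a single object under .git/annex/objects, whether they sit on different branches or under different names. A toy demonstration (filenames made up, assuming a default git-annex install with the SHA256E backend):

$ git init demo && cd demo
$ git annex init
$ cp /tmp/sub-01_T1w.nii.gz a.nii.gz
$ cp /tmp/sub-01_T1w.nii.gz b.nii.gz
$ git annex add a.nii.gz b.nii.gz
$ readlink a.nii.gz b.nii.gz
# both symlinks resolve to the same object file, so the content is stored once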

To do this we need to (I think):

  1. set up permissions (#27) to replace ActiveDirectory permissions
    • this will mean we can self-manage permissions, which will be nice; but also it's an extra responsibility, so we should probably have some auditing scripts too
  2. De-duplicate the duplicated datasets
    • this is the hardest and slowest part
  3. Make each (deduplicated) dataset into BIDS format (just to be sure)
  4. Migrate each dataset to git-annex
  5. Upload each dataset to the server (steps 3–5 are sketched below)
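As a rough sketch of what steps 3–5 might look like for a single dataset (not a tested recipe: the repo path on data.neuro.polymtl.ca is hypothetical, and it assumes bids-validator and datalad are installed):

$ cd /path/to/deduplicated-dataset
$ bids-validator .                 # step 3: check BIDS compliance
$ datalad create --force .         # step 4: initialize git + git-annex in place
$ datalad save -m "Initial import from duke"
$ datalad siblings add --name origin --url git@data.neuro.polymtl.ca:datasets/<name>
$ datalad push --to origin         # step 5: upload to the git server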
kousu commented 3 years ago

duke has a huge amount of storage attached. We've been promised that the storage on the git server can be expanded as needed, but for now it is 1TB, so we need to do some reconnaissance first.

I'm starting here:

$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
[...]
//132.207.65.200/histology    8.9T  7.4T  1.5T  84% /home/GRAMES.POLYMTL.CA/me/duke/histology
//132.207.65.200/mri          8.9T  7.4T  1.5T  84% /home/GRAMES.POLYMTL.CA/me/duke/mri
//132.207.65.200/projects     4.4T  4.3T   68G  99% /home/GRAMES.POLYMTL.CA/me/duke/projects
//132.207.65.200/public       4.4T  4.3T   68G  99% /home/GRAMES.POLYMTL.CA/me/duke/public
//132.207.65.200/sct_testing  4.4T  4.3T   68G  99% /home/GRAMES.POLYMTL.CA/me/duke/sct_testing
//132.207.65.200/temp         4.4T  4.3T   68G  99% /home/GRAMES.POLYMTL.CA/me/duke/temp

Okay, so it looks like the CIFS mounts are shared from two disks, a 4.4T one and an 8.9T one, and we've used about 12T in all. But I have to think that most of that is junk.
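Before deduplicating, it's worth knowing which directories are the heavy ones. Something like this should work (assuming GNU coreutils on the machine mounting the shares; it will be slow over CIFS for the same reason fdupes is):

$ du -h --max-depth=2 /home/GRAMES.POLYMTL.CA/me/duke/projects | sort -rh | head -n 25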

Next, I'm locating duplicate files in one of the shares:

$ time fdupes -r -H /home/GRAMES.POLYMTL.CA/me/duke/projects 2>&1 | tee ~/duke-projects-duplicates.txt
[TO BE FILLED IN WHEN IT FINISHES]
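If the full listing turns out to be too slow or too large to be useful, fdupes can also report just the totals: -m/--summarize prints the number of duplicate files and the space they take up:

$ fdupes -r -H -m /home/GRAMES.POLYMTL.CA/me/duke/projects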
kousu commented 3 years ago

^ The first attempt reset halfway through; it's a lot of data to process. Trying again now.
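To survive another reset, one option is to run the scan inside tmux (or under nohup) so it outlives the SSH session, e.g.:

$ tmux new -s fdupes-scan
$ time fdupes -r -H /home/GRAMES.POLYMTL.CA/me/duke/projects 2>&1 | tee ~/duke-projects-duplicates.txt
# detach with Ctrl-b d; reattach later with: tmux attach -t fdupes-scan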