neuropoly / data-management

Repo that deals with datalad aspects for internal use

Download uk biobank dataset #107

Open kousu opened 3 years ago

kousu commented 3 years ago

Our download access to https://biobank.ctsu.ox.ac.uk/ ends on 2021-08-18. We need to archive as much as possible to our internal servers before that date.

Their download docs are https://biobank.ctsu.ox.ac.uk/~bbdatan/Accessing_UKB_data_v2.3.pdf. We have a license keyfile on smb://duke/<TODO>

They have three programs (because they invented their own API, which is what I want to avoid for #77) to do the download:

We don't need the entire dataset, only a subset of images, metadata fields, and subjects.

The dataset is estimated to be 38TB, so we need more storage space. data.neuro.polymtl.ca only has 1TB.

jcohenadad commented 3 years ago

I talked with Pierre Bellec yesterday, we might have additional options for temporary hosting:

kousu commented 3 years ago

I'm not sure how the tape storage works, but their docs at https://docs.computecanada.ca/wiki/Using_nearline_storage explain that all their servers have a mountpoint, /nearline, which is a large disk backed by nightly archives to tape. I'd have to get in and see how it actually looks, but hopefully it is relatively simple to use.

They want us to store large files there, which means we need to pack whatever we download into a .tar file, or multiple .tar files, before writing to that disk. So it might be a little complicated.
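
A rough sketch of what that batching could look like, assuming Python is available on the cluster. The names `plan_batches` and `pack_into_tars`, the archive prefix, and the size cap are all made up for illustration; none of this comes from the Compute Canada docs, and I haven't run it against /nearline.

```python
import tarfile
from pathlib import Path


def plan_batches(files, max_bytes):
    """Group (path, size) pairs into batches whose total size stays at or below max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        # Start a new batch when adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def pack_into_tars(src_dir, dest_dir, max_bytes, prefix="ukb_batch"):
    """Write the files under src_dir into numbered .tar files in dest_dir.

    Hypothetical helper: src_dir would be the download directory and
    dest_dir a directory on the archive disk.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src_dir.rglob("*") if p.is_file())
    batches = plan_batches([(p, p.stat().st_size) for p in files], max_bytes)
    archives = []
    for i, batch in enumerate(batches):
        archive = dest_dir / f"{prefix}_{i:04d}.tar"
        # Plain (uncompressed) tar; use mode "w:gz" if compression is wanted.
        with tarfile.open(archive, "w") as tar:
            for path in batch:
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(path, arcname=path.relative_to(src_dir).as_posix())
        archives.append(archive)
    return archives
```

Something like `pack_into_tars(download_dir, nearline_dir, max_bytes=100 * 10**9)` would then give a handful of large archives instead of many small files. The 100 GB cap is a placeholder; the right size depends on what the nearline docs recommend.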

alexfoias commented 3 years ago

@kousu did you manage to check the downloaded files on CC ?

kousu commented 3 years ago

> @kousu did you manage to check the downloaded files on CC ?

over here: https://github.com/neuropoly/data-management/pull/105#issuecomment-898637991