neuropoly / data-management

Repo that deals with datalad aspects for internal use

Consider converting the `data_type` of all images to `uint16` to save space #288

Open naga-karthik opened 8 months ago

naga-karthik commented 8 months ago

Given that we frequently run into out-of-storage issues on our clusters, I am proposing a short-term workaround: convert all images to `data_type: uint16` for our datasets. Ideally this could be an additional step during BIDSification. Datasets are sometimes provided to us in float64 and take up a lot of space.

Here is an example for an image from the bavaria-quebec-spine-ms dataset. Note that this dataset contains whole-spine images (i.e., axial/sagittal/coronal images originally acquired in chunks that are stitched to obtain the complete spine).

```
➜ ~/Downloads/bavaria_spine_ms_new/sub-m029621/ses-20100721/anat$ sct_image -i sub-m029621_ses-20100721_acq-ax_T2w.nii.gz -header | grep data_type
data_type   FLOAT64
➜ ~/Downloads/bavaria_spine_ms_new/sub-m029621/ses-20100721/anat$ du -sh sub-m029621_ses-20100721_acq-ax_T2w.nii.gz
240M    sub-m029621_ses-20100721_acq-ax_T2w.nii.gz
➜ ~/Downloads/bavaria_spine_ms_new/sub-m029621/ses-20100721/anat$ sct_image -i sub-m029621_ses-20100721_acq-ax_T2w.nii.gz -type uint16 -o sub-m029621_ses-20100721_acq-ax_T2w_uint16.nii.gz
➜ ~/Downloads/bavaria_spine_ms_new/sub-m029621/ses-20100721/anat$ du -sh sub-m029621_ses-20100721_acq-ax_T2w*
240M    sub-m029621_ses-20100721_acq-ax_T2w.nii.gz
 43M    sub-m029621_ses-20100721_acq-ax_T2w_uint16.nii.gz
➜ ~/Downloads/bavaria_spine_ms_new/sub-m029621/ses-20100721/anat$ sct_image -i sub-m029621_ses-20100721_acq-ax_T2w_uint16.nii.gz -header | grep data_type
data_type   UINT16
```

This resulted in an ~82% decrease in the disk space taken up by this one file. Given that the dataset has ~400 subjects, the difference in total dataset size will be substantial.
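For the record, here is a rough sketch of what the batch conversion could look like as a BIDSification step (the dataset path and filename pattern below are placeholders, not a final script):

```bash
#!/bin/bash
# Rough sketch of a batch uint16 conversion (path and pattern are examples).
DATASET=~/Downloads/bavaria_spine_ms_new

find "$DATASET" -name "*_T2w.nii.gz" -print0 | while IFS= read -r -d '' img; do
    # sct_image -type converts the voxel data type; writing to the same
    # filename overwrites the original image.
    sct_image -i "$img" -type uint16 -o "$img"
done
```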

Thoughts on this? @jcohenadad @mguaypaq (also tagging @jqmcginnis)

jcohenadad commented 8 months ago

In general I am reluctant to remove information from the data, unless we are 100% sure that it won't affect the subsequent processing. For example, in this case, how can we be sure that reducing the precision of the pixel intensities won't affect our ability to detect subtle signal intensity variations in the presence of a slightly hyperintense lesion? If you can motivate the conversion to UINT16 by making sure this won't affect the quality of the processing, then I'm all for it.
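For instance, one way to check this empirically could be to run the same downstream step on both versions and compare the outputs (a rough sketch only; the specific SCT commands and the default `_seg` output naming are assumptions here, not a prescribed protocol):

```bash
# Convert a copy of the image to uint16.
sct_image -i img.nii.gz -type uint16 -o img_uint16.nii.gz

# Run the same segmentation step on both versions
# (default output naming <input>_seg.nii.gz is assumed).
sct_deepseg_sc -i img.nii.gz -c t2
sct_deepseg_sc -i img_uint16.nii.gz -c t2

# A Dice overlap close to 1.0 would suggest the dtype conversion did not
# change the result of this particular step.
sct_dice_coefficient -i img_seg.nii.gz -d img_uint16_seg.nii.gz
```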

On a separate note, we can also buy more HDs if storage is limited

naga-karthik commented 8 months ago

Agreed with your comments. The conversion definitely reduces pixel-intensity precision by rounding.

Upgrading romane's storage is an open issue and Nick is working on it.

But specifically for bavaria-quebec-spine-ms, this is more of a pragmatic solution to move forward with the project. If we keep the data in its original float64 format, the dataset is about ~400 GB, and if I download it on romane for model training there is no space left for other users.

Hence, my suggestion would be to at least change the data type for bavaria-quebec-spine-ms so that storage is not a bottleneck in the immediate future.

jcohenadad commented 8 months ago

How about

mguaypaq commented 8 months ago

I would also be wary about the loss of precision, but yes, this is an empirical question, so Julien's test is a good idea.

Note that git-annex can possibly help here. Most people know about `git annex get`, but I think most people don't know about `git annex drop`: this command lets you free up local disk space for files that are already saved on the server. Both commands let you specify a list of files and/or directories. When processing the bavaria-quebec-spine-ms dataset, do you really need the entire dataset all at once? For example, if your processing happens one subject at a time, your processing script can do something like this:
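(A rough sketch; the dataset path, subject layout, and processing command are placeholders for whatever you actually run.)

```bash
#!/bin/bash
# Fetch, process, and drop one subject at a time, so that only one
# subject's files occupy local disk space at any given moment.
cd ~/data/bavaria-quebec-spine-ms   # placeholder path

for subject in sub-*/ ; do
    git annex get "$subject"            # download this subject's files
    ./process_subject.sh "$subject"     # placeholder for your processing
    git annex drop "$subject"           # free local space (content stays on the server)
done
```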

For this to work well, you might need to run `eval "$(ssh-agent)" && ssh-add` in the terminal before the start of processing, so that git-annex doesn't need to ask for your password every time it talks to the server.

Or, if you need all subjects at once but only the axial images, you can `git annex get` only the axial images, etc., e.g.:
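(Again a sketch; the exact glob depends on the dataset's file naming, but for this dataset the axial images carry `acq-ax` in the filename.)

```bash
# Fetch only the axial T2w images across all subjects and sessions.
git annex get sub-*/ses-*/anat/*acq-ax*_T2w.nii.gz
```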

Separately, I recently noticed that nibabel doesn't use the best gzip compression settings when writing .nii.gz files (see upstream discussion on nibabel#382). But I ran a quick test on bavaria-quebec-spine-ms and it doesn't save that much space compared to changing the dtype. I looked at the axial and sagittal images for sub-m023917 and compared four variants: the original file, the original re-compressed with stronger gzip settings, an int16 conversion, and the int16 conversion re-compressed (the -orig, -rezip, -i16 and -i16-rezip files below).

The result is (file sizes in bytes):

```
28681234 ax-orig.nii.gz
27681513 ax-rezip.nii.gz
11315319 ax-i16.nii.gz
11188415 ax-i16-rezip.nii.gz

16202433 sag-orig.nii.gz
15678871 sag-rezip.nii.gz
 6618456 sag-i16.nii.gz
 6505885 sag-i16-rezip.nii.gz
```

So, re-zipping only saves 1-5% (small), but changing the dtype cuts the size by a factor of 2-3 (big).
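For reference, the re-zip part of this test can be approximated from the command line (a sketch; `gzip -9` stands in for "better compression settings", and the filenames match the listing above):

```bash
# Decompress, then re-compress at the maximum gzip level, keeping the
# original file for comparison.
gunzip -c ax-orig.nii.gz > ax-orig.nii
gzip -9 -c ax-orig.nii > ax-rezip.nii.gz
rm ax-orig.nii

# Compare the two file sizes.
ls -l ax-orig.nii.gz ax-rezip.nii.gz
```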

NathanMolinier commented 8 months ago

To improve the way we use git-annex, I am planning to develop, in the next few weeks, a script that gets only specific data from datasets based on a config file. With this script, we will be able to get specific files from different datasets without downloading all the data for every dataset.
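A purely hypothetical illustration of what such a config-driven fetch might look like (the config format, filenames, and helper loop below are invented for illustration and are not the planned design):

```bash
# files_to_get.txt (hypothetical config): one dataset-relative glob per line, e.g.
#   bavaria-quebec-spine-ms/sub-*/ses-*/anat/*acq-ax*_T2w.nii.gz
#   some-other-dataset/sub-*/anat/*_T1w.nii.gz

while IFS= read -r pattern; do
    dataset=${pattern%%/*}        # first path component = dataset directory
    rest=${pattern#*/}            # remaining path/glob inside the dataset
    # $rest is left unquoted on purpose so the shell expands the glob.
    ( cd "$dataset" && git annex get $rest )
done < files_to_get.txt
```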

naga-karthik commented 8 months ago

Thank you @mguaypaq for your suggestions! As for your questions:

> do you really need the entire dataset all at once?

For model training, yes, but usually multiple contrasts/orientations are present and my model only uses one of them, so I get your point about using `git annex drop`.

> you can git annex get only the axial images, etc

Yes, this is what I was thinking of doing next.