mne-tools / mne-bids-pipeline

Automatically process entire electrophysiological datasets using MNE-Python.
https://mne.tools/mne-bids-pipeline/
BSD 3-Clause "New" or "Revised" License

reduce storage size of derivative folders? #922

Closed SophieHerbst closed 5 months ago

SophieHerbst commented 6 months ago

I just realized that my derivatives folders are 22GB per participant. My epochs are very long (10s), needed for this particular study. Still, do we need to save all the intermediate steps given the automatic caching where things are rerun anyways if parameters change? For instance, I have 4 epoch files per participant:

[Image: Screenshot 2024-04-02 at 18 01 41]

larsoner commented 6 months ago

We could probably at least get rid of -icafit_epo.fif. I think we can recreate that on the fly for the two steps that use it instead of saving it during the first step. It would be slower, so we have to decide if we want that tradeoff to save on some hard drive space (I think it's probably worth it).

For the others it's not as easy. The idea behind MNE-BIDS-Pipeline is that each step has a defined set of M inputs and N outputs. If you want to know if a step needs to be rerun, you check for existence and validity (age/hash) of both the M inputs and the N outputs. If something is wrong you recompute. So the files get created as _epo.fif -> _proc-ica_epo.fif -> _proc-clean_epo.fif in three separate steps (epoching, apply ICA, and peak-to-peak, respectively). Even though the first two might be considered transient / unnecessary, not keeping them would mean we'd have to change how caching works... and I'm not sure how easy that would be.
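As a rough illustration of the "check existence and validity of inputs and outputs" idea, here is a minimal Python sketch. It is not MNE-BIDS-Pipeline's actual caching code: the function names `needs_rerun` and `record_run`, the JSON cache file, and the use of SHA-256 content hashes are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_run(inputs, outputs, cache_file):
    """Store the hashes of all inputs and outputs after a step has run.

    Hypothetical helper: the cache is a plain JSON file mapping path -> hash.
    """
    hashes = {str(p): file_hash(p) for p in [*inputs, *outputs]}
    Path(cache_file).write_text(json.dumps(hashes))


def needs_rerun(inputs, outputs, cache_file):
    """Return True if any input/output is missing or differs from the cache."""
    cache_file = Path(cache_file)
    paths = [Path(p) for p in [*inputs, *outputs]]
    # A missing cache record or a missing file always forces a recompute.
    if not cache_file.exists() or not all(p.exists() for p in paths):
        return True
    recorded = json.loads(cache_file.read_text())
    current = {str(p): file_hash(p) for p in paths}
    # Any changed, added, or dropped file invalidates the step.
    return recorded != current
```

Under this scheme, deleting an intermediate file such as `_proc-ica_epo.fif` makes both the step that produced it (missing output) and the step that consumes it (missing input) look invalid, which is why dropping intermediate files cannot be done without changing how the cache check works.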

Also in some cases we need those intermediate files -- for example we have a config param that allows you to decide whether to use _epo.fif or _proc-clean_epo.fif in decoding. So keeping intermediate files around can help.

Finally, one important reason to keep intermediate files is that they're sometimes invaluable for debugging problems. Let's say you get a _proc-clean_epo.fif that looks bad for some reason. By working backward through _proc-ica_epo.fif and then to _epo.fif, you can identify the step where things went wrong. The report in general is good for this but it's never as comprehensive/complete as being able to look at the original data.

hoechenberger commented 6 months ago

We could offer to select which output files to automatically remove once the pipeline run has completed. Like, a new last step named "cleanup" or so

larsoner commented 6 months ago

But if we do that and the user does mne_bids_pipeline config.py, it will recompute and recreate all of those files. I don't think that's good.

SophieHerbst commented 6 months ago

Some sort of final archive or cleanup function once the study is completed might be a good idea to save storage space?

larsoner commented 6 months ago

To me if you're 100% done and want to save space this shouldn't be MNE-BIDS-Pipeline's job really -- it should be easy enough to do with a custom shell (or Python) script at the end.
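For anyone landing here later, a custom cleanup script could look roughly like the sketch below. The function name `cleanup_derivatives`, the default glob patterns, and the derivatives path in the usage note are assumptions for illustration; adjust the patterns to the files you actually consider disposable. Keep in mind that rerunning the pipeline afterwards will recompute and recreate whatever was deleted.

```python
from pathlib import Path


def cleanup_derivatives(
    deriv_root,
    patterns=("*icafit_epo.fif", "*_proc-ica_epo.fif"),
    dry_run=True,
):
    """Find (and optionally delete) intermediate files under deriv_root.

    Returns the sorted list of matching paths and their total size in bytes.
    With dry_run=True (the default) nothing is deleted.
    """
    deriv_root = Path(deriv_root)
    # Collect matches for every pattern, recursing into subject/session folders.
    matches = sorted(
        {p for pattern in patterns for p in deriv_root.rglob(pattern) if p.is_file()}
    )
    total_bytes = 0
    for path in matches:
        total_bytes += path.stat().st_size
        if not dry_run:
            path.unlink()
    return matches, total_bytes
```

A cautious way to use it is to call `cleanup_derivatives("path/to/derivatives")` first, inspect the returned list and size, and only then call it again with `dry_run=False`.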

hoechenberger commented 5 months ago

@SophieHerbst Do you think this can be closed for the time being?

SophieHerbst commented 5 months ago

yes @hoechenberger!