Closed: mpompolas closed this issue 2 years ago
So my limited understanding is that PyTorch's Dataset/DataLoader can readily handle multiple terabytes of training data without issue. But because we preload some of the 2D/3D volume data into Bids3DDataset/BidsDataset objects before training time, for various reasons, we could hit a CPU RAM overflow at the BIDS object formation stage, even though those classes inherit from Dataset/DataLoader and would train fine if the bulk of the training-relevant data were loaded only at training time. So the risk is CPU RAM overflow before training, not GPU RAM overflow during mini-batch training. Is that correct?
Hi, I am trying to configure ivadomed for my university project. From my understanding, all crops of the dataset are loaded into main memory before training, and even though I am running the code on a cluster node, the RAM overflows with a small-to-medium-sized dataset of roughly 100 patients. Have you considered any workaround? (I installed ivadomed from PyPI; I think the most current version there is 2.3.1.) Otherwise I would try modifying the code so that the images are only actually loaded within the `__getitem__` function of the dataset.
Hi @maffos, Thanks for your message!
> From my understanding, all crops of the dataset are loaded into main memory before training

Yes, correct!

> Have you considered any workaround?

We have been discussing solutions to address this issue, along the lines of "training on the fly", but nothing has really started at this stage.

> Otherwise I would try modifying the code so that the images are only actually loaded within the `__getitem__` function of the dataset.

That would be really great! I do like the idea you suggested! Please feel free to create a "Draft" Pull Request so that the ivadomed team can support you with the integration of the changes etc.
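The lazy-loading idea suggested above can be sketched roughly as follows. Note that the class and parameter names here are hypothetical illustrations, not ivadomed's actual loader API; also, any object implementing `__len__` and `__getitem__` works as a map-style PyTorch dataset, so no `torch` import is needed to show the idea:

```python
# Sketch of lazy loading: only file paths are stored in __init__;
# pixel data is read inside __getitem__, i.e. on the fly per item.
class LazyVolumeDataset:
    def __init__(self, filenames, load_fn, transform=None):
        self.filenames = list(filenames)  # cheap: paths only, no pixel data
        self.load_fn = load_fn            # e.g. a nibabel-based reader
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        sample = self.load_fn(self.filenames[idx])  # I/O happens here
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
```

With this structure, peak CPU RAM is bounded by the batch size times the number of DataLoader workers rather than by the full dataset size.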
related to #773
Hey @dyt811! What's the status of this issue? The priority is already high and there's an open project; should we discuss this in the next meeting? Maybe this should be revived asap, because we are about to start training with big datasets in the ivadomed-eeg project with @mpompolas.
@naga-karthik good point. I was hoping @entiri would be able to look into this a bit more after the parallel training PR fix, but he ran into quite a few roadblocks getting metrics back for his parallel training. Let's talk about it a bit more now that I have Compute Canada access and should be able to test larger data loading in conjunction with the latest Singularity fixes.
As per the ivadomed meeting today, I investigated how we do things currently vs. what MONAI does. Following up on @mpompolas's suggestion, here is what I came up with:
To briefly summarize:

- MONAI applies the preprocessing transforms inside the `__getitem__()` function of the corresponding dataset classes, i.e. on the fly during training.
- ivadomed performs the preprocessing in the `__init__()` function of our dataset classes. The preprocessed data is cached in the `self.handlers` list. See the `_load_filenames()` and `_prepare_indices()` helper functions for this.
- MONAI also implements a `CacheDataset()` class, which is analogous to how we do things in ivadomed. It additionally implements a `PersistentDataset()` class that saves the preprocessed data to a directory to be read later, as opposed to storing it in RAM. This helps accommodate large datasets and is analogous to @jcohenadad's suggestion in the previous ivadomed meeting.

I think a nice next step would be to add a key to our dataset classes that reflects the pre-computation strategy, with options `persistent` and `cache` (the current ivadomed default). `persistent` can be implemented by saving the contents of the `self.handlers` list to a directory, removing them from memory, and re-reading them during `__getitem__()`. Another topic of discussion for the near future would be to see if we can integrate MONAI's dataset classes as well (in parallel to the models effort in #1116).
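The proposed `persistent` strategy could look roughly like this. This is a hedged sketch with hypothetical names, not ivadomed's real code: each preprocessed sample is spilled to disk once, freed from RAM, and re-read on demand in `__getitem__`:

```python
import os
import pickle
import tempfile

# Sketch of a disk-backed cache for preprocessed samples, in the spirit
# of MONAI's PersistentDataset. Class and file names are illustrative.
class PersistentCache:
    def __init__(self, samples, cache_dir=None):
        self.cache_dir = cache_dir or tempfile.mkdtemp(prefix="handlers_")
        self.paths = []
        for i, sample in enumerate(samples):   # one-time spill to disk
            path = os.path.join(self.cache_dir, f"sample_{i}.pkl")
            with open(path, "wb") as f:
                pickle.dump(sample, f)
            self.paths.append(path)
        # the original `samples` can now be garbage-collected;
        # only the small list of paths stays in RAM

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):                # re-read during __getitem__
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)
```

In practice the serialization format matters (e.g. `torch.save` or compressed NIfTI instead of pickle), and the cache directory should be keyed on the preprocessing parameters so stale caches are not reused, but the control flow would be the same.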
This issue is opened in order to start a discussion about the potential options.
The datasets at our disposal are only getting bigger.
We should start thinking about how we can speed up the loading process, since right now the loader loads everything before the computations get started, leading to very long waiting times.
We should probably consider partial loading in extreme cases and/or modifications to the loader based on the available RAM.
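Choosing the loader behavior based on available RAM could be as simple as comparing an estimated footprint against a budget. The function and option names below are purely illustrative, not existing ivadomed options:

```python
# Illustrative sketch: pick a caching strategy from an estimated
# dataset footprint and a user-supplied RAM budget (in bytes).
def choose_strategy(n_samples, bytes_per_sample, ram_budget_bytes):
    footprint = n_samples * bytes_per_sample
    if footprint <= ram_budget_bytes:
        return "cache"        # everything fits: keep the current in-RAM loading
    return "persistent"       # otherwise spill preprocessed samples to disk
```

A real implementation would need a reliable per-sample size estimate (e.g. from the NIfTI headers) and a safety margin for the transforms' intermediate buffers.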