ivadomed / ivadomed

Repository for the collaborative IVADO medical imaging project between the Mila and NeuroPoly labs.
https://ivadomed.org
MIT License

Accommodate training of very big datasets #799

Closed mpompolas closed 2 years ago

mpompolas commented 3 years ago

This issue is opened to start a discussion about the potential options.

The datasets at our disposal are only getting bigger.

We should start thinking about how we can speed up the loading process, since right now the loader loads everything before computation starts, leading to very long waiting times.

We should probably consider partial loading in extreme cases and/or modifications to the loader based on the available RAM.
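
For illustration, a minimal sketch of a RAM-based switch like the one suggested above, assuming `psutil` is available (the threshold logic and the `estimated_dataset_gb` argument are hypothetical, not existing ivadomed parameters):

```python
import psutil


def choose_loading_strategy(estimated_dataset_gb, safety_margin=0.5):
    """Pick between full preloading and on-the-fly loading based on free RAM.

    `estimated_dataset_gb` would come from an upstream estimate of the
    decompressed dataset size; the margin leaves room for transforms/workers.
    """
    available_gb = psutil.virtual_memory().available / 1024 ** 3
    if estimated_dataset_gb < available_gb * safety_margin:
        return "preload"      # current behaviour: load everything up front
    return "on_the_fly"       # proposed behaviour: load each sample in __getitem__
```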

dyt811 commented 3 years ago

My limited understanding is that PyTorch's Dataset/DataLoader can readily handle multiple terabytes of training data without issue. However, because we preload some of the 2D/3D volume data into Bids3DDataset/BidsDataset objects before training for various reasons, and even though those classes inherit from Dataset/DataLoader and will train fine by loading the bulk of the training-relevant data at training time, we may see CPU RAM overflow at the BIDS object formation stage, before training, rather than GPU RAM overflow during mini-batch training. Is that correct?
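
For illustration, the eager pattern described above looks roughly like this (a sketch, not the actual `BidsDataset`/`Bids3DDataset` code):

```python
import nibabel as nib
from torch.utils.data import Dataset


class EagerVolumeDataset(Dataset):
    """Illustrative only: every volume is read at construction time, so CPU RAM
    usage peaks here, before any mini-batch ever reaches the GPU."""

    def __init__(self, filenames):
        # With a large cohort, this list is where the process can exceed
        # available CPU RAM, long before training starts.
        self.volumes = [nib.load(f).get_fdata() for f in filenames]

    def __len__(self):
        return len(self.volumes)

    def __getitem__(self, idx):
        return self.volumes[idx]
```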

maffos commented 3 years ago

Hi, I am trying to configure ivadomed for my university project. From my understanding, all crops of the dataset are loaded into main memory before training, and even though I am running the code on a cluster node, the RAM overflows with a small-to-medium-sized dataset of roughly 100 patients. Have you considered any workaround? (I installed ivadomed from PyPI, and I think the most current version there is 2.3.1.) Otherwise I would try modifying the code so that the images are only actually loaded within the __getitem__ function of the dataloader.
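
A minimal sketch of that lazy alternative, where only file paths are kept at construction time and each image is read inside `__getitem__` (illustrative only, not the current ivadomed loader):

```python
import nibabel as nib
from torch.utils.data import DataLoader, Dataset


class LazyVolumeDataset(Dataset):
    """Keeps only file paths in memory; voxel data is read on demand."""

    def __init__(self, filenames, transform=None):
        self.filenames = filenames
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        # Loading happens here, so only the samples currently being batched
        # (plus any worker prefetching) are resident in CPU RAM.
        volume = nib.load(self.filenames[idx]).get_fdata()
        if self.transform is not None:
            volume = self.transform(volume)
        return volume


# loader = DataLoader(LazyVolumeDataset(filenames), batch_size=2, num_workers=4)
```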

charleygros commented 3 years ago

Hi @maffos, Thanks for your message!

From my understanding, all crops of the dataset are loaded into main memory before training

Yes, correct!

Have you considered any workaround?

We have been discussing implementing solutions to address this issue, along the lines of "training on the fly", but nothing has really started at this stage.

Otherwise I would try modifying the code so that the images are only actually loaded within the __getitem__ function of the dataloader.

That would be really great, and I do like the idea you suggested! Please feel free to create a "Draft" Pull Request so that the ivadomed team can support you with integrating the changes.

jcohenadad commented 2 years ago

related to #773

naga-karthik commented 2 years ago

Hey @dyt811! What's the status of this issue? The priority is already high and there's an open project; should we discuss this in the next meeting? Maybe this should be revived ASAP, because we are about to start training with big datasets in the ivadomed-eeg project with @mpompolas.

dyt811 commented 2 years ago

@naga-karthik good point. I was hoping @entiri would be able to look into this a bit more after the parallel-training PR fix, but he ran into quite a few roadblocks getting metrics back for his parallel training. Let's talk about it a bit more now that I have Compute Canada access and should be able to test larger data loading in conjunction with the latest Singularity fixes.

uzaymacar commented 2 years ago

As per the ivadomed meeting today, I investigated how we do things currently vs. what MONAI does. Following up on @mpompolas's suggestion, here is what I came up with:

[Screenshot (2022-04-12): comparison of the current ivadomed loading approach with MONAI's dataset classes]

To briefly summarize:

- I think a nice next step would be to add a key to our dataset classes that reflects the pre-computation strategy, with options persistent and cache (the current ivadomed default). persistent can be implemented by saving the contents of the self.handlers list to disk, removing them from memory, and re-reading them during __getitem__() (see the sketch below).
- Another topic of discussion for the near future would be whether we can integrate MONAI's dataset classes as well (in parallel to the models effort in #1116).
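
A rough sketch of what the persistent option could look like, assuming each pre-computed sample can be pickled (names such as `handlers` and `cache_dir` are illustrative, not the actual ivadomed internals); MONAI's `PersistentDataset` follows the same idea of caching pre-computed samples to disk:

```python
import pickle
from pathlib import Path

from torch.utils.data import Dataset


class PersistentHandlerDataset(Dataset):
    """Sketch of the proposed `persistent` strategy: pre-computed samples are
    written to disk once, dropped from memory, and re-read lazily."""

    def __init__(self, handlers, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.paths = []
        for i, sample in enumerate(handlers):
            path = self.cache_dir / f"sample_{i}.pkl"
            with open(path, "wb") as f:
                pickle.dump(sample, f)  # persist the pre-computed sample
            self.paths.append(path)
        # `handlers` can now be released by the caller; only paths are kept.

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)  # re-read on demand
```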