Closed: mpompolas closed this issue 2 years ago
So my limited understanding is that PyTorch's Dataset/DataLoader can readily handle multiple terabytes of training data without issue. But because we preload some of the 2D/3D volume data into Bids3DDataset/BidsDataset objects before training time, for various reasons, we could hit a CPU RAM overflow at the BIDS object formation stage, even though those classes inherit from Dataset/DataLoader and would train fine if the bulk of the training-relevant data were loaded only at training time. So the risk is CPU RAM overflow before training, not GPU RAM overflow during mini-batch training. Is that correct?
Hi, I am trying to configure ivadomed for my university project. From my understanding, all crops of the dataset are loaded into main memory before training, and even though I am running the code on a cluster node, the RAM overflows with a small-to-medium-sized dataset of roughly 100 patients. Have you considered any workaround? (I installed ivadomed from PyPI; I think the most current version there is 2.3.1.) Otherwise I would try modifying the code so that the images are only actually loaded within the `__getitem__` function of the dataset.
Hi @maffos, Thanks for your message!
> From my understanding, all crops of the dataset are loaded into main memory before training

Yes, correct!

> Have you considered any workaround?

We have been discussing solutions to address this issue, along the lines of "training on the fly", but nothing has really started at this stage.

> Otherwise I would try modifying the code so that the images are only actually loaded within the `__getitem__` function of the dataset.

That would be really great! I do like the idea you suggested! Please feel free to create a "Draft" Pull Request so that the ivadomed team can support you with the integration of the changes etc.
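The lazy-loading idea suggested above can be sketched roughly as follows. Note that the class and parameter names here are hypothetical illustrations, not ivadomed's actual loader API; also, any object implementing `__len__` and `__getitem__` works as a map-style PyTorch dataset, so no `torch` import is needed to show the idea:

```python
# Sketch of lazy loading: only file paths are stored in __init__;
# pixel data is read inside __getitem__, i.e. on the fly per item.
class LazyVolumeDataset:
    def __init__(self, filenames, load_fn, transform=None):
        self.filenames = list(filenames)  # cheap: paths only, no pixel data
        self.load_fn = load_fn            # e.g. a nibabel-based reader
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        sample = self.load_fn(self.filenames[idx])  # I/O happens here
        if self.transform is not None:
            sample = self.transform(sample)
        return sample
```

With this structure, peak CPU RAM is bounded by the batch size times the number of DataLoader workers rather than by the full dataset size.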
related to #773
Hey @dyt811! What's the status of this issue? The priority is already high and there's an open project; should we discuss this in the next meeting? Maybe this should be revived asap, because we are about to start training with big datasets in the ivadomed-eeg project with @mpompolas.
@naga-karthik good point. I was hoping @entiri would be able to look into this a bit more after the parallel training PR fix, but he ran into quite a few roadblocks getting metrics back for his parallel training. Let's talk about it a bit more now that I have Compute Canada access and should be able to test larger data loading in conjunction with the latest Singularity fixes.
As per the ivadomed meeting today, I investigated how we do things currently vs. what MONAI does. Following up on @mpompolas's suggestion, here is what I came up with:
To briefly summarize:

- MONAI applies the preprocessing transforms inside the `__getitem__()` function of the corresponding dataset classes, i.e. on the fly during training.
- ivadomed performs the preprocessing in the `__init__()` function of our dataset classes. The preprocessed data is cached in the `self.handlers` list. See the `_load_filenames()` and `_prepare_indices()` helper functions for this.
- MONAI also implements a `CacheDataset()` class, which is analogous to how we do things in ivadomed. It additionally implements a `PersistentDataset()` class that saves the preprocessed data to a directory to be read later, as opposed to storing it in RAM. This helps accommodate large datasets and is analogous to @jcohenadad's suggestion in the previous ivadomed meeting.

I think a nice next step would be to add a key to our dataset classes that reflects the pre-computation strategy, with options `persistent` and `cache` (the current ivadomed default). `persistent` can be implemented by saving the contents of the `self.handlers` list to a directory, removing them from memory, and re-reading them during `__getitem__()`. Another topic of discussion for the near future would be to see if we can integrate MONAI's dataset classes as well (in parallel to the models effort in #1116).
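The proposed `persistent` strategy could look roughly like this. This is a hedged sketch with hypothetical names, not ivadomed's real code: each preprocessed sample is spilled to disk once, freed from RAM, and re-read on demand in `__getitem__`:

```python
import os
import pickle
import tempfile

# Sketch of a disk-backed cache for preprocessed samples, in the spirit
# of MONAI's PersistentDataset. Class and file names are illustrative.
class PersistentCache:
    def __init__(self, samples, cache_dir=None):
        self.cache_dir = cache_dir or tempfile.mkdtemp(prefix="handlers_")
        self.paths = []
        for i, sample in enumerate(samples):   # one-time spill to disk
            path = os.path.join(self.cache_dir, f"sample_{i}.pkl")
            with open(path, "wb") as f:
                pickle.dump(sample, f)
            self.paths.append(path)
        # the original `samples` can now be garbage-collected;
        # only the small list of paths stays in RAM

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):                # re-read during __getitem__
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)
```

In practice the serialization format matters (e.g. `torch.save` or compressed NIfTI instead of pickle), and the cache directory should be keyed on the preprocessing parameters so stale caches are not reused, but the control flow would be the same.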
This issue is opened in order to start a discussion about the potential options.
The datasets at our disposal are only getting bigger.
We should start thinking about how we can speed up the loading process, since right now the loader loads everything before the computations get started, leading to very long waiting times.
We should probably consider partial loading in extreme cases and/or modifications to the loader based on the available RAM.
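Choosing the loader behavior based on available RAM could be as simple as comparing an estimated footprint against a budget. The function and option names below are purely illustrative, not existing ivadomed options:

```python
# Illustrative sketch: pick a caching strategy from an estimated
# dataset footprint and a user-supplied RAM budget (in bytes).
def choose_strategy(n_samples, bytes_per_sample, ram_budget_bytes):
    footprint = n_samples * bytes_per_sample
    if footprint <= ram_budget_bytes:
        return "cache"        # everything fits: keep the current in-RAM loading
    return "persistent"       # otherwise spill preprocessed samples to disk
```

A real implementation would need a reliable per-sample size estimate (e.g. from the NIfTI headers) and a safety margin for the transforms' intermediate buffers.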