Closed: tanzir5 closed this issue 4 months ago
This is particularly important because learning rate schedulers and some other components operate at the epoch level (they are updated after every epoch, for example). We should not mess with the epoch definition of the PyTorch Lightning trainer, as that can make the model perform worse. This approach also keeps memory consumption minimal.
Using h5py also seems prudent, since it loads data lazily.
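For reference, h5py only reads the rows you actually index, so something like the sketch below (the file path and dataset name are made up for illustration) pulls a single slice from disk instead of loading the whole file:

```python
import h5py

# Opening the file only reads metadata; the data itself stays on disk.
f = h5py.File("data.h5", "r")      # "data.h5" is a placeholder path
sequences = f["sequences"]         # dataset name is illustrative
batch = sequences[0:128]           # only this slice is actually read into memory
```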
This has been implemented in https://github.com/odissei-lifecourse/life-sequencing-dutch/commit/2e6b59fbd56ab1e2a34a4631bb73781c6d244387
Initial testing shows it is working as expected. We still need to run on the entire dataset to verify it fully.
This has worked for the single-GPU case and was merged in #44. Closing.
After 3-4 hours of research, IterableDataset seems to be the right way to train on datasets that are too large to load entirely in memory.
The key point is that a GPU only loads one batch at a time and discards the previous batch before loading the next one, so no more than one batch's worth of memory is required per GPU at any given time.
The custom IterableDataset may look like this:
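(A rough sketch, not the exact implementation in the commit above; the file path, dataset name, and worker-splitting details are placeholders.)

```python
import h5py
import torch
from torch.utils.data import IterableDataset, get_worker_info


class H5IterableDataset(IterableDataset):
    """Streams samples from an HDF5 file without loading it into memory."""

    def __init__(self, h5_path, dataset_name="sequences"):
        self.h5_path = h5_path
        self.dataset_name = dataset_name

    def __iter__(self):
        # Open the file inside __iter__ so each DataLoader worker process
        # gets its own h5py file handle.
        with h5py.File(self.h5_path, "r") as f:
            data = f[self.dataset_name]
            n = data.shape[0]

            # Split the index range across DataLoader workers, if any,
            # so samples are not duplicated.
            worker = get_worker_info()
            start, end = 0, n
            if worker is not None:
                per_worker = (n + worker.num_workers - 1) // worker.num_workers
                start = worker.id * per_worker
                end = min(start + per_worker, n)

            for i in range(start, end):
                # Only row i is read from disk here; nothing else is materialized.
                yield torch.as_tensor(data[i])
```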
Then the training can be done like this:
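Roughly like the following. This is only an illustration: the toy model, the input dimension, the file path, and the Trainer settings are all made-up stand-ins, and the DataLoader just wraps the streaming dataset so at most one batch is resident per GPU.

```python
import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader


class ToyModel(pl.LightningModule):
    # Stand-in model for illustration; the real model lives in the repo.
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 1)  # 16 is a placeholder feature dimension

    def training_step(self, batch, batch_idx):
        x = batch.float()
        return self.layer(x).pow(2).mean()  # dummy loss for the sketch

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# The IterableDataset yields single samples; the DataLoader assembles them
# into batches on the fly, so only one batch at a time is held per GPU.
train_ds = H5IterableDataset("data.h5")            # placeholder path
train_loader = DataLoader(train_ds, batch_size=128, num_workers=4)

trainer = pl.Trainer(max_epochs=10, accelerator="auto", devices=1)
trainer.fit(ToyModel(), train_loader)
```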