odissei-lifecourse / life-sequencing-dutch

MIT License

Implement iterabledataset for pretraining #24

Closed tanzir5 closed 4 months ago

tanzir5 commented 5 months ago

After 3-4 hours of research, IterableDataset seems to be the right way to train on datasets that are too large to load entirely into memory.

The key point is that a GPU loads only one batch at a time and discards the previous batch before loading the next. This ensures that no more than one batch's worth of memory is required per GPU at any time.
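As a minimal sketch of that one-batch-at-a-time behavior (pure Python, function name hypothetical), a generator can stream fixed-size batches so that only the current batch is ever materialized:

```python
def batched(stream, batch_size):
    # Yield lists of up to `batch_size` items; only the current batch
    # lives in memory, mirroring how a DataLoader over an
    # IterableDataset feeds the GPU one batch at a time.
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch

batches = list(batched(range(10), batch_size=4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```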

The custom IterableDataset may look like this:

import h5py
import torch
from torch.utils.data import IterableDataset

class CustomDataset(IterableDataset):
    def __init__(self, file_paths):
        super().__init__()
        self.file_paths = file_paths

    def __iter__(self):
        # Stream records lazily: open each HDF5 file in turn and
        # yield one example at a time, never loading a full file.
        for file_path in self.file_paths:
            with h5py.File(file_path, 'r') as f:
                for data in f['data']:
                    yield torch.tensor(data, dtype=torch.float32)

Then the training can be done like this:

from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

dataset = CustomDataset(file_paths)
loader = DataLoader(dataset, batch_size=32)

# Setup the model
model = SimpleModel()

# Setup the trainer
trainer = Trainer(
    strategy=DDPStrategy(process_group_backend="mpi"),
    # Note: newer Lightning versions replace `gpus=4` with
    # `accelerator="gpu", devices=4`.
    gpus=4,
    num_nodes=1,
    max_epochs=10
)

# Train the model
trainer.fit(model, loader)
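One caveat worth noting (not covered in the snippet above): if the DataLoader is given num_workers > 0, every worker process runs the same __iter__ and would yield every file, duplicating data. A common fix is to shard the file list per worker; a minimal sketch of the round-robin split (helper name hypothetical):

```python
def shard_for_worker(file_paths, worker_id, num_workers):
    # Round-robin assignment: worker k takes files k, k + num_workers, ...
    # Inside __iter__ this would use torch.utils.data.get_worker_info()
    # to obtain worker_id and num_workers at runtime.
    return file_paths[worker_id::num_workers]

files = ["a.h5", "b.h5", "c.h5", "d.h5", "e.h5"]
shards = [shard_for_worker(files, k, 2) for k in range(2)]
# shards == [['a.h5', 'c.h5', 'e.h5'], ['b.h5', 'd.h5']]
```

Each file is assigned to exactly one worker, so no example is yielded twice.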
tanzir5 commented 5 months ago

This is particularly important because learning-rate schedulers and some other components operate at the epoch level (for example, they are updated after every epoch). We should not tamper with the epoch definition of the PyTorch Lightning Trainer, as that can degrade model performance. This method also keeps memory consumption minimal.

tanzir5 commented 5 months ago

Also, using h5py seems prudent, since it supports lazy loading.
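To illustrate the lazy loading: opening an HDF5 dataset with h5py returns a handle, and only the rows actually sliced are read from disk. A small self-contained sketch (file path and contents are made up for the demo):

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small HDF5 file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(12).reshape(4, 3))

with h5py.File(path, "r") as f:
    dset = f["data"]   # just a handle; nothing is read into memory yet
    row = dset[1]      # reads only this row from disk
    print(row.tolist())  # [3, 4, 5]
```

Iterating over f['data'], as the CustomDataset above does, reads rows on demand in the same way.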

tanzir5 commented 5 months ago

This has been implemented in https://github.com/odissei-lifecourse/life-sequencing-dutch/commit/2e6b59fbd56ab1e2a34a4631bb73781c6d244387

Initial testing shows it is working as expected. We still need to run on the entire dataset to verify it fully.

f-hafner commented 4 months ago

This has worked for the single-GPU case and was merged in #44. Closing.