lab-cosmo / metatrain

Training and evaluating machine learning models for atomistic systems.
https://lab-cosmo.github.io/metatrain/
BSD 3-Clause "New" or "Revised" License

Distributed training #239

Closed: frostedoyster closed this 3 weeks ago

frostedoyster commented 4 weeks ago

Reopening #179

This ran fine on 16 H100s at CSCS, across 4 nodes.


📚 Documentation preview 📚: https://metatrain--239.org.readthedocs.build/en/239/
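
For reference, multi-node training of this kind typically comes down to one process per GPU wrapped in `DistributedDataParallel`. Below is a minimal sketch, assuming SLURM-style environment variables and a placeholder model rather than metatrain's actual training loop:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative only: rank information as typically exported by SLURM.
# MASTER_ADDR and MASTER_PORT are assumed to be set in the job script.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

# One process per GPU; NCCL backend for GPU-to-GPU communication
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

# Placeholder model standing in for an actual architecture
model = torch.nn.Linear(16, 1).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

# Gradients are all-reduced across all ranks automatically on backward()
loss = ddp_model(torch.randn(8, 16, device=local_rank)).sum()
loss.backward()

dist.destroy_process_group()
```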

Luthaf commented 3 weeks ago

Does this handle multiple GPUs in a single computer without SLURM? If not, could we get it to work?

frostedoyster commented 3 weeks ago

@Luthaf No, it doesn't handle that. In the case you mentioned (e.g. one machine with two GPUs, or one HPC node where multiple GPUs are available but only one process is created), the correct way to access multi-GPU training would be through our multi-gpu device option (which is not implemented for any architecture at the moment, however). Since the implementation would be quite different there (probably involving DataParallel from torch instead of DistributedDataParallel), I feel like that's for a different PR.
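
For context, the single-process multi-GPU path described above would look roughly like the following with `torch.nn.DataParallel` (a generic sketch, not metatrain code; the model here is a placeholder):

```python
import torch
from torch.nn import DataParallel

# Placeholder model; its parameters must live on the first visible GPU
model = torch.nn.Linear(16, 1).to("cuda")

if torch.cuda.device_count() > 1:
    # A single process drives all visible GPUs: each forward pass splits the
    # batch across devices and gathers the outputs back on the default GPU
    model = DataParallel(model)

output = model(torch.randn(8, 16, device="cuda"))
```

Since everything stays in one process, no launcher or scheduler is needed, which is what makes it fit the workstation case, at the cost of the scaling limits mentioned below.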

EDIT: actually, it should be quite easy to implement that for PET, because the PET fitting function already handles this case; it should just be a matter of passing the right arguments to PET. (Still for a different PR, though.) The main downsides of DataParallel in practice are that it doesn't scale past one node (which in practice limits you to 2-4 GPUs on HPC clusters) and that it's quite a lot slower than DistributedDataParallel, according to torch.

Luthaf commented 3 weeks ago

Ok, fair enough for the DataParallel approach when working on a single node. I agree that it will be slower than DistributedDataParallel, but the point is to enable new use cases (i.e. a user with multiple GPUs in a workstation, without access to MPI or schedulers). All of this can wait for a later PR (it might be worth opening a new issue for it).

frostedoyster commented 3 weeks ago

Ok, I will open an issue for multi-GPU training using DataParallel. I agree regarding the tests: they're there, but they require SLURM and multiple GPUs to run.