frostedoyster closed this 3 weeks ago
Does this handle multiple GPU in a computer without slurm? If not, could we get it to work?
@Luthaf No, it doesn't handle that. In the case you mentioned (e.g. one machine with two GPUs, or one HPC node where multiple GPUs are available but only one process is created), the correct way to access multi-GPU training would be through our `multi-gpu` device option (which is not implemented for any architecture at the moment, however). Since the implementation would be quite different there (probably involving `DataParallel` from torch instead of `DistributedDataParallel`), I feel like that's for a different PR.
EDIT: actually it should be quite easy to implement that for PET, because the PET fitting function already handles that case; it should just be a matter of passing the right arguments to PET. (Still for a different PR though.) The main downsides of `DataParallel` in practice are that it doesn't scale past one node (which in practice limits you to 2 or 4 GPUs on HPC clusters) and that it's quite a lot slower than `DistributedDataParallel`, according to the torch documentation.
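For context, the single-process route is a one-line wrapper around the model. A minimal sketch (the model and batch below are placeholders, not metatrain code):

```python
import torch

# Placeholder model; in practice this would be the architecture being trained.
model = torch.nn.Linear(4, 2)
if torch.cuda.is_available():
    model = model.cuda()

# DataParallel splits each input batch across all visible GPUs within a
# single process and gathers the outputs; with no GPUs it simply runs the
# wrapped module unchanged.
model = torch.nn.DataParallel(model)

inputs = torch.randn(8, 4)
outputs = model(inputs)
print(tuple(outputs.shape))  # (8, 2)
```

This is why it is limited to one node: all GPUs must be visible to the same process, unlike `DistributedDataParallel`, which runs one process per GPU.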
Ok, fair enough for the `DataParallel` approach when working on a single node. I agree that it will be slower than `DistributedDataParallel`, but the point is to enable new use cases (i.e. a user with multiple GPUs in a workstation, without access to MPI or schedulers). All of this can wait for a later PR (it might be worth opening a new issue for it).
Ok, I will open an issue for `multi-gpu` using `DataParallel`.
I agree regarding the tests: they're there, but they require slurm and multiple GPUs to run.
Reopening #179
Ran fine on 16 H100s from CSCS across 4 nodes
📚 Documentation preview 📚: https://metatrain--239.org.readthedocs.build/en/239/