lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

CUDA Error when loading checkpoint on more than one GPU #19

Open agonzalezd opened 2 years ago

agonzalezd commented 2 years ago

Hello.

I am having an issue when using your code. If I try to resume training from a checkpoint with more than one GPU (I am using Docker containers), I get the following error:

  File "__main__.py", line 55, in <module>
    main(parser.parse_args())
  File "__main__.py", line 39, in main
    spawn(train_distributed, args=(replica_count, port, args, params), nprocs=replica_count, join=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/diffwave/src/diffwave/learner.py", line 188, in train_distributed
    _train_impl(replica_id, model, dataset, args, params)
  File "/opt/diffwave/src/diffwave/learner.py", line 163, in _train_impl
    learner.restore_from_checkpoint()
  File "/opt/diffwave/src/diffwave/learner.py", line 95, in restore_from_checkpoint
    checkpoint = torch.load(f'{self.model_dir}/{filename}.pt')
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 584, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 842, in _load
    result = unpickler.load()
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 834, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 823, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 174, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 156, in _cuda_deserialize
    return obj.cuda(device)
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 77, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

But when starting from scratch, or when using a single GPU, this error does not appear and training runs flawlessly.

I should add that I checked the GPUs were completely free when launching the training.

Any advice on this issue?

Thanks in advance.

sharvil commented 2 years ago

Hmm I haven't run across that error before. Sorry, I don't think I'll be of much help here.

agonzalezd commented 2 years ago

It somehow spawns multiple processes on a single GPU, but only on one of them. I am launching the training on 4 GPUs: three of them each spawn a single process, but one of them spawns 4 processes. If I switch to 3 GPUs, the same thing happens: one of the GPUs spawns 3 processes. I cannot find anything in the code that forces this behaviour...
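
Looking at the traceback again, `restore_from_checkpoint` calls `torch.load` without a `map_location`, so every replica deserializes the checkpoint tensors back onto the GPU they were saved from (usually `cuda:0`). If the GPUs are running in exclusive compute mode, that could explain both the extra processes piling onto one GPU and the "all CUDA-capable devices are busy or unavailable" error. A minimal sketch of a possible workaround follows; the `replica_id`-to-device mapping is my assumption, not the repo's exact code:

```python
import torch

def restore_checkpoint_to_device(model_dir, filename, replica_id):
    # Hypothetical helper (not DiffWave's actual code): map the saved CUDA
    # tensors onto this replica's own GPU instead of the GPU they were saved
    # from. Without map_location, torch.load() restores each tensor onto its
    # original device (typically cuda:0), so all spawned processes end up on
    # that one GPU.
    device = torch.device(f'cuda:{replica_id}')
    return torch.load(f'{model_dir}/{filename}.pt', map_location=device)
```

Alternatively, loading with `map_location='cpu'` and moving the state dict onto the local device afterwards would avoid touching the other GPUs during deserialization entirely.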