sihyun-yu / PVDM

Official PyTorch implementation of Video Probabilistic Diffusion Models in Projected Latent Space (CVPR 2023).
https://sihyun.me/PVDM
MIT License

torch.multiprocessing.spawn hangs #11

Closed · dialuser closed this 1 year ago

dialuser commented 1 year ago

When I try to run the first stage, the code hangs at `torch.multiprocessing.spawn(fn=first_stage, args=(args, ), nprocs=args.n_gpus)`.

After switching to a single GPU the code ran, but I kept getting an out-of-memory error even after reducing `channels` from 384 to 48. The paper says the model fits on a single card.

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.70 GiB total capacity; 20.26 GiB already allocated; 1.34 GiB free; 21.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
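One low-effort thing to try before shrinking the model further is the allocator hint from the error message itself. A minimal sketch (the 128 MiB value is an arbitrary starting point, not from the repo):

```python
import os

# Cap the caching allocator's split size to reduce fragmentation, as the
# OOM message suggests. The variable must be set before the first CUDA
# allocation, so set it before importing torch / building the model.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # the allocator reads the env var when CUDA initializes
print(torch.cuda.is_available())
```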

Here's my model configuration:

```yaml
model:
  resume: False
  amp: True
  base_learning_rate: 1.0e-4
  params:
    embed_dim: 4
    lossconfig:
      params:
        disc_start: 100000000

    ddconfig:
      double_z: False
      channels: 48
      resolution: 128
      timesteps: 8
      skip: 1
      in_channels: 1
      out_ch: 1
      num_res_blocks: 2
      attn_resolutions: []
      splits: 1
```

dialuser commented 1 year ago

I'm using `batch_size=8`; is that too large?

sihyun-yu commented 1 year ago

For the multi-GPU issue, please try `rm -rf .torch_distributed_init` and re-run the code with multiple GPUs.
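Why this can help, as a hedged sketch (the function body and argument names below are illustrative, not necessarily the repo's exact code): if the process group is initialized with a `file://` store backed by `.torch_distributed_init`, a file left over from a crashed run can make new workers wait for ranks that never arrive, which looks like a hang inside `torch.multiprocessing.spawn`.

```python
import os
import torch
import torch.distributed as dist

# Hypothetical worker mirroring a first_stage entry point: spawn calls it
# once per rank. With a file:// init_method, all ranks rendezvous through
# .torch_distributed_init; a stale file from a previous run still holds old
# rank records, so fresh workers block forever waiting to synchronize.
def first_stage(rank, args):
    dist.init_process_group(
        backend="nccl",
        init_method=f"file://{os.path.abspath('.torch_distributed_init')}",
        world_size=args.n_gpus,
        rank=rank,
    )
    torch.cuda.set_device(rank)
    # ... training loop ...
    dist.destroy_process_group()
```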

Yes, the batch size seems too large; could you try a smaller one?
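If a smaller per-step batch hurts training, gradient accumulation is one standard way to keep the effective batch size at 8 while lowering peak memory. A minimal self-contained sketch (the toy model and step counts are illustrative, not PVDM's code):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
micro_batch, accum_steps = 2, 4  # 2 samples x 4 steps = effective batch of 8

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(micro_batch, 16, device=device)
    # Divide by accum_steps so the accumulated gradient averages over the
    # effective batch rather than summing micro-batch gradients.
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```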