junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

Not using more than 2 GPUs #1505

Open stevekangLunit opened 1 year ago

stevekangLunit commented 1 year ago

Hi. I am trying to train the model.

Currently, I am training the model with the following configuration:

- `num_threads = 64`
- `batch_size = 2`
- `load_size = 1024`
- `crop_size = 512`
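
(For reference, my understanding of the last two options: with the default `--preprocess resize_and_crop`, each image is first resized to `load_size` and then a random `crop_size` patch is taken, roughly like the torchvision sketch below. This is just an illustration of what the options mean, not the repo's actual preprocessing code.)

```python
# Illustration of what load_size / crop_size control under the default
# resize_and_crop preprocessing; not the repo's actual transform code.
import torchvision.transforms as T

load_size, crop_size = 1024, 512
preprocess = T.Compose([
    T.Resize((load_size, load_size)),  # resize to 1024x1024
    T.RandomCrop(crop_size),           # random 512x512 patch goes to the network
    T.ToTensor(),
])
```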

In this setting, it only uses 2 GPUs.

If I increase the batch size, the number of GPUs in use increases accordingly.

That is, a batch size of 4 results in 4 GPUs being used, a batch size of 8 results in 8 GPUs, and so on.

However, if I increase the batch size beyond 2, a CUDA out-of-memory error is raised.

How can I increase the batch size? Is decreasing the load_size the only option?
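
My guess at what is happening: the network is wrapped in `torch.nn.DataParallel`, which splits each batch along dimension 0 across the GPUs listed in `--gpu_ids`, so a batch of 2 can keep at most 2 GPUs busy. A minimal standalone sketch of that behaviour (plain PyTorch, not this repo's code):

```python
# Rough illustration (not this repo's code): nn.DataParallel scatters the batch
# along dim 0, so with batch_size=2 only two GPU replicas ever receive data.
import torch
import torch.nn as nn

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    model = nn.DataParallel(nn.Conv2d(3, 8, 3, padding=1).cuda(),
                            device_ids=list(range(torch.cuda.device_count())))
    x = torch.randn(2, 3, 512, 512).cuda()  # batch of 2 -> split into 2 chunks of 1
    y = model(x)                            # only 2 of the GPUs do any work
    print(y.shape)
```

Note also that DataParallel gathers outputs back on the first GPU, so GPU 0 typically needs noticeably more memory than the others as the batch grows.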

Thank you

Bala93 commented 1 year ago

Did you specify the GPUs through --gpu_ids?

ArielKes commented 1 year ago

I encountered the same issue and passed the gpu_ids like this: --gpu_ids 0,1,2,3. When I check with the nvidia-smi command, it seems that only one GPU is being utilized. Am I doing something wrong?
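
In case it helps anyone debugging the same thing, a quick sanity check (plain PyTorch, nothing repo-specific) to confirm the process can actually see all four devices:

```python
# Confirm how many GPUs the process can actually see before blaming the
# training script or its --gpu_ids handling.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```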

deryagol commented 1 year ago

I had issues with multi-GPU as well: I could not use batch sizes higher than the number of GPUs without getting memory errors. Two things resolved that problem:

1) The batch_size given to torch.utils.data.DataLoader should be per worker, so I changed the code under data/__init__.py as follows:

```python
# excerpt from data/__init__.py; the rest of the file is unchanged
class CustomDatasetDataLoader():
    """Wrapper class of Dataset class that performs multi-threaded data loading"""

    def __init__(self, opt):
        """Initialize this class

        Step 1: create a dataset instance given the name [dataset_mode]
        Step 2: create a multi-threaded data loader.
        """
        self.opt = opt
        dataset_class = find_dataset_using_name(opt.dataset_mode)
        self.dataset = dataset_class(opt)
        print("dataset [%s] was created" % type(self.dataset).__name__)
        # split the requested batch size across the data-loading workers
        per_worker_batch_size = opt.batch_size // int(opt.num_threads) if opt.batch_size >= int(opt.num_threads) else 1
        self.dataloader = torch.utils.data.DataLoader(
            self.dataset,
            batch_size=per_worker_batch_size,
            shuffle=not opt.serial_batches,
            num_workers=int(opt.num_threads),
            pin_memory=True)
```

2) Then, for high batch sizes, I ran into a shared-memory problem. Since I am using Docker, I simply increased --shm-size to 1G when starting the container. Now I can use all the GPUs with a batch size that fits in their memory. Hope that helps. Of course, don't forget to set --gpu_ids 0,1,2,3 and batch_size = per_worker_batch_size * num_workers (16, for example).
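
If it's useful, a quick way to confirm the shared-memory size from inside the container (plain Python, Linux only; DataLoader workers pass batches back to the main process through /dev/shm):

```python
# Check the size of /dev/shm from inside the container (Linux only);
# DataLoader workers use shared memory to hand batches to the main process.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print("shm total: %.2f GiB, free: %.2f GiB" % (total / 2**30, free / 2**30))
```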

junyanz commented 1 year ago

For crop_size=512, batch_size=2 will require a significant amount of GPU memory. I am not sure how much memory you have per GPU.

I am not sure if you need to divide the batch_size by the number of threads before feeding it to DataLoader.

A permanent solution is to replace DataParallel with DDP (DistributedDataParallel). But we currently don't have the capacity to add it. Any pull request regarding it would be greatly appreciated.
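
For anyone interested in working on that, here is a bare-bones sketch of the usual DDP pattern (generic PyTorch launched with torchrun; this is not code from this repo, and the model/dataset are placeholders):

```python
# Generic DistributedDataParallel pattern (not this repo's code); assumes
# launch via:  torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # placeholder model/dataset; a real port would build the CycleGAN nets here
    model = torch.nn.Conv2d(3, 3, 3, padding=1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(64, 3, 256, 256))
    sampler = DistributedSampler(dataset)        # each rank sees a disjoint shard
    loader = DataLoader(dataset, batch_size=2, sampler=sampler, num_workers=4)

    opt = torch.optim.Adam(model.parameters(), lr=2e-4)
    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for (x,) in loader:
            x = x.cuda(local_rank, non_blocking=True)
            loss = model(x).mean()               # dummy loss for the sketch
            opt.zero_grad()
            loss.backward()
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With DDP, batch_size is per process, so 8 processes with batch_size=2 give an effective batch of 16 while each GPU only ever holds 2 images, which is exactly what would address the memory limit discussed above.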

wcyjerry commented 1 year ago

> For crop_size=512, batch_size=2 will require a significant amount of GPU memory. I am not sure how much memory you have per GPU.
>
> I am not sure if you need to divide the batch_size by the number of threads before feeding it to DataLoader.
>
> A permanent solution is to replace DataParallel with DDP (DistributedDataParallel). But we currently don't have the capacity to add it. Any pull request regarding it would be greatly appreciated.

Does this mean DDP is not supported right now?