jingyuanli001 / RFR-Inpainting

The source code for CVPR 2020 accepted paper "Recurrent Feature Reasoning for Image Inpainting"
MIT License

Slow training performance, GPU is not fully utilized #45

Closed · Cristy94 closed this issue 3 years ago

Cristy94 commented 3 years ago

I am fine-tuning the CelebA model on a GTX 1080 Ti + Ryzen 7 2700X.

It looks like GPU utilization is not at a steady 100% (as I get with some other projects I tested, mostly TensorFlow): [screenshot: GPU utilization graph]

Any reason why the GPU is not at 100% for the entire training process? Any way to improve this?

It takes 97 seconds for 50 iterations, which I think is a lot.

I am using random masks (mask mode 1); could that be the issue?
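For anyone re-checking that number: CUDA calls are asynchronous, so a wall-clock measurement should call torch.cuda.synchronize() before starting and stopping the clock. A minimal helper (just a sketch; time_gpu and the names in the usage example are placeholders, not from this repo):

    import time
    import torch

    def time_gpu(fn, label="block"):
        """Time a callable that launches CUDA work, counting queued kernels too."""
        torch.cuda.synchronize()   # drain work already queued on the GPU
        start = time.time()
        fn()
        torch.cuda.synchronize()   # wait for the kernels fn() queued to finish
        elapsed = time.time() - start
        print("%s: %.2f s" % (label, elapsed))
        return elapsed

Called as, say, time_gpu(lambda: [train_one_iteration() for _ in range(50)], "50 iterations"), where train_one_iteration stands in for one forward + update_parameters step.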

Cristy94 commented 3 years ago

I think the issue is here:

    gt_images, masks = self.__cuda__(*items)
    masked_images = gt_images * masks
    self.forward(masked_images, masks, gt_images)
    self.update_parameters()
    self.iter += 1

I am not really familiar with PyTorch, but shouldn't there be a way to pre-compute this stuff, or make it more streamlined, so the GPU doesn't wait for the masked images to be computed?
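For reference, the usual PyTorch way to hide that wait (not what this repo does; a sketch that assumes a DataLoader built with pin_memory=True and borrows the forward/update_parameters calls from the loop above) is to copy each batch with non_blocking=True, so the host-to-GPU transfer and the masking multiply get queued while the GPU is still busy with the previous step:

    import torch
    from torch.utils.data import DataLoader

    def train_with_overlap(model, dataset, batch_size=6, num_workers=4):
        """Sketch: pinned host memory + non_blocking copies overlap CPU->GPU transfer with compute."""
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                            num_workers=num_workers, pin_memory=True)
        for gt_images, masks in loader:
            # These copies are queued on the CUDA stream; the CPU does not block here,
            # so it can immediately go on to dispatch the next operations.
            gt_images = gt_images.cuda(non_blocking=True)
            masks = masks.cuda(non_blocking=True)
            masked_images = gt_images * masks   # runs on the GPU, also queued asynchronously
            model.forward(masked_images, masks, gt_images)
            model.update_parameters()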

Cristy94 commented 3 years ago

My guess was that the DataLoader was slowing training down because it only uploads batch_size tensors to the GPU at a time. I thought a solution would be to upload many more images to the GPU at once and then consume them from there, so the CPU->GPU transfer happens less often.

I implemented a solution like this:

    while self.iter < iters:
        for items in train_loader:
            # Upload the whole oversized batch (batch_preload_count * batch_size images) at once
            gt_image_batch, mask_batch, masked_image_batch = self.__cuda__(*items)
            batch_size = 6  # should really come from args.batch_size

            # Train on batch_size-sized slices of the preloaded batch
            for batch_idx in range(batch_preload_count):
                left = batch_idx * batch_size
                right = min(left + batch_size, gt_image_batch.size(0))
                gt_image = gt_image_batch[left:right]
                mask = mask_batch[left:right]
                masked_image = masked_image_batch[left:right]

                if gt_image.size(0) == 0:
                    break

                self.forward(masked_image, mask, gt_image)
                self.update_parameters()
                self.iter += 1

So the DataLoader loads batch_preload_count * batch_size images per batch, self.__cuda__(*items) uploads them all to the GPU at once, and then each training iteration selects a batch_size slice of the preloaded images and forwards the network on it.

I also changed run.py so this extra batch_preload_count argument can be passed through:

    dataloader = DataLoader(
        Dataset(args.data_root, args.mask_root, args.mask_mode, args.target_size, mask_reverse=True),
        batch_size=args.batch_size * args.batch_preload_count,
        shuffle=True,
        num_workers=args.n_threads,
    )
    model.train(dataloader, args.model_save_path, args.finetune, args.num_iters, args.batch_preload_count)

TL;DR:

So I patch the batch size as batch_size = args.batch_size * args.batch_preload_count. Training then works like this (for batch_preload_count = 3, for example):

    DATA_LOADER BATCH: [batch_size, batch_size, batch_size]
    UPLOAD TO GPU
    PYTHON FOREACH batch_size IN DATA_LOADER_BATCH:
        FORWARD NETWORK(batch_size)

Unfortunately it didn't seem to improve iters/second in any way on my machine.

BUT: Doing something like this allows you to preload more images to the GPU, so if you have more VRAM you can use it by changing the batch_preload_count argument.
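One likely reason the preloading didn't change iters/second: with num_workers > 0 the DataLoader already prepares upcoming batches in background worker processes, so decoding usually isn't what the GPU waits on. The knobs that control this are regular torch.utils.data.DataLoader options (a sketch mirroring the run.py call above; prefetch_factor and persistent_workers need PyTorch >= 1.7, and run.py does not currently set them):

    # Same DataLoader construction as in run.py, with the extra prefetching options.
    dataloader = DataLoader(
        Dataset(args.data_root, args.mask_root, args.mask_mode, args.target_size, mask_reverse=True),
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.n_threads,    # workers load and preprocess batches in the background
        pin_memory=True,               # page-locked buffers enable async CPU->GPU copies
        prefetch_factor=2,             # batches each worker keeps ready ahead of time
        persistent_workers=True,       # keep workers alive between epochs
    )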

Another optimization I tried was to save masked_image in the dataloader itself instead of computing it in Python on every iteration (in theory this should be more efficient, since the masked_image can then be built in a DataLoader worker).
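The idea, roughly (a sketch with placeholder loader callables, not the exact code in the fork):

    from torch.utils.data import Dataset

    class MaskedInpaintingDataset(Dataset):
        """Sketch: build masked_image inside __getitem__, so the multiply runs in a
        DataLoader worker process instead of in the training loop."""
        def __init__(self, image_paths, mask_paths, load_image, load_mask):
            self.image_paths = image_paths
            self.mask_paths = mask_paths
            self.load_image = load_image   # callable: path -> CHW float tensor in [0, 1]
            self.load_mask = load_mask     # callable: path -> CHW tensor of 0s and 1s

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            gt_image = self.load_image(self.image_paths[idx])
            mask = self.load_mask(self.mask_paths[idx % len(self.mask_paths)])
            masked_image = gt_image * mask   # computed on the CPU, inside the worker
            return gt_image, mask, masked_image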

Later edit: After further inspection it looks like the GPU's CUDA engine is actually 100% utilized (?): [screenshot: GPU CUDA utilization]

Cristy94 commented 3 years ago

I am closing this issue, as the GPU usage seems to indeed be 100% for the CUDA cores.

If anyone is interested in the loading performance improvements that I tried to make, check out my fork: https://github.com/Cristy94/RFR-Inpainting