jingyuanli001 / RFR-Inpainting

The source code for CVPR 2020 accepted paper "Recurrent Feature Reasoning for Image Inpainting"
MIT License

Slow training performance, GPU is not fully utilized #45

Closed · Cristy94 closed this issue 3 years ago

Cristy94 commented 3 years ago

I am fine-tuning the CelebA model on a GTX 1080 Ti + Ryzen 7 2700X.

It looks like GPU utilization is not at a steady 100% (as I get with some other projects I tested, mostly TensorFlow): [screenshot: GPU utilization graph]

Any reason why the GPU is not at 100% for the entire training process? Any way to improve this?

It takes 97 seconds for 50 iterations, which I think is a lot.

I am using random masks (mask mode 1); could that be the issue?
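For anyone re-checking that number: CUDA calls are asynchronous, so a wall-clock measurement should call torch.cuda.synchronize() before starting and stopping the clock. A minimal helper (just a sketch; time_gpu and the names in the usage example are placeholders, not from this repo):

    import time
    import torch

    def time_gpu(fn, label="block"):
        """Time a callable that launches CUDA work, counting queued kernels too."""
        torch.cuda.synchronize()   # drain work already queued on the GPU
        start = time.time()
        fn()
        torch.cuda.synchronize()   # wait for the kernels fn() queued to finish
        elapsed = time.time() - start
        print("%s: %.2f s" % (label, elapsed))
        return elapsed

Called as, say, time_gpu(lambda: [train_one_iteration() for _ in range(50)], "50 iterations"), where train_one_iteration stands in for one forward + update_parameters step.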

Cristy94 commented 3 years ago

I think the issue is here:

    gt_images, masks = self.__cuda__(*items)
    masked_images = gt_images * masks
    self.forward(masked_images, masks, gt_images)
    self.update_parameters()
    self.iter += 1

I am not really familiar with PyTorch, but shouldn't there be a way to pre-compute this stuff, or make it more streamlined, so the GPU doesn't wait for the masked images to be computed?
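For reference, the usual PyTorch way to hide that wait (not what this repo does; a sketch that assumes a DataLoader built with pin_memory=True and borrows the forward/update_parameters calls from the loop above) is to copy each batch with non_blocking=True, so the host-to-GPU transfer and the masking multiply get queued while the GPU is still busy with the previous step:

    import torch
    from torch.utils.data import DataLoader

    def train_with_overlap(model, dataset, batch_size=6, num_workers=4):
        """Sketch: pinned host memory + non_blocking copies overlap CPU->GPU transfer with compute."""
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                            num_workers=num_workers, pin_memory=True)
        for gt_images, masks in loader:
            # These copies are queued on the CUDA stream; the CPU does not block here,
            # so it can immediately go on to dispatch the next operations.
            gt_images = gt_images.cuda(non_blocking=True)
            masks = masks.cuda(non_blocking=True)
            masked_images = gt_images * masks   # runs on the GPU, also queued asynchronously
            model.forward(masked_images, masks, gt_images)
            model.update_parameters()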

Cristy94 commented 3 years ago

My guess was that the DataLoader was slowing training down because it only uploads batch_size tensors to the GPU at a time. I thought a solution would be to upload many more images to the GPU at once and then consume them from there, so the CPU->GPU transfer happens less often.

I implemented a solution like this:

    while self.iter < iters:
        for items in train_loader:
            # Upload the whole oversized batch (batch_preload_count * batch_size images) at once
            gt_image_batch, mask_batch, masked_image_batch = self.__cuda__(*items)
            batch_size = 6  # should really come from args.batch_size

            # Train on batch_size-sized slices of the preloaded batch
            for batch_idx in range(batch_preload_count):
                left = batch_idx * batch_size
                right = min(left + batch_size, gt_image_batch.size(0))
                gt_image = gt_image_batch[left:right]
                mask = mask_batch[left:right]
                masked_image = masked_image_batch[left:right]

                if gt_image.size(0) == 0:
                    break

                self.forward(masked_image, mask, gt_image)
                self.update_parameters()
                self.iter += 1

So the DataLoader loads batch_preload_count * batch_size images per batch, self.__cuda__(*items) uploads them all to the GPU at once, and then each training iteration selects a batch_size slice of the preloaded images and forwards the network on it.

I also changed run.py so this extra batch_preload_count argument can be passed through:

    dataloader = DataLoader(
        Dataset(args.data_root, args.mask_root, args.mask_mode, args.target_size, mask_reverse=True),
        batch_size=args.batch_size * args.batch_preload_count,
        shuffle=True,
        num_workers=args.n_threads,
    )
    model.train(dataloader, args.model_save_path, args.finetune, args.num_iters, args.batch_preload_count)

TL;DR:

So I patch the batch size as batch_size = args.batch_size * args.batch_preload_count. Training then works like this (for batch_preload_count = 3, for example):

    DATA_LOADER BATCH: [batch_size, batch_size, batch_size]
    UPLOAD TO GPU
    PYTHON FOREACH batch_size IN DATA_LOADER_BATCH:
        FORWARD NETWORK(batch_size)

Unfortunately it didn't seem to improve iters/second in any way on my machine.

BUT: Doing something like this allows you to preload more images to the GPU, so if you have more VRAM you can use it by changing the batch_preload_count argument.
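One likely reason the preloading didn't change iters/second: with num_workers > 0 the DataLoader already prepares upcoming batches in background worker processes, so decoding usually isn't what the GPU waits on. The knobs that control this are regular torch.utils.data.DataLoader options (a sketch mirroring the run.py call above; prefetch_factor and persistent_workers need PyTorch >= 1.7, and run.py does not currently set them):

    # Same DataLoader construction as in run.py, with the extra prefetching options.
    dataloader = DataLoader(
        Dataset(args.data_root, args.mask_root, args.mask_mode, args.target_size, mask_reverse=True),
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.n_threads,    # workers load and preprocess batches in the background
        pin_memory=True,               # page-locked buffers enable async CPU->GPU copies
        prefetch_factor=2,             # batches each worker keeps ready ahead of time
        persistent_workers=True,       # keep workers alive between epochs
    )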

Another optimization I tried was to save masked_image in the dataloader itself instead of computing it in Python on every iteration (in theory this should be more efficient, since the masked_image can then be built in a DataLoader worker).
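The idea, roughly (a sketch with placeholder loader callables, not the exact code in the fork):

    from torch.utils.data import Dataset

    class MaskedInpaintingDataset(Dataset):
        """Sketch: build masked_image inside __getitem__, so the multiply runs in a
        DataLoader worker process instead of in the training loop."""
        def __init__(self, image_paths, mask_paths, load_image, load_mask):
            self.image_paths = image_paths
            self.mask_paths = mask_paths
            self.load_image = load_image   # callable: path -> CHW float tensor in [0, 1]
            self.load_mask = load_mask     # callable: path -> CHW tensor of 0s and 1s

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            gt_image = self.load_image(self.image_paths[idx])
            mask = self.load_mask(self.mask_paths[idx % len(self.mask_paths)])
            masked_image = gt_image * mask   # computed on the CPU, inside the worker
            return gt_image, mask, masked_image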

Later edit: After further inspection it looks like the GPU's CUDA engine is actually 100% utilized (?): [screenshot: GPU CUDA utilization]

Cristy94 commented 3 years ago

I am closing this issue, as the GPU usage seems to indeed be 100% for the CUDA cores.

If anyone is interested in the loading performance improvements that I tried to make, check out my fork: https://github.com/Cristy94/RFR-Inpainting