jnjaby / DISCNet

Code for DISCNet.

CUDA out of memory during training #10

Closed · Jian-danai closed this issue 3 years ago

Jian-danai commented 3 years ago

Hi, which GPU do you use for training? Do you have any suggestions for this issue?

Update: I changed gt_size (in DISCNet_train.yml) from 256 to 200, and the training code now seems to work. But I am not sure whether such a modification is recommended?

jnjaby commented 3 years ago

We use a Tesla V100 with 32 GB of memory. For GPUs with less memory, you can reduce the batch size or patch size to avoid this issue. A patch size of 200 should work as long as the size is a multiple of 4.
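For reference, the change could be scripted as below. This is only a minimal sketch: the nesting of gt_size and batch_size_per_gpu under datasets/train, and the file paths, are assumptions about the BasicSR-style layout of DISCNet_train.yml, so check them against your own copy of the config.

```python
# Minimal sketch: shrink the training patch and batch size in DISCNet_train.yml
# to reduce GPU memory use. Key paths and file locations are assumptions.
import yaml

with open('options/train/DISCNet_train.yml') as f:   # path is an assumption
    opt = yaml.safe_load(f)

train_ds = opt['datasets']['train']
train_ds['gt_size'] = 200              # was 256; must stay a multiple of 4
train_ds['batch_size_per_gpu'] = 4     # halve the batch if memory is still tight

assert train_ds['gt_size'] % 4 == 0, 'patch size must be a multiple of 4'

with open('options/train/DISCNet_train_small.yml', 'w') as f:
    yaml.safe_dump(opt, f, sort_keys=False)
```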

Jian-danai commented 3 years ago

> We use a Tesla V100 with 32 GB of memory. For GPUs with less memory, you can reduce the batch size or patch size to avoid this issue. A patch size of 200 should work as long as the size is a multiple of 4.

Will adjusting batch size or patch size affect the performance of the final model? Thanks.

jnjaby commented 3 years ago

In my experience, the patch size and batch size can influence the final performance, but the effect should not be significant since there is no batch norm layer in our network. You can refer to Sec. 4 in the supplementary file of ESRGAN.

Jian-danai commented 3 years ago

> In my experience, the patch size and batch size can influence the final performance, but the effect should not be significant since there is no batch norm layer in our network. You can refer to Sec. 4 in the supplementary file of ESRGAN.

Hi, I noticed that total_epoch = (total_iteration * batch_size_per_gpu * world_size) / (train_set_len * dataset_enlarge_ratio), where world_size = num_gpu. In your default setting, total_iteration = 1000000, batch_size_per_gpu = 8, world_size = 2, and dataset_enlarge_ratio = 20, while train_set_len (18144) always remains the same if I use the data you provided. If I use only 1 GPU for training, it seems that I should double the batch size or double the number of iterations to keep total_epoch the same?

jnjaby commented 3 years ago

There is a subtle difference between iteration-based and epoch-based training. The iteration count determines the number of parameter updates, while the epoch count is the number of times the whole dataset is used. In our experience, the exact number of training epochs does not matter much, since datasets for low-level vision are commonly small (1k~10k images). Anyway, it depends on your own preference. If you care more about the total number of epochs, just double the batch size or the number of training iterations.
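To make the arithmetic above concrete, here is a small sketch using the numbers quoted in this thread (1,000,000 iterations, batch size 8 per GPU, 2 GPUs, enlarge ratio 20, 18144 training pairs). The helper function name is purely illustrative.

```python
# Sketch of the epoch arithmetic discussed above; defaults taken from this thread.
def total_epoch(total_iteration, batch_size_per_gpu, world_size,
                train_set_len=18144, dataset_enlarge_ratio=20):
    """total_epoch = (iters * batch * num_gpu) / (len(train_set) * enlarge_ratio)."""
    return (total_iteration * batch_size_per_gpu * world_size) / (
        train_set_len * dataset_enlarge_ratio)

default = total_epoch(1_000_000, 8, 2)    # ~44.1 epochs with 2 GPUs (default setting)
one_gpu = total_epoch(1_000_000, 8, 1)    # ~22.0 epochs with a single GPU
matched = total_epoch(1_000_000, 16, 1)   # doubling the batch size restores ~44.1
# doubling total_iteration to 2_000_000 at batch size 8 would do the same
print(default, one_gpu, matched)
```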

Jian-danai commented 3 years ago

> There is a subtle difference between iteration-based and epoch-based training. The iteration count determines the number of parameter updates, while the epoch count is the number of times the whole dataset is used. In our experience, the exact number of training epochs does not matter much, since datasets for low-level vision are commonly small (1k~10k images). Anyway, it depends on your own preference. If you care more about the total number of epochs, just double the batch size or the number of training iterations.

Thank you.