out of memory training - Githubissues

reinaldomaslim commented 6 years ago

Hi guys, I tried to train on a GTX 1080 Ti, Ubuntu 16.04, CUDA 8, cuDNN 6 with image size 640x480 and this config:

    train:
      snapshot_save_iterations: 5000 # How often do you want to save trained models
      image_save_iterations: 2500 # How often do you want to save output images during training
      image_display_iterations: 100
      display: 1 # How often do you want to log the training stats
      snapshot_prefix: ../outputs/unit/night2day/ # Where do you want to save the outputs
      hyperparameters:
        trainer: COCOGANTrainer
        lr: 0.0001 # learning rate
        ll_direct_link_w: 100 # weight on the self L1 reconstruction loss
        kl_direct_link_w: 0.1 # weight on VAE encoding loss
        ll_cycle_link_w: 100 # weight on the cycle L1 reconstruction loss
        kl_cycle_link_w: 0.1 # weight on the cycle L1 reconstruction loss
        gan_w: 10 # weight on the adversarial loss
        batch_size: 1 # image batch size per domain
        max_iterations: 500000 # maximum number of training epochs
        gen:
          name: COCOResGen
          ch: 64 # base channel number per layer
          input_dim_a: 3
          input_dim_b: 3
          n_enc_front_blk: 3
          n_enc_res_blk: 3
          n_enc_shared_blk: 1
          n_gen_shared_blk: 1
          n_gen_res_blk: 3
          n_gen_front_blk: 3
        dis:
          name: COCODis
          ch: 64
          input_dim_a: 3
          input_dim_b: 3
          n_layer: 6
      datasets:
        train_a: # Domain 1 dataset
          channels: 3 # image channel number
          scale: 1 # scaling factor for scaling image before processing
          crop_image_height: 480 # crop image size
          crop_image_width: 640 # crop image size
          class_name: dataset_image # dataset class name
          root: ../datasets/sg/ # dataset folder location
          folder: night/
          list_name: lists/night.txt # image list
        train_b: # Domain 2 dataset
          channels: 3 # image channel number
          scale: 1 # scaling factor for scaling image before processing
          crop_image_height: 480 # crop image size
          crop_image_width: 640 # crop image size
          class_name: dataset_image
          root: ../datasets/sg/
          folder: sunny/
          list_name: lists/sunny.txt

However, I encountered an out-of-memory error on the first iteration:

    self.display=1
    dataset_image dataset=dataset_image(conf)
    dataset_image dataset=dataset_image(conf)
    Iteration: 00000001/00500000
    THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
    Traceback (most recent call last):
      File "cocogan_train.py", line 88, in <module>
        main(sys.argv)
      File "cocogan_train.py", line 64, in main
        image_outputs = trainer.gen_update(images_a, images_b, config.hyperparameters)
      File "/media/ml3/Volume/UNIT/src/trainers/cocogan_trainer.py", line 71, in gen_update
        total_loss.backward()
      File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py", line 156, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
      File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/__init__.py", line 98, in backward
        variables, grad_variables, retain_graph)
      File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/function.py", line 91, in apply
        return self._forward_cls.backward(self, *args)
      File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/_functions/basic_ops.py", line 210, in backward
        return grad_output.mul(ctx.constant).mul(var.pow(ctx.constant - 1)), None
      File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py", line 339, in mul
        return Mul.apply(self, other)
      File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/_functions/basic_ops.py", line 48, in forward
        return a.mul(b)
    RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCStorage.cu:66

Thanks for your comments!

reinaldomaslim commented 6 years ago

btw, I made my images smaller so that training fits in GPU memory
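A rough sketch of why smaller images help, in case it is useful to others hitting the same error. This is not taken from the UNIT code: `feature_map_mib` is a hypothetical helper, and it only counts a single float32 activation tensor at full resolution with `ch: 64` channels. The real network keeps many such tensors (plus gradients) alive during `backward()`, so actual usage is much higher, but the scaling with image area is the same.

```python
def feature_map_mib(height, width, channels, batch_size=1, bytes_per_value=4):
    """Size in MiB of one activation tensor of shape
    (batch_size, channels, height, width), assuming float32 values."""
    total_bytes = batch_size * channels * height * width * bytes_per_value
    return total_bytes / (1024 ** 2)


# Crop size from the config above (480x640) with ch: 64.
full = feature_map_mib(480, 640, 64)

# Hypothetical smaller crop: half the height and half the width.
half = feature_map_mib(240, 320, 64)

print(full, half, full / half)  # prints: 75.0 18.75 4.0
```

Halving both `crop_image_height` and `crop_image_width` cuts each full-resolution activation to a quarter of its size, which is usually where most of the memory goes with `batch_size: 1`.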

omrysendik commented 6 years ago

Hi @reinaldomaslim,

Can you please share how you handled this issue?

mingyuliutw / UNIT

out of memory training #28