Hi guys, I tried to train on a GTX 1080 Ti (Ubuntu 16.04, CUDA 8, cuDNN 6) with 640×480 images and the following config:
```yaml
train:
  snapshot_save_iterations: 5000 # how often to save trained models
  image_save_iterations: 2500    # how often to save output images during training
  image_display_iterations: 100
  display: 1                     # how often to log the training stats
  snapshot_prefix: ../outputs/unit/night2day/ # where to save the outputs
  hyperparameters:
    trainer: COCOGANTrainer
    lr: 0.0001             # learning rate
    ll_direct_link_w: 100  # weight on the self L1 reconstruction loss
    kl_direct_link_w: 0.1  # weight on the VAE encoding loss
    ll_cycle_link_w: 100   # weight on the cycle L1 reconstruction loss
    kl_cycle_link_w: 0.1   # weight on the cycle KL encoding loss
    gan_w: 10              # weight on the adversarial loss
    batch_size: 1          # image batch size per domain
    max_iterations: 500000 # maximum number of training iterations
    gen:
      name: COCOResGen
      ch: 64               # base channel number per layer
      input_dim_a: 3
      input_dim_b: 3
      n_enc_front_blk: 3
      n_enc_res_blk: 3
      n_enc_shared_blk: 1
      n_gen_shared_blk: 1
      n_gen_res_blk: 3
      n_gen_front_blk: 3
    dis:
      name: COCODis
      ch: 64
      input_dim_a: 3
      input_dim_b: 3
      n_layer: 6
  datasets:
    train_a: # domain 1 dataset
      channels: 3               # image channel number
      scale: 1                  # scaling factor applied to the image before processing
      crop_image_height: 480    # crop image size
      crop_image_width: 640     # crop image size
      class_name: dataset_image # dataset class name
      root: ../datasets/sg/     # dataset folder location
      folder: night/
      list_name: lists/night.txt # image list
    train_b: # domain 2 dataset
      channels: 3
      scale: 1
      crop_image_height: 480
      crop_image_width: 640
      class_name: dataset_image
      root: ../datasets/sg/
      folder: sunny/
      list_name: lists/sunny.txt
```
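For a sense of scale (my own back-of-envelope estimate, not from the UNIT code): a single full-resolution 640×480 feature map with `ch: 64` channels in float32 is already about 75 MB, and the two generators plus a 6-layer discriminator keep many such maps, and their gradients, alive at once during `backward()`:

```python
# Rough float32 memory footprint of one activation map of shape (ch, h, w).
def feature_map_mb(h, w, ch, bytes_per_el=4):
    return h * w * ch * bytes_per_el / 1024 ** 2

# One full-resolution map at the settings above:
print(feature_map_mb(480, 640, 64))  # 75.0 (MB)
```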
However, I encountered an out-of-memory error:
```
self.display=1
dataset_image
dataset=dataset_image(conf)
dataset_image
dataset=dataset_image(conf)
Iteration: 00000001/00500000
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "cocogan_train.py", line 88, in
    main(sys.argv)
  File "cocogan_train.py", line 64, in main
    image_outputs = trainer.gen_update(images_a, images_b, config.hyperparameters)
  File "/media/ml3/Volume/UNIT/src/trainers/cocogan_trainer.py", line 71, in gen_update
    total_loss.backward()
  File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/function.py", line 91, in apply
    return self._forward_cls.backward(self, *args)
  File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/_functions/basic_ops.py", line 210, in backward
    return grad_output.mul(ctx.constant).mul(var.pow(ctx.constant - 1)), None
  File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/variable.py", line 339, in mul
    return Mul.apply(self, other)
  File "/home/ml3/.conda/envs/torch/lib/python2.7/site-packages/torch/autograd/_functions/basic_ops.py", line 48, in forward
    return a.mul(b)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCStorage.cu:66
```
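One workaround I am considering, assuming the `scale` field does what its comment says (resize the image before processing), is halving the input resolution in both dataset sections:

```yaml
train_a:
  scale: 0.5             # resize 640x480 -> 320x240 before processing
  crop_image_height: 240 # crop sizes adjusted to match
  crop_image_width: 320
```

Is that the recommended way to fit 640×480 training into 11 GB, or should I shrink the networks instead?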
Thanks for your comments!