chihchiehchen closed this issue 8 months ago
Hi,
Thank you for the interest. In my experience running the code on a single V100, I have not encountered the issue you mentioned above. I would suggest getting an interactive session on the server and running the code there to see whether you still observe this behaviour.
Hello,
Thanks for your reply. Though I did not figure out the root cause, I finally commented out the logger and everything works well now, so I guess the problem came from a conflict between the logger and the HPC's default stdout. In any case, thanks a lot!
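For anyone hitting the same symptom, one way to sidestep the conflict is to keep the training log out of stdout entirely. Below is a minimal sketch assuming a standard Python `logging` setup; `get_file_logger` and the hard-coded file name are placeholders, not the repository's actual logger code:

```python
import logging

def get_file_logger(name="train", log_path="print_log.txt"):
    # Placeholder logger setup: write to a file instead of stdout so the
    # training log does not interleave with the HPC job's captured output.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False          # do not also emit through the root logger
    if not logger.handlers:           # guard against adding duplicate handlers
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger

logger = get_file_logger()
logger.info("iteration 100: written once, even if the setup code runs twice")
```

Writing to a dedicated file and guarding against duplicate handlers avoids both the interleaving with the scheduler's captured stdout and repeated log lines.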
I have another question regarding speed (sorry, I just want to double-check that I installed and am running your dciknn code correctly). I am running the colourization code with 721 training examples and 721 test examples; after around 4 days it is at iteration 152600 (epoch 7), and the validation metrics are psnr: 2.2893e+01 and lpips: 1.2618e-01. Am I on the right track?
Thanks again for your kind help!
Hi,
The training time looks reasonable, though slightly longer than I remember for reaching 150k iterations. One thing to note is that I would typically set "batch_size_per_month" to 400, which means the model generates samples for and trains on at most 400 real data points at a time. This makes training faster than generating samples for all 721 data examples at once. You can refer to the example config for more detailed hyperparameter settings.
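To illustrate what that setting does, here is a rough sketch of the chunked training loop it implies; `generate_samples` and `train_on_chunk` are placeholder stand-ins, not the actual functions in this repository:

```python
import math

# Placeholder helpers standing in for the project's sampling and training steps.
def generate_samples(chunk):
    return [f"sample_for_{x}" for x in chunk]

def train_on_chunk(chunk, samples):
    pass  # one training pass over this chunk would go here

def train_one_epoch(dataset, batch_size_per_month=400):
    num_chunks = math.ceil(len(dataset) / batch_size_per_month)
    for c in range(num_chunks):
        # Generate samples for, and train on, at most 400 real data points
        # at a time rather than all 721 examples at once.
        chunk = dataset[c * batch_size_per_month:(c + 1) * batch_size_per_month]
        train_on_chunk(chunk, generate_samples(chunk))

train_one_epoch(list(range(721)))  # two chunks per epoch: 400 + 321 examples
```

With 721 training examples and a chunk size of 400, each epoch processes two chunks, so sample generation never has to cover the full dataset at once.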
Hello,
Thanks for the help again. I am relieved that my setup is correct, so now I can focus on my research topics. Thanks again!
Hello,
Thanks for sharing your creative work. I want to reproduce the colourization results and am running your code on a single V100 GPU on an HPC server. However, at the beginning of print_log.txt I found something very strange:
It looks like the training steps are repeated several times, but I do not find anything related to parallel/distributed training in train.py. Is this normal, or did I do something wrong? Could you give me some suggestions?
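For what it's worth, a quick runtime check like the following (just a diagnostic sketch, not something taken from train.py) should help rule out distributed training as the cause:

```python
import torch
import torch.distributed as dist

# Diagnostic sketch: report whether any parallel/distributed machinery is active.
print("visible GPUs:     ", torch.cuda.device_count())
print("dist available:   ", dist.is_available())
print("dist initialized: ", dist.is_available() and dist.is_initialized())
# A single GPU with no initialized process group points away from duplicated
# training and towards a logging issue (e.g. duplicate handlers).
```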
Thanks for your help.