chihchiehchen closed this issue 8 months ago
Hi,
Thank you for the interest. In my experience running the code on a single V100, I have not encountered the issue you mentioned above. I would suggest getting an interactive session on the server and running the code there to see whether you still observe this behaviour.
Hello,
Thanks for your reply. Though I did not figure out the root cause, I finally commented out the logger and everything works well now, so I guess the problem came from a conflict between the logger and the HPC's default stdout. In any case, thanks a lot!
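For anyone hitting the same symptom, one way to sidestep the conflict is to keep the training log out of stdout entirely. Below is a minimal sketch assuming a standard Python `logging` setup; `get_file_logger` and the hard-coded file name are placeholders, not the repository's actual logger code:

```python
import logging

def get_file_logger(name="train", log_path="print_log.txt"):
    # Placeholder logger setup: write to a file instead of stdout so the
    # training log does not interleave with the HPC job's captured output.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False          # do not also emit through the root logger
    if not logger.handlers:           # guard against adding duplicate handlers
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger

logger = get_file_logger()
logger.info("iteration 100: written once, even if the setup code runs twice")
```

Writing to a dedicated file and guarding against duplicate handlers avoids both the interleaving with the scheduler's captured stdout and repeated log lines.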
I have another question regarding speed (sorry, I just want to double-check that I installed and am running your dciknn code correctly). I am running the colourization code with 721 training examples and 721 test examples; after around 4 days it is at iteration 152600 (epoch 7), and the validation metrics are psnr: 2.2893e+01 and lpips: 1.2618e-01. Am I on the right track?
Thanks again for your kind help!
Hi,
The training time looks reasonable, though slightly longer than I remember for reaching 150k iterations. One thing to note is that I would typically set "batch_size_per_month" to 400, which means the model generates samples for and trains on at most 400 real data points at a time. This makes training faster than generating samples for all 721 data examples at once. You can refer to the example config for more detailed hyperparameter settings.
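To illustrate what that setting does, here is a rough sketch of the chunked training loop it implies; `generate_samples` and `train_on_chunk` are placeholder stand-ins, not the actual functions in this repository:

```python
import math

# Placeholder helpers standing in for the project's sampling and training steps.
def generate_samples(chunk):
    return [f"sample_for_{x}" for x in chunk]

def train_on_chunk(chunk, samples):
    pass  # one training pass over this chunk would go here

def train_one_epoch(dataset, batch_size_per_month=400):
    num_chunks = math.ceil(len(dataset) / batch_size_per_month)
    for c in range(num_chunks):
        # Generate samples for, and train on, at most 400 real data points
        # at a time rather than all 721 examples at once.
        chunk = dataset[c * batch_size_per_month:(c + 1) * batch_size_per_month]
        train_on_chunk(chunk, generate_samples(chunk))

train_one_epoch(list(range(721)))  # two chunks per epoch: 400 + 321 examples
```

With 721 training examples and a chunk size of 400, each epoch processes two chunks, so sample generation never has to cover the full dataset at once.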
Hello,
Thanks for the help again. I am relieved that my setup is correct, so now I can focus on my research topics. Thanks again!
Hello,
Thanks for sharing your creative work. I want to reproduce the colourization results and am running your code on a single V100 GPU on an HPC server. However, at the beginning of print_log.txt I found something very strange:
It looks like the training steps are repeated several times, but I do not find anything related to parallel/distributed training in train.py. Is this normal, or did I do something wrong? Could you give me some suggestions?
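For what it's worth, a quick runtime check like the following (just a diagnostic sketch, not something taken from train.py) should help rule out distributed training as the cause:

```python
import torch
import torch.distributed as dist

# Diagnostic sketch: report whether any parallel/distributed machinery is active.
print("visible GPUs:     ", torch.cuda.device_count())
print("dist available:   ", dist.is_available())
print("dist initialized: ", dist.is_available() and dist.is_initialized())
# A single GPU with no initialized process group points away from duplicated
# training and towards a logging issue (e.g. duplicate handlers).
```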
Thanks for your help.