universome / epigraf

[NeurIPS 2022] Official pytorch implementation of EpiGRAF
https://universome.github.io/epigraf

How should I run the raw source code? #2

Closed tau-yihouxiang closed 2 years ago

tau-yihouxiang commented 2 years ago

Great work and exciting performance! By the way, how should I run the raw source code?

tau-yihouxiang commented 2 years ago

I managed to run the source code. However, the first step took 8 hours, while the subsequent ticks ran normally. Do you know the reason? Thank you in advance!

tick 0     kimg 0.0      time 1m 47s       sec/tick 4.3     sec/kimg 266.75  maintenance 102.4  cpumem 5.05   gpumem 16.20  reserved 17.48  augment 0.000
Evaluating metrics for /mnt/home/taohu/Program/rnf-code/logs/epigraf_256/output ...
{"results": {"fid50k_full": 335.9665625183498}, "metric": "fid50k_full", "total_time": 881.277184009552, "total_time_str": "14m 41s", "num_gpus": 4, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1656167212.192939}
Saving the snapshot... (reason: tick) Saved!
tick 1     kimg 4.0      time 8h 02m 18s   sec/tick 190.4   sec/kimg 47.61   maintenance 28640.7 cpumem 5.44   gpumem 9.37   reserved 10.77  augment 0.001
tick 2     kimg 8.0      time 8h 05m 29s   sec/tick 190.8   sec/kimg 47.71   maintenance 0.1    cpumem 5.44   gpumem 9.37   reserved 10.77  augment 0.000
tick 3     kimg 12.0     time 8h 08m 39s   sec/tick 189.6   sec/kimg 47.41   maintenance 0.8    cpumem 5.44   gpumem 9.37   reserved 10.77  augment 0.000

universome commented 2 years ago

Hi Tau, thank you! I am currently preparing the refactored version of the code, and will try to release it either today or tomorrow.

To be honest, the fact that the first iteration took 8 hours sounds extremely strange, and I have never experienced this. Did you try launching StyleGAN3 (we build heavily on it)? Does it show similar behavior? My suspicion is that it was waiting for some blocked resource, such as an open port (for distributed training) or the compilation cache (used to compile the custom CUDA kernels). Did you try finding the exact piece of code that takes so much time? And does it happen again when you relaunch the training?
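One way to start narrowing this down is to read the per-tick fields in the log above rather than the cumulative `time` column. The sketch below is only an illustration: it assumes the StyleGAN3-style tick format shown in this thread, and `TICK_RE` and `parse_ticks` are hypothetical helper names, not part of the EpiGRAF or StyleGAN3 code. As far as I know, `maintenance` in the StyleGAN3 training loop covers the time spent between ticks (snapshots, metric evaluation), so a large value there points away from the training steps themselves.

```python
import re

# Tick lines copied from the log in this thread (trailing fields trimmed).
LOG = """\
tick 0     kimg 0.0      time 1m 47s       sec/tick 4.3     sec/kimg 266.75  maintenance 102.4  cpumem 5.05
tick 1     kimg 4.0      time 8h 02m 18s   sec/tick 190.4   sec/kimg 47.61   maintenance 28640.7 cpumem 5.44
tick 2     kimg 8.0      time 8h 05m 29s   sec/tick 190.8   sec/kimg 47.71   maintenance 0.1    cpumem 5.44
"""

# Hypothetical pattern: captures the tick index, sec/tick, and maintenance seconds.
TICK_RE = re.compile(
    r"^tick\s+(\d+)\s.*?sec/tick\s+([\d.]+)\s.*?maintenance\s+([\d.]+)"
)


def parse_ticks(log_text):
    """Return (tick, sec_per_tick, maintenance_sec) for every tick line."""
    rows = []
    for line in log_text.splitlines():
        match = TICK_RE.match(line)
        if match:
            rows.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return rows


rows = parse_ticks(LOG)
for tick, sec_per_tick, maintenance_sec in rows:
    print(
        f"tick {tick}: training {sec_per_tick:.1f} s, "
        f"maintenance {maintenance_sec / 3600:.2f} h"
    )
```

For the log in this thread, the tick 1 line reports about 190 s of training but 28640.7 s of maintenance, while the FID evaluation itself reports only 14m 41s. If the assumption about `maintenance` holds, most of the 8 hours was spent between tick 0 and tick 1 outside both the training steps and the reported metric time.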

universome commented 2 years ago

I also noticed that you use fid50k_full as the metric, which generates 50k images from the generator each time it is evaluated. This is quite expensive for a 3D generator, and we always use fid2k_full instead. I will change the default metric in the released version of our code.
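As a rough back-of-the-envelope comparison of the two metrics, the sketch below uses the `total_time` reported for fid50k_full in the log above. The image count for fid2k_full (2,000) is an assumption inferred from the metric name, and the estimate further assumes that image generation dominates the cost and scales linearly; the statistics over real images are not reduced by the switch, so the actual time would likely be higher.

```python
# Number of generated images per metric.
# fid50k_full: 50k, as stated in this thread.
# fid2k_full: 2k is an assumption inferred from the metric name.
NUM_GEN = {"fid50k_full": 50_000, "fid2k_full": 2_000}

# total_time reported for fid50k_full in this thread (4 GPUs), in seconds.
measured_sec = 881.277184009552

# How many times fewer images the cheaper metric generates.
ratio = NUM_GEN["fid50k_full"] / NUM_GEN["fid2k_full"]

# Optimistic estimate: assumes generation dominates and scales linearly.
optimistic_sec = measured_sec / ratio

print(f"fid2k_full generates {ratio:.0f}x fewer images")
print(f"optimistic per-evaluation time: {optimistic_sec:.1f} s")
```

Because the metric is re-evaluated at every snapshot, the saving accumulates over the whole run.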

tau-yihouxiang commented 2 years ago

Thank you for the information!

universome commented 2 years ago

Hi! We uploaded the source code several hours ago.

tau-yihouxiang commented 2 years ago

Great