Closed · Mengzibin closed this 2 years ago
Hi! Thanks for your attention.
You don't have to worry about "total progress", just check the "stage" (iteration)! The "total progress" above is from our baseline pi-GAN. Just like you, when I first saw this in pi-GAN, I thought training would never end :(
In my experience, a total of 140,000 iterations (60,000 at 32x32 and 80,000 at 64x64) was enough to train our model at 64x64.
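For reference, this schedule corresponds to a pi-GAN-style curriculum along these lines: the integer keys are the iterations at which each stage begins, and the entry at the last key ends training. The field names follow the pi-GAN curriculum format, and the batch sizes here are illustrative, not the exact values in curriculum.py:

```python
# Illustrative pi-GAN-style curriculum (values are assumptions, not the
# exact contents of curriculum.py):
CelebA = {
    0:      {'img_size': 32, 'batch_size': 28},  # 32x32 stage: iterations 0-60k
    60000:  {'img_size': 64, 'batch_size': 14},  # 64x64 stage: iterations 60k-140k
    140000: {},                                  # training stops at 140k iterations
}
```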
-Jeong-gi
Hi! Thanks for your answer.
I have another question now that I have turned my attention to iterations. When the first stage finished and training was about to enter the next stage, it raised this error:
RuntimeError: CUDA out of memory. Tried to allocate 336.00 MiB (GPU 0; 23.70 GiB total capacity; 18.86 GiB already allocated; 36.56 MiB free; 19.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The first stage ran fine, though. I have tried modifying the batch size, but nothing improved. Each GPU has 24,000 MiB of memory, and the whole experiment used 5 GPUs.
Also, does "a total of 140,000 iterations" mean I need to run 60,000 iterations (one stage) when I set the image size to 32x32, and then 80,000 iterations when I set the image size to 64x64?
Thanks again for your answer.
If you meet OOM, you need to reduce your batch size (per GPU) in curriculum.py. In addition, for multi-GPU training please refer to pi-GAN, because we modified some parts of the pi-GAN implementation for single-GPU training.
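As a concrete sketch (the 128 MiB cap and the batch sizes below are assumptions, not tested values), you can combine the allocator setting suggested by the error message with a smaller per-GPU batch size for the stage that runs out of memory:

```python
import os

# Allocator tweak suggested by the OOM message itself; it must be set before
# the first CUDA allocation (the 128 MiB cap is an assumed starting point).
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

# In curriculum.py, shrink the per-GPU batch size of the stage that OOMs,
# here the 64x64 stage starting at iteration 60,000 (illustrative values):
CelebA = {
    0:      {'img_size': 32, 'batch_size': 28},
    60000:  {'img_size': 64, 'batch_size': 7},   # halved to fit in 24 GiB per GPU
    140000: {},
}
```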
> Also, does "a total of 140,000 iterations" mean I need to run 60,000 iterations (one stage) when I set the image size to 32x32, and then 80,000 iterations when I set the image size to 64x64?

Yes.
In practice, reproducing the results from the source code is much slower than this article suggests.
My steps were as follows: download the img_align_celeba.zip dataset, unzip it into the specified directory, and run "python train_surf.py --output_dir third --curriculum CelebA".
The progress bar reports that completing this program needs about 813 hours, and I repeated the procedure only to get the same estimate.
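For what it's worth, the 813-hour figure comes from the progress bar's pi-GAN total, not from the actual 140,000-iteration schedule explained above. A back-of-the-envelope estimate with assumed (not measured) per-iteration times shows how different the real number can be:

```python
# Assumed per-iteration times; measure your own on your hardware.
secs = 60_000 * 1.5 + 80_000 * 3.0  # 32x32 stage + slower 64x64 stage
print(f"~{secs / 3600:.0f} hours")  # ~92 hours, far below the bar's 813
```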