@bob80333 Hi Eric, this is a great finding. I originally transcribed this repository from another re-implementation, so it may not be completely accurate. Do you know if the discriminator also caps the channels at what I presume is the latent z dimension? Does this hold if you go even higher, to say 2048?
@bob80333 let me know, and I'll make the change!
Here's the discriminator from the same run on the original repo. I'll try changing the latent dimension to see if the cap is tied to it.
| Layer (D) | Params | OutputShape | WeightShape |
|---|---|---|---|
| images_in | - | (?, 3, 1024, 1024) | - |
| labels_in | - | (?, 0) | - |
| 1024x1024/FromRGB | 128 | (?, 32, 1024, 1024) | (1, 1, 3, 32) |
| 1024x1024/Conv0 | 9248 | (?, 32, 1024, 1024) | (3, 3, 32, 32) |
| 1024x1024/Conv1_down | 18496 | (?, 64, 512, 512) | (3, 3, 32, 64) |
| 1024x1024/Skip | 2048 | (?, 64, 512, 512) | (1, 1, 32, 64) |
| 512x512/Conv0 | 36928 | (?, 64, 512, 512) | (3, 3, 64, 64) |
| 512x512/Conv1_down | 73856 | (?, 128, 256, 256) | (3, 3, 64, 128) |
| 512x512/Skip | 8192 | (?, 128, 256, 256) | (1, 1, 64, 128) |
| 256x256/Conv0 | 147584 | (?, 128, 256, 256) | (3, 3, 128, 128) |
| 256x256/Conv1_down | 295168 | (?, 256, 128, 128) | (3, 3, 128, 256) |
| 256x256/Skip | 32768 | (?, 256, 128, 128) | (1, 1, 128, 256) |
| 128x128/Conv0 | 590080 | (?, 256, 128, 128) | (3, 3, 256, 256) |
| 128x128/Conv1_down | 1180160 | (?, 512, 64, 64) | (3, 3, 256, 512) |
| 128x128/Skip | 131072 | (?, 512, 64, 64) | (1, 1, 256, 512) |
| 64x64/Conv0 | 2359808 | (?, 512, 64, 64) | (3, 3, 512, 512) |
| 64x64/Conv1_down | 2359808 | (?, 512, 32, 32) | (3, 3, 512, 512) |
| 64x64/Skip | 262144 | (?, 512, 32, 32) | (1, 1, 512, 512) |
| 32x32/Conv0 | 2359808 | (?, 512, 32, 32) | (3, 3, 512, 512) |
| 32x32/Conv1_down | 2359808 | (?, 512, 16, 16) | (3, 3, 512, 512) |
| 32x32/Skip | 262144 | (?, 512, 16, 16) | (1, 1, 512, 512) |
| 16x16/Conv0 | 2359808 | (?, 512, 16, 16) | (3, 3, 512, 512) |
| 16x16/Conv1_down | 2359808 | (?, 512, 8, 8) | (3, 3, 512, 512) |
| 16x16/Skip | 262144 | (?, 512, 8, 8) | (1, 1, 512, 512) |
| 8x8/Conv0 | 2359808 | (?, 512, 8, 8) | (3, 3, 512, 512) |
| 8x8/Conv1_down | 2359808 | (?, 512, 4, 4) | (3, 3, 512, 512) |
| 8x8/Skip | 262144 | (?, 512, 4, 4) | (1, 1, 512, 512) |
| 4x4/MinibatchStddev | - | (?, 513, 4, 4) | - |
| 4x4/Conv | 2364416 | (?, 512, 4, 4) | (3, 3, 513, 512) |
| 4x4/Dense0 | 4194816 | (?, 512) | (8192, 512) |
| Output | 513 | (?, 1) | (512, 1) |
| scores_out | - | (?, 1) | - |
| **Total** | **29012513** | | |
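As a sanity check on the Params column: each entry is the weight tensor size implied by WeightShape, plus one bias per output channel (the Skip 1x1 convolutions are bias-free). A small sketch with a hypothetical helper, not code from either repo:

```python
# Sanity check: Params = weights implied by WeightShape, plus biases.
# Conv layers carry one bias per output channel; the Skip 1x1 convs have none.
def conv_params(kh, kw, c_in, c_out, bias=True):
    return kh * kw * c_in * c_out + (c_out if bias else 0)

assert conv_params(3, 3, 512, 512) == 2359808              # e.g. 64x64/Conv0
assert conv_params(1, 1, 512, 512, bias=False) == 262144   # e.g. 64x64/Skip
assert conv_params(3, 3, 513, 512) == 2364416              # 4x4/Conv (513 = 512 + minibatch-stddev channel)
```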
It appears that the cap is separate from the latents:
- Cap: https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py#L315
- Latents: https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py#L254, https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py#L256, https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py#L259, https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py#L309
Also, the 4x4 constant should have the same number of channels as the first convolution (512 channels in the 1024x1024 case).
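For reference, the linked lines reduce to a per-stage channel formula in which the cap (`fmap_max`) is a separate hyperparameter that merely defaults to the same 512 as the latent size. A rough paraphrase of the TF code with its default values, not a line-for-line copy:

```python
import numpy as np

# Paraphrase of the channel schedule in networks_stylegan2.py (defaults shown).
# fmap_max is the cap -- it happens to default to 512, the same as the latent
# size, but the two are independent hyperparameters.
fmap_base, fmap_decay, fmap_min, fmap_max = 16 << 10, 1.0, 1, 512

def nf(stage):
    return np.clip(int(fmap_base / (2.0 ** (stage * fmap_decay))), fmap_min, fmap_max)

# At 1024x1024 (resolution_log2 = 10) the discriminator's FromRGB uses nf(9) = 32,
# and channels double at each downsample until they hit the 512 cap:
print([int(nf(s)) for s in range(9, 0, -1)])  # [32, 64, 128, 256, 512, 512, 512, 512, 512]
```

Under this schedule the generator's learned 4x4 constant is allocated with nf(1), which the cap brings down from 8192 to 512, matching the first convolution as noted above.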
@bob80333 gotcha! thanks for digging into this! I originally built the repository so people can get a first-hand experience of disentanglement, but now I see a bunch of people wanting to reach for higher resolutions, so I'll try to make it scale better!
I was interested in trying out the data augmentations you added, but I wanted to make sure the models were the same first, so I created a 1024-resolution one in a Colab notebook and noticed the difference in channels. I might dig further into the training details to ensure similar results between the NVIDIA implementation and this one, since I don't know of any other implementations with data augmentation.
@bob80333 the data augmentation technique is so simple that it almost doesn't warrant any implementation. Karras's own paper on data augmentation built in a kind of curriculum, adjusting the augmentation based on some heuristic, but other papers did just fine augmenting the images with a fixed probability.
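If it's useful, the fixed-probability variant really is only a few lines. A minimal PyTorch sketch, where `augment` and the horizontal-flip choice are stand-ins of my own rather than anything from this repo:

```python
import torch

def augment(images, prob=0.5):
    # Apply a random horizontal flip to each image with fixed probability `prob`.
    # (Stand-in for whatever augmentation set you choose; the point is that the
    # same augmentation pipeline hits both real and generated batches.)
    flip_mask = torch.rand(images.size(0), device=images.device) < prob
    return torch.where(flip_mask[:, None, None, None], images.flip(-1), images)

# In the training loop, both discriminator inputs go through the same function:
#   d_real = D(augment(real_images))
#   d_fake = D(augment(G(z).detach()))
```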
At 1024x1024 (default settings other than image_size), the channels of this repo's model look like this:
number of parameters: 1237966560
For contrast, the original TensorFlow implementation looks like this at 1024x1024:
In the original implementation, the highest number of channels any convolution has is 512, but in this implementation it grows to 8192?
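For what it's worth, 8192 is exactly what you get by doubling the channel count at each of the eight downsamples from 1024x1024 to 4x4 with no cap. A back-of-envelope sketch (assuming a 32-channel start as in the printout above; not this repo's exact code):

```python
# Without a cap, doubling channels at every downsample from 1024x1024 to 4x4
# explodes the width (hypothetical reconstruction, not this repo's exact code):
channels = [32 * 2 ** i for i in range(9)]
print(channels)  # [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]

# With an fmap_max-style cap like the TF repo's:
print([min(c, 512) for c in channels])  # [32, 64, 128, 256, 512, 512, 512, 512, 512]
```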