Closed (hazard-10 closed 5 months ago)
I'll double-check later. I thought I could avoid SPADE since it was related to the super-res training stage. There is slightly more here than vanilla: I had to rip some things apart and use custom code to restore the weights.
The generator / discriminator does actually train and spit out images. I still need to get the warping / cropping in order, and to add more videos and a dynamic driving video. It's working on 512x512 for the time being.
My latest PR merge seems to be training. The MetaPortrait codebase has a super-res + SPADE module.
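For anyone following along, restoring weights into a modified network can be sketched roughly like this. It is a minimal illustration, not the code used in the repo: `load_matching_weights` is a hypothetical helper, and it assumes the checkpoint is a plain PyTorch state dict whose keys mostly line up with the new model.

```python
import torch
import torch.nn as nn


def load_matching_weights(model, state_dict):
    """Copy only tensors whose name and shape match; return the skipped keys.

    Hypothetical helper for illustration, not part of either codebase.
    """
    own = model.state_dict()
    kept = {
        name: tensor
        for name, tensor in state_dict.items()
        if name in own and tensor.shape == own[name].shape
    }
    # strict=False tolerates the keys that were filtered out above.
    model.load_state_dict(kept, strict=False)
    return sorted(set(state_dict) - set(kept))


# Toy stand-ins: the "pretrained" net and a modified net with a different last layer.
source = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
target = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 3))

skipped = load_matching_weights(target, source.state_dict())
print(skipped)  # ['1.bias', '1.weight']
```

The skipped list is worth logging, since silently dropped layers stay at their random initialization.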
FYI - https://github.com/johndpope/MegaPortrait-hack/issues/36
May not need es
Yeah, I saw that note. Still, VASA-1's encoder generates an "identity code" alongside vapp / z_dyn. I am guessing we still need some vector to represent identity if true disentanglement is the aim, but the identity code could be extracted directly by a few more conv layers that orthographically project the volume into 2D after Eapp's 3D resblocks, instead of by a separate ResNet-50.
When I worked on the Emote paper, they used something similar, but maybe better:
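A rough sketch of that idea, in case it helps the discussion. Everything here is an assumption for illustration: the `IdentityHead` name, the channel/depth sizes, and the choice to "project" by folding the depth axis into channels are not taken from the VASA-1 or MegaPortraits papers.

```python
import torch
import torch.nn as nn


class IdentityHead(nn.Module):
    """Hypothetical head: collapses a 3D appearance volume into one identity vector."""

    def __init__(self, channels=96, depth=16, id_dim=512):
        super().__init__()
        # After folding depth into channels, plain 2D convs can process the volume.
        self.project = nn.Sequential(
            nn.Conv2d(channels * depth, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, id_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(id_dim, id_dim)

    def forward(self, volume):
        b, c, d, h, w = volume.shape
        # (B, C, D, H, W) -> (B, C*D, H, W): the depth axis becomes extra channels.
        flat = volume.reshape(b, c * d, h, w)
        feat = self.pool(self.project(flat)).flatten(1)
        return self.fc(feat)


head = IdentityHead()
v_app = torch.randn(2, 96, 16, 32, 32)  # stand-in for the output of Eapp's 3D resblocks
e_id = head(v_app)
print(e_id.shape)  # torch.Size([2, 512])
```

This shares the 3D trunk with the appearance volume, so the identity vector costs only a few extra layers rather than a second backbone.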
https://github.com/johndpope/Emote-hack/blob/7ee104354d52a5461504c27b9f38d269eac86893/Net.py#L56
I could never get the motion frames to concatenate in the channel dimension.
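For reference, a minimal shape-level sketch of channel-wise concatenation. The tensor sizes and the (batch, frames, channels, H, W) layout of the motion frames are assumptions for illustration, not what the Emote code actually uses.

```python
import torch

batch, channels, height, width = 1, 4, 64, 64
num_motion_frames = 4

current = torch.randn(batch, channels, height, width)
# Assumed layout for the motion frames: (batch, frames, channels, H, W).
motion = torch.randn(batch, num_motion_frames, channels, height, width)

# Fold the frame axis into the channel axis so both tensors are 4D
# and agree on every dimension except dim=1.
motion_flat = motion.reshape(batch, num_motion_frames * channels, height, width)

stacked = torch.cat([current, motion_flat], dim=1)
print(stacked.shape)  # torch.Size([1, 20, 64, 64])
```

`torch.cat` requires all non-concatenated dimensions to match, so a leftover frame axis or a different spatial size will make it fail. The first conv that consumes `stacked` also needs its input channel count widened to match.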
The appendix says the encoder for es uses a ResNet-50 with the custom resblock from Figure 11c, which is the resblock that contains the SPADE norm. I saw you implemented SPADE with the avatar embedding. But in Eapp it seems you just used a vanilla ResNet-50 with no custom resblocks?
I am also a bit confused about whether the appendix is referring to the custom block in Figure 10c or 11c. 11c is used for avatar-specific distillation in the student model, but es is obviously a general latent embedding that can accept arbitrary input. Unless there is an additional mapping somewhere else, it is not possible to feed an out-of-distribution identity into Eapp if that ResNet-50 block also uses SPADE. Plus, the authors mention "where π denotes the dimension of a convolutional layer (either 2D or 3D) and x denotes the number of output channels", and only Figure 10c fits that description.