Confusion about ResNet50 encoder for global descriptor es

johndpope / MegaPortrait-hack

Using Claude Opus to reverse engineer code from MegaPortraits: One-shot Megapixel Neural Head Avatars


Confusion about ResNet50 encoder for global descriptor es #31

Closed hazard-10 closed 5 months ago

hazard-10 commented 5 months ago

The appendix says the encoder for es uses a ResNet50 with the custom resblock at 11c, which is the resblock that contains the SPADE norm. I saw you implemented SPADE with the avatar embedding. But in Eapp it seems you just used a vanilla ResNet50 with no custom resblocks?

I am also a bit confused about whether the appendix is referring to the custom block at 10c or 11c. 11c is used for avatar-specific distillation in the student model, but es is obviously a general latent embedding that can accept arbitrary input. Unless there is an additional mapping somewhere else, it is not possible to feed an out-of-distribution identity into Eapp if that ResNet50 block also uses SPADE. Plus, the authors mention "where 𝑛 denotes the dimension of a convolutional layer (either 2D or 3D) and x denotes the number of output channels", and only Figure 10c fits that description.
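To make the 10c-versus-11c distinction concrete, here is a minimal PyTorch sketch of a plain residual block parameterized by n (2D or 3D convolution) and the number of output channels. The class name `PlainResBlock`, the GroupNorm choice, and the layer order are assumptions for illustration, not the paper's or the repo's actual block. The only point is that a block like this takes the feature map alone, whereas a SPADE-style block would need an extra conditioning input.

```python
import torch
import torch.nn as nn


class PlainResBlock(nn.Module):
    """Hypothetical residual block: (norm -> ReLU -> conv) twice, plus a skip path."""

    def __init__(self, in_channels: int, out_channels: int, n: int = 2):
        super().__init__()
        if n not in (2, 3):
            raise ValueError("n must be 2 (Conv2d) or 3 (Conv3d)")
        conv = nn.Conv2d if n == 2 else nn.Conv3d
        # GroupNorm with one group is an assumption; it works for 4D and 5D inputs.
        self.norm1 = nn.GroupNorm(1, in_channels)
        self.conv1 = conv(in_channels, out_channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(1, out_channels)
        self.conv2 = conv(out_channels, out_channels, kernel_size=3, padding=1)
        # A 1x1 conv matches channel counts on the skip path when they differ.
        if in_channels == out_channels:
            self.skip = nn.Identity()
        else:
            self.skip = conv(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv1(torch.relu(self.norm1(x)))
        h = self.conv2(torch.relu(self.norm2(h)))
        return h + self.skip(x)


# No conditioning tensor is required: the block works on any input image features.
features_2d = PlainResBlock(8, 16, n=2)(torch.randn(1, 8, 16, 16))
features_3d = PlainResBlock(8, 16, n=3)(torch.randn(1, 8, 4, 16, 16))
```

Because the forward pass depends only on `x`, an encoder built from such blocks can accept an arbitrary identity, which is the property the es encoder needs.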

johndpope commented 5 months ago

I'll double-check later. I thought I could avoid SPADE, as it was related to the super-res training stage. There is slightly more than vanilla: I had to rip apart some things and use custom code to restore the weights.

The generator / discriminator does actually train and spit out images. I still need to get the warping / cropping in order, as well as add more videos and a dynamic driving video. It's working at 512x512 for the time being.

https://github.com/johndpope/MegaPortrait-hack/issues/27

johndpope commented 5 months ago

My latest merge of the PR seems like it's training. The MetaPortrait codebase has a super-res + SPADE module.

johndpope commented 5 months ago

FYI - https://github.com/johndpope/MegaPortrait-hack/issues/36

May not need es

hazard-10 commented 5 months ago

> FYI - #36
>
> May not need es

Yeah, I saw that note. Still, VASA-1's encoder generates an "identity code" alongside vapp / z_dyn. I am guessing we still need some vector to represent identity if true disentanglement is the aim, but the identity code could be extracted directly from a few more conv layers that ortho-project the volume into 2D after Eapp's 3D resblocks, instead of from a separate ResNet50.
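A rough PyTorch sketch of that idea, under stated assumptions: the depth axis of the appearance volume is folded into channels (one simple way to get a learned projection to 2D), a couple of 2D convs follow, and global pooling produces the identity vector. `IdentityHead`, the layer widths, and `id_dim` are hypothetical and not taken from VASA-1, MegaPortraits, or this repo.

```python
import torch
import torch.nn as nn


class IdentityHead(nn.Module):
    """Hypothetical head mapping a 3D appearance volume to an identity vector."""

    def __init__(self, channels: int, depth: int, id_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            # Depth is folded into channels, so the first conv sees channels * depth.
            nn.Conv2d(channels * depth, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, id_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # global average pool to (B, id_dim, 1, 1)
        )
        self.fc = nn.Linear(id_dim, id_dim)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = volume.shape
        flat = volume.reshape(b, c * d, h, w)  # (B, C, D, H, W) -> (B, C*D, H, W)
        return self.fc(self.net(flat).flatten(1))


# Example with a small made-up volume shape.
head = IdentityHead(channels=8, depth=4, id_dim=64, hidden=32)
identity_code = head(torch.randn(2, 8, 4, 16, 16))
```

Whether a head like this disentangles identity from expression as well as a separate encoder does is an open question that only training would answer.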

johndpope commented 5 months ago

When I worked on the Emote paper, they used something similar but maybe better:

https://github.com/johndpope/Emote-hack/blob/7ee104354d52a5461504c27b9f38d269eac86893/Net.py#L56

I could never get the motion frames to concatenate in the channel dimension.
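For reference, one common way to concatenate frames along the channel dimension in PyTorch is sketched below. It assumes the motion frames arrive as a single tensor of shape (B, T, C, H, W), which may not match the linked Emote-hack code; the function name is hypothetical.

```python
import torch


def stack_frames_in_channels(frames: torch.Tensor) -> torch.Tensor:
    """Turn (B, T, C, H, W) into (B, T*C, H, W), keeping frames in time order."""
    if frames.dim() != 5:
        raise ValueError("expected a 5D tensor shaped (B, T, C, H, W)")
    b, t, c, h, w = frames.shape
    return frames.reshape(b, t * c, h, w)


frames = torch.randn(2, 4, 3, 32, 32)  # 4 RGB motion frames per sample
stacked = stack_frames_in_channels(frames)

# Equivalent formulation: concatenate the per-frame tensors on dim=1.
stacked_via_cat = torch.cat([frames[:, i] for i in range(frames.shape[1])], dim=1)
```

Any layer that consumes the result then needs `in_channels = T * C` (12 in this example), which is a frequent source of shape errors with this approach.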