This repository aims to reproduce Animate Anyone, which animates a reference character into a temporally consistent video using pose sequences as the guiding prior.
The primary dataset is Jafarian's TikTok dataset, which contains TikTok dance videos along with per-frame pose extractions.
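Below is a minimal sketch of how such frame/pose pairs could be loaded for the single-frame stage. The directory layout (`<root>/<video_id>/images`, `<root>/<video_id>/poses`), file extensions, and transforms are assumptions for illustration, not this repo's actual loader.

```python
import random
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class TikTokFramePoseDataset(Dataset):
    """Yields a target frame, its pose map, and a reference frame from the same clip."""

    def __init__(self, root, size=512):
        self.samples = []  # (target frame path, pose path, all frame paths in the clip)
        for video_dir in sorted(Path(root).iterdir()):
            if not video_dir.is_dir():
                continue
            frames = sorted((video_dir / "images").glob("*.png"))
            poses = sorted((video_dir / "poses").glob("*.png"))
            for frame, pose in zip(frames, poses):
                self.samples.append((frame, pose, frames))
        self.tf = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
            transforms.Normalize([0.5] * 3, [0.5] * 3),  # map to [-1, 1] for the VAE
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        frame_path, pose_path, clip_frames = self.samples[idx]
        ref_path = random.choice(clip_frames)  # reference frame sampled from the same clip
        return {
            "image": self.tf(Image.open(frame_path).convert("RGB")),
            "pose": self.tf(Image.open(pose_path).convert("RGB")),
            "reference": self.tf(Image.open(ref_path).convert("RGB")),
        }
```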
The architecture follows Animate Anyone directly and uses the Stable Diffusion 1.5 UNet as the core denoising network.
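As a sketch of what that backbone looks like in code, the components could be pulled from the Stable Diffusion 1.5 checkpoint via diffusers/transformers. The hub IDs below and the "ReferenceNet starts as a copy of the UNet" detail follow the Animate Anyone paper, not code from this repo, and are assumptions here.

```python
import copy

from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from transformers import CLIPVisionModelWithProjection

model_id = "runwayml/stable-diffusion-v1-5"  # assumed hub ID for SD 1.5

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
denoising_unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
reference_net = copy.deepcopy(denoising_unet)  # ReferenceNet initialized as a clone of the SD UNet
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Animate Anyone conditions on CLIP *image* embeddings of the reference frame rather than
# text; the specific image-encoder checkpoint below is an assumption.
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
```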
Initializations - pose guider: Gaussian-initialized convolutions, except for the final projection, which is a zero convolution.
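A minimal sketch of that initialization scheme is below: a lightweight conv encoder with Gaussian-initialized weights and a zero-initialized final convolution, so the pose branch contributes nothing at the start of training. The channel counts, strides, and the 4-channel output (to match the SD noised latent) are assumptions loosely following the paper, not this repo's exact module.

```python
import torch.nn as nn


class PoseGuider(nn.Module):
    """Encodes a pose image to the latent resolution; final layer is a zero convolution."""

    def __init__(self, out_channels=4, channels=(16, 32, 64, 128), strides=(1, 2, 2, 2)):
        super().__init__()
        layers, c_in = [], 3
        for c_out, s in zip(channels, strides):  # downsample by 8 overall to match SD latents
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=s, padding=1), nn.SiLU()]
            c_in = c_out
        self.encoder = nn.Sequential(*layers)
        self.proj = nn.Conv2d(c_in, out_channels, kernel_size=3, padding=1)
        self._init_weights()

    def _init_weights(self):
        for m in self.encoder.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, std=0.02)  # Gaussian init
                nn.init.zeros_(m.bias)
        nn.init.zeros_(self.proj.weight)  # zero convolution: output starts at zero
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose):
        return self.proj(self.encoder(pose))
```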
Step 1 - exclude the temporal layers in the video net; freeze the VAE encoder-decoder and the CLIP encoder; train on single frames, including the ReferenceNet and pose guider, to denoise the target image (per-stage freezing sketch below).
Step 2 - introduce the temporal layers in the video net and fix the weights of the rest of the network.
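The following is a sketch of the per-stage freezing described above; the function and module names are placeholders for whatever this repo actually calls them.

```python
def configure_trainable(stage, vae, image_encoder, denoising_unet,
                        reference_net, pose_guider, temporal_layers=None):
    """Set requires_grad per training stage (hypothetical helper, not this repo's API)."""
    # VAE and CLIP encoder stay frozen in both stages
    for module in (vae, image_encoder):
        module.requires_grad_(False)

    if stage == 1:
        # single-frame stage: train the denoising UNet (no temporal layers yet),
        # the ReferenceNet, and the pose guider
        for module in (denoising_unet, reference_net, pose_guider):
            module.requires_grad_(True)
    elif stage == 2:
        # temporal stage: only the newly inserted temporal layers are trained
        for module in (denoising_unet, reference_net, pose_guider):
            module.requires_grad_(False)
        temporal_layers.requires_grad_(True)
```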
Stability training run - AdamW optimizer, batch size 2048, learning-rate warmup over 10,000 steps, then constant.
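A sketch of that optimizer and schedule (linear warmup to the peak LR over 10,000 steps, then held constant) is below; the peak learning rate, betas, and weight decay are assumptions, not values quoted in this section, and a 2048 effective batch would typically be reached via gradient accumulation across devices.

```python
import torch


def build_optimizer(params, lr=1e-5, warmup_steps=10_000):
    # lr, betas, and weight_decay are assumed values, not this run's exact settings
    optimizer = torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.999), weight_decay=1e-2)
    # linear warmup to the peak LR over `warmup_steps`, then constant thereafter
    lr_lambda = lambda step: min(1.0, (step + 1) / warmup_steps)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```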