johndpope / Emote-hack

Emote Portrait Alive - using AI to reverse-engineer code from the white paper. (abandoned)
https://github.com/johndpope/VASA-1-hack

Implement the model architecture: #2

Closed: johndpope closed this issue 7 months ago

johndpope commented 7 months ago

Build the core neural network modules (VAE, ReferenceNet, feature extractor, etc.). Ensure that each module is independently testable.
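
A useful convention here is to give every module a shape-level smoke test before wiring anything together. A minimal sketch of that convention (the class name, channel sizes, and test below are placeholders, not the paper's actual interfaces):

```python
import torch
import torch.nn as nn


class ReferenceNetStub(nn.Module):
    """Placeholder standing in for the real ReferenceNet; only exercises tensor plumbing."""

    def __init__(self, in_channels: int = 4, feature_dim: int = 320):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, feature_dim, kernel_size=3, padding=1)

    def forward(self, reference_latents: torch.Tensor) -> torch.Tensor:
        return self.conv(reference_latents)


def test_reference_net_stub_shapes():
    # Latent-space input: (batch, channels, height / 8, width / 8).
    x = torch.randn(2, 4, 64, 64)
    out = ReferenceNetStub()(x)
    assert out.shape == (2, 320, 64, 64)
```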

johndpope commented 7 months ago

Random Questions:

First cut of the VAE looks wrong: https://github.com/johndpope/Emote-hack/blob/main/VAEEncoder.py doesn't have the UNet conditioning.

reference_unet: UNet2DConditionModel, denoising_unet: UNet3DConditionModel,

https://github.com/MooreThreads/Moore-AnimateAnyone/blob/master/train_stage_1.py#L52
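
For reference, a minimal sketch of the two-UNet pairing that stage-1 script sets up, assuming a stock Stable Diffusion 1.5 checkpoint and the standard diffusers UNet2DConditionModel; the 3D denoising UNet in Moore-AnimateAnyone is their own class, so it is only indicated in a comment (loader name assumed):

```python
from diffusers import UNet2DConditionModel

# 2D reference UNet loaded from a standard SD 1.5 checkpoint (assumed base model).
reference_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# denoising_unet = UNet3DConditionModel.from_pretrained_2d(...)
# ^ their custom 3D class, which inflates the 2D weights into a temporal UNet;
#   the exact loading helper is an assumption, see their repo for the real call.
```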

How can I create a simple backbone network?

for "iteratively denoise" - can I use the signal to noise code here? https://github.com/MooreThreads/Moore-AnimateAnyone/blob/master/train_stage_1.py#L97

johndpope commented 7 months ago

animate-anyone class inventory (245 KB): https://gist.github.com/johndpope/3af3e1466c32d9540295d460e0309cf9

BasicTransformerBlock
TemporalBasicTransformerBlock
TemporalTransformer3DModelOutput
VanillaTemporalModule
TemporalTransformer3DModel
TemporalTransformerBlock
PositionalEncoding
VersatileAttention
ReferenceAttentionControl
PoseGuider
InflatedConv3d
InflatedGroupNorm
Upsample3D
Downsample3D
ResnetBlock3D
Mish
Transformer2DModelOutput
Transformer2DModel
Transformer3DModelOutput
Transformer3DModel
AutoencoderTinyBlock
UNetMidBlock2D
UNetMidBlock2DCrossAttn
CrossAttnDownBlock2D
DownBlock2D
CrossAttnUpBlock2D
UpBlock2D
UNet2DConditionOutput
UNet2DConditionModel
UNetMidBlock3DCrossAttn
CrossAttnDownBlock3D
DownBlock3D
CrossAttnUpBlock3D
UpBlock3D
UNet3DConditionOutput
UNet3DConditionModel

x2
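
To give a sense of what the "inflated" modules in that list do: InflatedConv3d in AnimateDiff-style codebases is usually just a Conv2d applied frame-wise to a video tensor. The gist's version may differ in detail; this is the common pattern:

```python
import torch.nn as nn
from einops import rearrange


class InflatedConv3d(nn.Conv2d):
    """2D convolution applied independently to every frame of a (b, c, f, h, w) video tensor."""

    def forward(self, x):
        video_length = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")
        x = super().forward(x)
        x = rearrange(x, "(b f) c h w -> b c f h w", f=video_length)
        return x
```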

johndpope commented 7 months ago

https://github.com/bendanzzc/AnimateAnyone-reproduction

ControlNetSDVModel
├── Attributes
│   ├── sample_size
│   ├── in_channels: int
│   ├── out_channels: int
│   ├── conv_in: nn.Conv2d
│   ├── time_proj: Timesteps
│   ├── time_embedding: TimestepEmbedding
│   ├── down_blocks: nn.ModuleList
│   ├── controlnet_down_blocks: nn.ModuleList
│   └── mid_block: UNetMidBlockSpatioTemporal
└── Operations
    ├── __init__
    ├── forward
    ├── from_unet
    ├── attn_processors
    ├── set_attn_processor
    ├── set_default_attn_processor
    └── set_attention_slice

ControlNetOutput
├── Attributes
│   ├── down_block_res_samples: Tuple[torch.Tensor]
│   └── mid_block_res_sample: torch.Tensor
└── Operations
    ├── __init__
    └── __repr__

ControlNetConditioningEmbeddingSVD
├── Attributes
│   ├── conv_in: nn.Conv2d
│   ├── blocks: nn.ModuleList
│   └── conv_out
└── Operations
    ├── __init__
    └── forward

TimestepEmbedding
├── Attributes
│   ├── emb_dim
│   └── max_period
└── Operations
    ├── __init__
    └── __call__

UNetMidBlockSpatioTemporal
├── Attributes
│   ├── in_channels
│   ├── temb_channels
│   └── cross_attention_dim
└── Operations
    ├── __init__
    └── forward

StableVideoDiffusionPipelineControlNet (inherits from DiffusionPipeline)
├── Attributes
│   ├── vae: AutoencoderKLTemporalDecoder
│   ├── image_encoder: CLIPVisionModelWithProjection
│   ├── unet: UNetSpatioTemporalConditionControlNetModel
│   ├── controlnet: ControlNetSDVModel
│   ├── scheduler: EulerDiscreteScheduler
│   ├── feature_extractor: CLIPImageProcessor
│   ├── model_cpu_offload_seq: str
│   ├── _callback_tensor_inputs: List[str]
│   └── image_processor: VaeImageProcessor
└── Methods
    ├── __init__
    ├── _encode_image
    ├── _encode_vae_image
    ├── _get_add_time_ids
    ├── decode_latents
    ├── check_inputs
    ├── prepare_latents
    ├── guidance_scale (property)
    ├── do_classifier_free_guidance (property)
    ├── num_timesteps (property)
    └── __call__

StableVideoDiffusionPipelineOutput (inherits from BaseOutput)
├── Attributes
│   └── frames: Union[List[PIL.Image.Image], np.ndarray]
└── Methods (inherited from BaseOutput)
    ├── __init__
    └── __repr__

Utility Functions
├── _get_add_time_ids
├── _append_dims
├── tensor2vid
├── _resize_with_antialiasing
├── _compute_padding
├── _filter2d
├── _gaussian
└── _gaussian_blur2d
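
Those utility names match the helpers in diffusers' Stable Video Diffusion pipeline; _append_dims, for example, is likely along these lines (a broadcast helper that pads trailing singleton dimensions):

```python
def _append_dims(x, target_dims):
    """Append singleton dims to the end of a tensor until it has target_dims dimensions."""
    dims_to_append = target_dims - x.ndim
    if dims_to_append < 0:
        raise ValueError(f"input has {x.ndim} dims but target_dims is {target_dims}, which is less")
    return x[(...,) + (None,) * dims_to_append]
```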

UNetSpatioTemporalConditionControlNetModel
├── Inherits: ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin
├── Attributes
│   ├── sample_size: Optional[int]
│   ├── in_channels: int
│   ├── out_channels: int
│   ├── down_block_types: Tuple[str]
│   ├── up_block_types: Tuple[str]
│   ├── block_out_channels: Tuple[int]
│   ├── addition_time_embed_dim: int
│   ├── projection_class_embeddings_input_dim: int
│   ├── layers_per_block: Union[int, Tuple[int]]
│   ├── cross_attention_dim: Union[int, Tuple[int]]
│   ├── transformer_layers_per_block: Union[int, Tuple[int], Tuple[Tuple]]
│   ├── num_attention_heads: Union[int, Tuple[int]]
│   ├── num_frames: int
│   ├── upcast_attention: bool
│   ├── conv_in: nn.Conv2d
│   ├── time_proj: Timesteps
│   ├── time_embedding: TimestepEmbedding
│   ├── add_time_proj: Timesteps
│   ├── add_embedding: TimestepEmbedding
│   ├── down_blocks: nn.ModuleList
│   ├── mid_block: UNetMidBlockSpatioTemporal
│   ├── up_blocks: nn.ModuleList
│   ├── conv_norm_out: nn.GroupNorm
│   ├── conv_act: nn.SiLU
│   └── conv_out: nn.Conv2d
└── Operations
    ├── __init__
    ├── attn_processors
    ├── set_attn_processor
    ├── set_default_attn_processor
    ├── _set_gradient_checkpointing
    ├── enable_forward_chunking
    └── forward

UNetSpatioTemporalConditionOutput (Dataclass)
├── Attributes
│   └── sample: torch.FloatTensor
└── Inherits: BaseOutput
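
For orientation, a hedged sketch of how those components would be wired into the pipeline. The custom classes (ControlNetSDVModel, UNetSpatioTemporalConditionControlNetModel, StableVideoDiffusionPipelineControlNet) live in the AnimateAnyone-reproduction repo, so their loaders are assumptions and are left commented out; only the diffusers/transformers classes are stock.

```python
from diffusers import AutoencoderKLTemporalDecoder, EulerDiscreteScheduler
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

svd = "stabilityai/stable-video-diffusion-img2vid"  # assumed base checkpoint

vae = AutoencoderKLTemporalDecoder.from_pretrained(svd, subfolder="vae")
image_encoder = CLIPVisionModelWithProjection.from_pretrained(svd, subfolder="image_encoder")
feature_extractor = CLIPImageProcessor.from_pretrained(svd, subfolder="feature_extractor")
scheduler = EulerDiscreteScheduler.from_pretrained(svd, subfolder="scheduler")

# Repo-specific pieces (loaders assumed, see that repo for the real calls):
# unet = UNetSpatioTemporalConditionControlNetModel.from_pretrained(svd, subfolder="unet")
# controlnet = ControlNetSDVModel.from_unet(unet)  # per the from_unet operation listed above
# pipe = StableVideoDiffusionPipelineControlNet(
#     vae=vae, image_encoder=image_encoder, unet=unet, controlnet=controlnet,
#     scheduler=scheduler, feature_extractor=feature_extractor,
# )
```
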
curui commented 7 months ago

Looking forward to your updates

johndpope commented 7 months ago

thanks for the encouragement.

UPDATE:

AnimateAnyone (MooreThreads) is ripped off from MagicAnimate (ByteDance): https://showlab.github.io/magicanimate/

MagicAnimate actually defines ReferenceNet in its paper.
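
The mechanism, as I read it (and as the ReferenceAttentionControl class in the earlier list suggests), is a write/read pattern over self-attention: the reference UNet caches its hidden states, and the denoising UNet concatenates them into its own self-attention. A toy sketch of the idea, not the actual implementation:

```python
import torch

feature_bank = []


def write_hook(hidden_states: torch.Tensor) -> None:
    # Installed in the reference UNet's self-attention blocks: cache the states.
    feature_bank.append(hidden_states)


def read_hook(hidden_states: torch.Tensor) -> torch.Tensor:
    # Installed in the denoising UNet's matching blocks: attend over
    # [own tokens ; reference tokens], so reference appearance enters via attention.
    reference_states = feature_bank.pop(0)
    return torch.cat([hidden_states, reference_states], dim=1)
```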

Common Classes:
Both files contain several classes in common, such as:

BasicTransformerBlock
CrossAttnDownBlock3D
CrossAttnUpBlock3D
DownBlock3D
Downsample3D
InflatedConv3d
Mish
PositionalEncoding
ResnetBlock3D
TemporalTransformer3DModel
TemporalTransformer3DModelOutput
TemporalTransformerBlock
Transformer2DModel
Transformer2DModelOutput
Transformer3DModel
Transformer3DModelOutput
UNet2DConditionOutput
UNet3DConditionModel
UNet3DConditionOutput
UNetMidBlock3DCrossAttn
UpBlock3D
Upsample3D
VanillaTemporalModule
VersatileAttention

Common Functions in a Shared Class:
The only shared class with common functions identified between the two files is UNet3DConditionModel.

johndpope commented 7 months ago

drafted