Random Questions:
The first cut of the VAE looks wrong. https://github.com/johndpope/Emote-hack/blob/main/VAEEncoder.py doesn't have the UNet conditioning; the Moore-AnimateAnyone stage-1 training script pairs two UNets for this:
reference_unet: UNet2DConditionModel, denoising_unet: UNet3DConditionModel,
https://github.com/MooreThreads/Moore-AnimateAnyone/blob/master/train_stage_1.py#L52
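For orientation, a minimal sketch of that two-UNet pairing, assuming stock diffusers for the 2D half. The checkpoint name and the `from_pretrained_2d` helper are taken from the Moore-AnimateAnyone repo and are assumptions, not verified here:

```python
from diffusers import UNet2DConditionModel

# ReferenceNet pattern: a plain 2D conditional UNet encodes the reference
# image; its per-layer self-attention hidden states are injected into the
# denoising UNet (ReferenceAttentionControl in Moore-AnimateAnyone).
reference_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# The denoising UNet is their custom UNet3DConditionModel, inflated from the
# same 2D weights. `from_pretrained_2d` is their helper, not stock diffusers:
# from src.models.unet_3d import UNet3DConditionModel
# denoising_unet = UNet3DConditionModel.from_pretrained_2d(
#     "runwayml/stable-diffusion-v1-5", subfolder="unet"
# )
```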
How can I create a simple backbone network?
For "iteratively denoise": can I use the signal-to-noise code here? https://github.com/MooreThreads/Moore-AnimateAnyone/blob/master/train_stage_1.py#L97 (The sketch below covers both questions.)
Classes defined in the animate-anyone code dump (245 KB): https://gist.github.com/johndpope/3af3e1466c32d9540295d460e0309cf9 (a sketch of the inflated-conv pattern follows the list)
BasicTransformerBlock
TemporalBasicTransformerBlock
TemporalTransformer3DModelOutput
VanillaTemporalModule
TemporalTransformer3DModel
TemporalTransformerBlock
PositionalEncoding
VersatileAttention
ReferenceAttentionControl
PoseGuider
InflatedConv3d
InflatedGroupNorm
Upsample3D
Downsample3D
ResnetBlock3D
Mish
Transformer2DModelOutput
Transformer2DModel
Transformer3DModelOutput
Transformer3DModel
AutoencoderTinyBlock
UNetMidBlock2D
UNetMidBlock2DCrossAttn
CrossAttnDownBlock2D
DownBlock2D
CrossAttnUpBlock2D
UpBlock2D
UNet2DConditionOutput
UNet2DConditionModel
UNetMidBlock3DCrossAttn
CrossAttnDownBlock3D
DownBlock3D
CrossAttnUpBlock3D
UpBlock3D
UNet3DConditionOutput
UNet3DConditionModel
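Many of these are the standard AnimateDiff-style "inflation" modules that let pretrained 2D weights run on video. As an illustration, a minimal sketch of what InflatedConv3d typically does, written from the class name rather than copied from the gist:

```python
import torch
import torch.nn as nn
from einops import rearrange

class InflatedConv3d(nn.Conv2d):
    """A 2D convolution applied frame-by-frame to a 5D video tensor.
    Pretrained 2D conv weights are reused unchanged; the time axis is
    folded into the batch axis and unfolded afterwards."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        frames = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")
        x = super().forward(x)
        return rearrange(x, "(b f) c h w -> b c f h w", f=frames)
```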
Class structure of https://github.com/bendanzzc/AnimateAnyone-reproduction:
ControlNetSDVModel
├── Attributes
│ ├── sample_size
│ ├── in_channels: int
│ ├── out_channels: int
│ ├── conv_in: nn.Conv2d
│ ├── time_proj: Timesteps
│ ├── time_embedding: TimestepEmbedding
│ ├── down_blocks: nn.ModuleList
│ ├── controlnet_down_blocks: nn.ModuleList
│ └── mid_block: UNetMidBlockSpatioTemporal
└── Operations
├── __init__
├── forward
├── from_unet
├── attn_processors
├── set_attn_processor
├── set_default_attn_processor
└── set_attention_slice
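The `from_unet` operation listed above presumably follows the diffusers ControlNet convention: build the ControlNet from the UNet's config, then copy the encoder-side weights so conditioning starts from pretrained features. A sketch of that convention (not the repo's actual code; the config-compatible constructor is an assumption):

```python
class ControlNetFromUNetSketch:
    """Sketch of the diffusers-style `from_unet` classmethod."""

    @classmethod
    def from_unet(cls, unet):
        controlnet = cls(**unet.config)  # assumes a config-compatible ctor
        # Copy the encoder-side weights from the pretrained UNet.
        controlnet.conv_in.load_state_dict(unet.conv_in.state_dict())
        controlnet.time_embedding.load_state_dict(unet.time_embedding.state_dict())
        controlnet.down_blocks.load_state_dict(unet.down_blocks.state_dict())
        controlnet.mid_block.load_state_dict(unet.mid_block.state_dict())
        return controlnet
```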
ControlNetOutput
├── Attributes
│ ├── down_block_res_samples: Tuple[torch.Tensor]
│ └── mid_block_res_sample: torch.Tensor
└── Operations
├── __init__
└── __repr__
ControlNetConditioningEmbeddingSVD
├── Attributes
│ ├── conv_in: nn.Conv2d
│ ├── blocks: nn.ModuleList
│ └── conv_out
└── Operations
├── __init__
└── forward
TimestepEmbedding
├── Attributes
│ ├── emb_dim
│ └── max_period
└── Operations
├── __init__
└── __call__
UNetMidBlockSpatioTemporal
├── Attributes
│ ├── in_channels
│ ├── temb_channels
│ └── cross_attention_dim
└── Operations
├── __init__
└── forward
StableVideoDiffusionPipelineControlNet (inherits from DiffusionPipeline)
├── Attributes
│ ├── vae: AutoencoderKLTemporalDecoder
│ ├── image_encoder: CLIPVisionModelWithProjection
│ ├── unet: UNetSpatioTemporalConditionControlNetModel
│ ├── controlnet: ControlNetSDVModel
│ ├── scheduler: EulerDiscreteScheduler
│ ├── feature_extractor: CLIPImageProcessor
│ ├── model_cpu_offload_seq: str
│ ├── _callback_tensor_inputs: List[str]
│ └── image_processor: VaeImageProcessor
└── Methods
├── __init__
├── _encode_image
├── _encode_vae_image
├── _get_add_time_ids
├── decode_latents
├── check_inputs
├── prepare_latents
├── guidance_scale (property)
├── do_classifier_free_guidance (property)
├── num_timesteps (property)
└── __call__
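The `_encode_image` step here should match the stock SVD pipeline: run the reference image through the CLIP feature extractor, then use the projected image embedding as cross-attention conditioning. A runnable sketch of that step (the checkpoint name is the one SVD ships with, but treat it as an assumption for this repo):

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"  # assumed encoder
feature_extractor = CLIPImageProcessor.from_pretrained(model_id)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

def encode_image(image):
    # Preprocess to CLIP's expected resolution and normalization.
    inputs = feature_extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = image_encoder(**inputs).image_embeds  # (1, proj_dim)
    # Add a sequence axis so it can serve as cross-attention context.
    return emb.unsqueeze(1)  # (1, 1, proj_dim)
```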
StableVideoDiffusionPipelineOutput (inherits from BaseOutput)
├── Attributes
│ └── frames: Union[List[PIL.Image.Image], np.ndarray]
└── Methods (inherited from BaseOutput)
├── __init__
└── __repr__
Utility Functions
├── _get_add_time_ids
├── _append_dims
├── tensor2vid
├── _resize_with_antialiasing
├── _compute_padding
├── _filter2d
├── _gaussian
└── _gaussian_blur2d
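Most of these utilities mirror the stock SVD pipeline. `_append_dims`, for instance, pads trailing singleton dimensions so per-sample scalars broadcast against video latents; this matches the diffusers helper of the same name:

```python
import torch

def _append_dims(x: torch.Tensor, target_dims: int) -> torch.Tensor:
    """Append trailing singleton dims until x has target_dims dimensions."""
    dims_to_append = target_dims - x.ndim
    if dims_to_append < 0:
        raise ValueError(f"x has {x.ndim} dims, more than target {target_dims}")
    return x[(...,) + (None,) * dims_to_append]

# e.g. broadcast a per-batch guidance scale over (b, f, c, h, w) latents:
scale = _append_dims(torch.tensor([7.5, 7.5]), 5)  # shape (2, 1, 1, 1, 1)
```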
UNetSpatioTemporalConditionControlNetModel
├── Inherits: ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin
├── Attributes
│ ├── sample_size: Optional[int]
│ ├── in_channels: int
│ ├── out_channels: int
│ ├── down_block_types: Tuple[str]
│ ├── up_block_types: Tuple[str]
│ ├── block_out_channels: Tuple[int]
│ ├── addition_time_embed_dim: int
│ ├── projection_class_embeddings_input_dim: int
│ ├── layers_per_block: Union[int, Tuple[int]]
│ ├── cross_attention_dim: Union[int, Tuple[int]]
│ ├── transformer_layers_per_block: Union[int, Tuple[int], Tuple[Tuple]]
│ ├── num_attention_heads: Union[int, Tuple[int]]
│ ├── num_frames: int
│ ├── upcast_attention: bool
│ ├── conv_in: nn.Conv2d
│ ├── time_proj: Timesteps
│ ├── time_embedding: TimestepEmbedding
│ ├── add_time_proj: Timesteps
│ ├── add_embedding: TimestepEmbedding
│ ├── down_blocks: nn.ModuleList
│ ├── mid_block: UNetMidBlockSpatioTemporal
│ ├── up_blocks: nn.ModuleList
│ ├── conv_norm_out: nn.GroupNorm
│ ├── conv_act: nn.SiLU
│ └── conv_out: nn.Conv2d
└── Operations
├── __init__
├── attn_processors
├── set_attn_processor
├── set_default_attn_processor
├── _set_gradient_checkpointing
├── enable_forward_chunking
└── forward
UNetSpatioTemporalConditionOutput (Dataclass)
├── Attributes
│ └── sample: torch.FloatTensor
└── Inherits: BaseOutput
Looking forward to your updates.
Thanks for the encouragement.
UPDATE:
AnimateAnyone (MooreThreads) is ripped off from MagicAnimate (ByteDance): https://showlab.github.io/magicanimate/
MagicAnimate actually defines ReferenceNet in its paper.
Common Classes:
Both files contain several classes in common, such as:
BasicTransformerBlock
CrossAttnDownBlock3D
CrossAttnUpBlock3D
DownBlock3D
Downsample3D
InflatedConv3d
Mish
PositionalEncoding
ResnetBlock3D
TemporalTransformer3DModel
TemporalTransformer3DModelOutput
TemporalTransformerBlock
Transformer2DModel
Transformer2DModelOutput
Transformer3DModel
Transformer3DModelOutput
UNet2DConditionOutput
UNet3DConditionModel
UNet3DConditionOutput
UNetMidBlock3DCrossAttn
UpBlock3D
Upsample3D
VanillaTemporalModule
VersatileAttention
Common Functions in a Shared Class:
The only shared class with common functions identified between the files is UNet3DConditionModel.
drafted
Build the core neural network modules (VAE, ReferenceNet, feature extractor, etc.), and ensure each module is independently testable; see the smoke-test sketch below.
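A minimal shape-contract smoke test so each module can be exercised in isolation. The shapes and the VAEEncoder import path are illustrative placeholders, not the actual Emote-hack API:

```python
import torch

def smoke_test(module: torch.nn.Module, input_shape, expected_out_dims):
    """Run a module on random input and check the output rank."""
    x = torch.randn(*input_shape)
    with torch.no_grad():
        y = module(x)
    assert y.ndim == expected_out_dims, f"unexpected output shape {y.shape}"
    return y.shape

# e.g., assuming the VAE encoder maps images to 4D latents:
# from VAEEncoder import VAEEncoder  # hypothetical import path
# smoke_test(VAEEncoder(), (2, 3, 256, 256), expected_out_dims=4)
```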