tumurzakov / AnimateDiff

AnimationDiff with train
Apache License 2.0

Are you aware of any academic torrents for the training set? #18

Open johndpope opened 6 months ago

johndpope commented 6 months ago

did you successfully train on the 10M web videos?

I'm looking at exploring this suggestion for a 4-line code change: https://github.com/guoyww/AnimateDiff/issues/308

tumurzakov commented 6 months ago

Hello, have you seen that repo? https://github.com/xiaobai1217/Awesome-Video-Datasets

tumurzakov commented 6 months ago

Interesting idea with the pointwise conv3d. I'm trying to play with it too.

tumurzakov commented 5 months ago

https://github.com/hpcaitech/Open-Sora/blob/main/docs/datasets.md

johndpope commented 5 months ago

My post on reddit is blowing up. I guess we're going to find out pretty soon whether Claude 3 is a fraud or not. https://www.reddit.com/r/StableDiffusion/comments/1bh970h/claude_3_thinks_4_lines_of_code_changes_will/

tumurzakov commented 5 months ago

I made it work only with this setup:

import torch.nn as nn

class InflatedConv3d(nn.Module):
    # Depthwise-separable replacement: a depthwise Conv3d followed by a 1x1x1 pointwise Conv3d.
    # dilation, groups, bias, padding_mode, device and dtype are accepted only to keep the
    # original InflatedConv3d signature; they are not forwarded to the inner convs.
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0,
                 dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None):
        super().__init__()
        self.kernel_size = (kernel_size, kernel_size, kernel_size)
        self.stride = (1, stride, stride)
        self.padding = (padding, padding, padding)
        self.depthwise_conv = nn.Conv3d(in_channels, in_channels, self.kernel_size,
                                        self.stride, self.padding, groups=in_channels)
        self.pointwise_conv = nn.Conv3d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)
        return x
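
A quick shape check (my own sketch, not from the repo) to confirm it stays a drop-in for the 5D video tensors the UNet passes around:

```python
import torch

conv = InflatedConv3d(in_channels=4, out_channels=320, kernel_size=3, padding=1)
x = torch.randn(1, 4, 16, 64, 64)    # (batch, channels, frames, height, width)
print(conv(x).shape)                  # torch.Size([1, 320, 16, 64, 64])
```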

Currently I get noise, because there are no parameters for the newly added conv layers. Now I need to retrain these modules in the UNet.

This means we can't use SD 1.5 checkpoints, because all those conv layers live in the UNet, not in the motion_module. The motion module is a transformer architecture.
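
As a rough sketch of that retraining step (my own illustration; `unet` is a placeholder for the loaded UNet and assumes its convs were built with the InflatedConv3d above), one could freeze everything except the new conv modules:

```python
import torch

# Freeze the whole UNet, then unfreeze only the newly added depthwise/pointwise convs.
for p in unet.parameters():
    p.requires_grad = False

new_conv_params = []
for module in unet.modules():
    if isinstance(module, InflatedConv3d):
        for p in module.parameters():
            p.requires_grad = True
            new_conv_params.append(p)

optimizer = torch.optim.AdamW(new_conv_params, lr=1e-4)
```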

tumurzakov commented 5 months ago

Ordinary 0:00:41.196946
Optimized 0:00:33.205643

I got these results - roughly 25% more efficient at inference.
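
Those numbers look like Python `timedelta`s; a minimal way to take that kind of measurement (my own sketch - `pipeline` and its arguments are placeholders, not an API from this repo):

```python
from datetime import datetime
import torch

torch.cuda.synchronize()              # make sure pending GPU work is finished (GPU only)
start = datetime.now()
result = pipeline(prompt="a dog running", num_frames=16)   # hypothetical inference call
torch.cuda.synchronize()
print(datetime.now() - start)         # prints e.g. 0:00:33.205643
```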

johndpope commented 5 months ago

In part 2 - the groupnorm - it may help because it does away with tensor resizing and takes a shortcut by using a view instead.

```python
class InflatedGroupNorm(nn.GroupNorm):
    def __init__(self, num_groups, num_channels, eps=1e-5, affine=True):
        super().__init__(num_groups, num_channels, eps, affine)

    def forward(self, x):
        batch_size, num_channels, video_length, height, width = x.size()
        x = x.view(batch_size, num_channels, -1)
        x = super().forward(x)
        x = x.view(batch_size, num_channels, video_length, height, width)
        return x
```

tumurzakov commented 5 months ago

Nope, it is too costly to train the conv layers on a single GPU. I trained for 50,000 steps and am not even close to any recognizable picture. The conv layers in SD are the feature extractors, and we completely replace them, so this is essentially a task on the scale of training a complete SD model...

neph1 commented 5 months ago

Would it be possible to initialize them with the same weights as the layers they're supposed to replace? Then it'd be more of a finetuning task.

tumurzakov commented 5 months ago

Initialize? I don't think so, not directly. Maybe there is some correlation between the conv2d weights and the two conv3d layers, but it looks like a problem from asymmetric cryptography (the private/public key relationship). I think it could be distilled from SD, but that would need a second GPU that I don't have. Even if I try to distill, it will require enormous effort for a single dev.
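
To make the distillation idea concrete, here is a rough per-layer sketch (my own illustration; the channel sizes and random inputs are placeholders, and in practice you would distill on real UNet activations):

```python
import torch
import torch.nn as nn

# Frozen pretrained conv acts as the teacher; the depthwise+pointwise pair is the student.
teacher = nn.Conv2d(320, 320, kernel_size=3, padding=1)           # stand-in for an SD conv layer
student = nn.Sequential(
    nn.Conv2d(320, 320, kernel_size=3, padding=1, groups=320),    # depthwise
    nn.Conv2d(320, 320, kernel_size=1),                           # pointwise
)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(8, 320, 32, 32)                               # fake activations
    with torch.no_grad():
        target = teacher(x)
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```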

I got only a 25% performance improvement. That's good, but not 100x, and it's expected, because the main computation in SD is in the attention layers, which are O(n**2).

Fewer parameters in the pointwise conv could lose out to an ordinary conv2d at feature extraction, so the model could perform worse overall.

I'd like to have the option of two transformer blocks in AnimateDiff, but it's too costly in VRAM, and I have to choose between video length and transformer block count.

Anyway, we'll all end up inferring our videos on Sora-like models, which is very promising.

johndpope commented 5 months ago

I asked Claude 3 (maybe the question wasn't clear and it's hallucinating). I'll give it another shot later.

this is wrong-

Yes, it is possible to initialize the depthwise separable convolution layers with the same weights as the layers they are supposed to replace. This way, you can start with a model that has the same initial behavior as the original model and then fine-tune it to adapt to the depthwise separable convolution architecture. To achieve this, you can modify the DepthwiseSeparableInflatedConv3d class to take the weights from the corresponding InflatedConv3d layer and use them to initialize the depthwise and pointwise convolution layers. Here's an updated version of the DepthwiseSeparableInflatedConv3d class that initializes the weights from the original InflatedConv3d layer:

class DepthwiseSeparableInflatedConv3d(nn.Module):
    def __init__(self, inflated_conv3d_layer):
        super().__init__()
        in_channels = inflated_conv3d_layer.in_channels
        out_channels = inflated_conv3d_layer.out_channels
        kernel_size = inflated_conv3d_layer.kernel_size
        stride = inflated_conv3d_layer.stride
        padding = inflated_conv3d_layer.padding

        self.depthwise_conv = InflatedConv3d(in_channels, in_channels, kernel_size, stride, padding, groups=in_channels)
        self.pointwise_conv = InflatedConv3d(in_channels, out_channels, kernel_size=1)

        # Initialize the depthwise convolution weights from the original InflatedConv3d layer
        depthwise_conv_weights = inflated_conv3d_layer.weight.clone()
        depthwise_conv_weights = depthwise_conv_weights.view(in_channels, 1, *kernel_size)
        self.depthwise_conv.weight = nn.Parameter(depthwise_conv_weights)
        if inflated_conv3d_layer.bias is not None:
            self.depthwise_conv.bias = nn.Parameter(inflated_conv3d_layer.bias.clone())

        # Initialize the pointwise convolution weights from the original InflatedConv3d layer
        pointwise_conv_weights = torch.ones(out_channels, in_channels, 1, 1) / in_channels
        self.pointwise_conv.weight = nn.Parameter(pointwise_conv_weights)

    def forward(self, x):
        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)
        return x

In this updated implementation:

- The DepthwiseSeparableInflatedConv3d class takes an instance of the InflatedConv3d layer as input.
- It extracts the relevant parameters from the InflatedConv3d layer, such as in_channels, out_channels, kernel_size, stride, and padding.
- The depthwise convolution weights are initialized by cloning the weights from the original InflatedConv3d layer and reshaping them to match the depthwise convolution's weight shape.
- If the original InflatedConv3d layer has a bias term, it is also cloned and used as the bias for the depthwise convolution.
- The pointwise convolution weights are initialized with ones divided by the number of input channels, which helps maintain the overall scale of the activations.

To use this modified DepthwiseSeparableInflatedConv3d class, you need to pass the corresponding InflatedConv3d layer instance when creating the optimized model. Here's an example:

optimized_layers = []
for layer in original_layers:
    optimized_layer = DepthwiseSeparableInflatedConv3d(layer)
    optimized_layers.append(optimized_layer)
optimized_model = nn.Sequential(*optimized_layers)

By initializing the depthwise separable convolution layers with the weights from the original InflatedConv3d layers, you can start with a model that behaves similarly to the original model. You can then fine-tune this model to adapt to the depthwise separable convolution architecture and potentially achieve better performance or efficiency.
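
For what it's worth, one concrete reason this initialization can't work as written (my own check, not from the thread): a Conv2d weight has shape (out_channels, in_channels, kH, kW), so viewing it as (in_channels, 1, kH, kW) fails whenever out_channels != 1, because the element counts don't match.

```python
import torch

w = torch.randn(320, 4, 3, 3)   # e.g. an SD conv_in-style weight: (out, in, kH, kW)
try:
    w.view(4, 1, 3, 3)          # the reshape the snippet above attempts
except RuntimeError as e:
    print(e)                    # 11520 elements cannot be viewed as (4, 1, 3, 3)
```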

UPDATE

class DepthwiseSeparableInflatedConv3d(nn.Module):
    def __init__(self, inflated_conv3d_layer):
        super().__init__()
        in_channels = inflated_conv3d_layer.in_channels
        out_channels = inflated_conv3d_layer.out_channels
        kernel_size = inflated_conv3d_layer.kernel_size
        stride = inflated_conv3d_layer.stride
        padding = inflated_conv3d_layer.padding

        self.depthwise_conv = nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, groups=in_channels)
        self.pointwise_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

        # Initialize the depthwise convolution weights from the original InflatedConv3d layer
        depthwise_conv_weights = inflated_conv3d_layer.weight.clone()
        depthwise_conv_weights = depthwise_conv_weights.view(in_channels, 1, *kernel_size)
        self.depthwise_conv.weight = nn.Parameter(depthwise_conv_weights)
        if inflated_conv3d_layer.bias is not None:
            self.depthwise_conv.bias = nn.Parameter(inflated_conv3d_layer.bias.clone())

        # Initialize the pointwise convolution weights from the original InflatedConv3d layer
        pointwise_conv_weights = torch.ones(out_channels, in_channels, 1, 1) / in_channels
        self.pointwise_conv.weight = nn.Parameter(pointwise_conv_weights)

    def forward(self, x):
        video_length = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")

        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)

        x = rearrange(x, "(b f) c h w -> b c f h w", f=video_length)
        return x

In this updated implementation:

- The DepthwiseSeparableInflatedConv3d class takes an instance of the InflatedConv3d layer as input.
- It extracts the relevant parameters from the InflatedConv3d layer, such as in_channels, out_channels, kernel_size, stride, and padding.
- The depthwise and pointwise convolution layers are defined using nn.Conv2d instead of InflatedConv3d, because the rearrangement of dimensions is handled explicitly in the forward method.
- The depthwise convolution weights are initialized by cloning the weights from the original InflatedConv3d layer and reshaping them to match the depthwise convolution's weight shape.
- If the original InflatedConv3d layer has a bias term, it is also cloned and used as the bias for the depthwise convolution.
- The pointwise convolution weights are initialized with ones divided by the number of input channels, which helps maintain the overall scale of the activations.
- In the forward method, the input tensor is rearranged to "(b f) c h w" format before applying the depthwise and pointwise convolutions. After the convolutions, the tensor is rearranged back to "b c f h w" format.

To use this modified DepthwiseSeparableInflatedConv3d class, you can pass the corresponding InflatedConv3d layer instance when creating the optimized model, similar to the previous example:

optimized_layers = []
for layer in original_layers:
    optimized_layer = DepthwiseSeparableInflatedConv3d(layer)
    optimized_layers.append(optimized_layer)
optimized_model = nn.Sequential(*optimized_layers)

By initializing the depthwise and pointwise convolution layers with the weights from the original InflatedConv3d layers, the optimized model will start with the same weights as the original model. You can then fine-tune the optimized model to adapt to the depthwise separable convolution architecture.

UPDATE 3 - I asked if this is bad for performance (though if this were a one-off conversion, these points would be moot).

Initializing the depthwise separable convolution layers with the weights from the original InflatedConv3d layers can have both advantages and disadvantages in terms of performance. Let's discuss them.

Advantages:

- Faster convergence: by starting with the weights from the original model, the optimized model may converge faster during fine-tuning. The initialized weights provide a good starting point, and the model can adapt to the depthwise separable convolution architecture more quickly.
- Preserved initial behavior: initializing the weights from the original model ensures that the optimized model starts with the same behavior as the original model. This can be beneficial if the original model already performs well and you want to maintain its initial performance while adapting to the new architecture.
- Reduced training time: with a good initialization, the fine-tuning process may require fewer iterations or epochs to reach the desired performance level. This can lead to reduced training time compared to training the optimized model from scratch with random initialization.

Disadvantages:

- Suboptimal weights: the weights from the original InflatedConv3d layers are optimized for the standard convolution operation, not specifically for the depthwise separable convolution architecture. While these weights can serve as a good starting point, they may not be optimal for the depthwise and pointwise convolutions, and the model may need significant fine-tuning to adapt to the new architecture and achieve the best performance.
- Limited flexibility: initializing the weights from the original model may bias the optimized model towards the original behavior, making it harder to explore different and potentially better weight configurations during fine-tuning.
- Increased computational overhead: initializing the weights from the original model and then fine-tuning adds overhead compared to training the optimized model from scratch. However, this may be offset by the potential benefits of faster convergence and reduced training time.

Considering these factors, whether initializing the weights from the original model is a good idea depends on your specific use case and requirements:

- If you have a well-performing original model and want to quickly adapt it to the depthwise separable convolution architecture while maintaining its initial behavior, initializing the weights from the original model can be a good approach.
- If you have sufficient computational resources and time, and you want to give the optimized model more flexibility to learn the best weights for the new architecture from scratch, random initialization may be preferred.

Ultimately, the choice between initializing the weights from the original model or using random initialization depends on your goals, resources, and the specific characteristics of your problem. It's often a good idea to experiment with both approaches and compare their performance to determine the best strategy for your particular use case.

UPDATE 4 - currently testing to see the disk savings...


import torch
import torch.nn as nn
from einops import rearrange

# Define the DepthwiseSeparableInflatedConv3d class
class DepthwiseSeparableInflatedConv3d(nn.Module):
    def __init__(self, inflated_conv3d_layer):
        super().__init__()
        in_channels = inflated_conv3d_layer.in_channels
        out_channels = inflated_conv3d_layer.out_channels
        kernel_size = inflated_conv3d_layer.kernel_size
        stride = inflated_conv3d_layer.stride
        padding = inflated_conv3d_layer.padding

        self.depthwise_conv = nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, groups=in_channels)
        self.pointwise_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

        # Initialize the depthwise convolution weights from the original InflatedConv3d layer
        depthwise_conv_weights = inflated_conv3d_layer.weight.clone()
        depthwise_conv_weights = depthwise_conv_weights.view(in_channels, 1, *kernel_size)
        self.depthwise_conv.weight = nn.Parameter(depthwise_conv_weights)
        if inflated_conv3d_layer.bias is not None:
            self.depthwise_conv.bias = nn.Parameter(inflated_conv3d_layer.bias.clone())

        # Initialize the pointwise convolution weights from the original InflatedConv3d layer
        pointwise_conv_weights = torch.ones(out_channels, in_channels, 1, 1) / in_channels
        self.pointwise_conv.weight = nn.Parameter(pointwise_conv_weights)

    def forward(self, x):
        video_length = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")

        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)

        x = rearrange(x, "(b f) c h w -> b c f h w", f=video_length)
        return x

# Load the pretrained AnimateDiff model
original_model = UNet3DConditionModel.from_pretrained("path/to/animatediff/checkpoint")

# Create an optimized model with DepthwiseSeparableInflatedConv3d layers
optimized_model = UNet3DConditionModel(
    sample_size=original_model.sample_size,
    in_channels=original_model.in_channels,
    out_channels=original_model.out_channels,
    center_input_sample=original_model.center_input_sample,
    flip_sin_to_cos=original_model.flip_sin_to_cos,
    freq_shift=original_model.freq_shift,
    down_block_types=original_model.down_block_types,
    up_block_types=original_model.up_block_types,
    block_out_channels=original_model.block_out_channels,
    layers_per_block=original_model.layers_per_block,
    downsample_padding=original_model.downsample_padding,
    mid_block_scale_factor=original_model.mid_block_scale_factor,
    act_fn=original_model.act_fn,
    norm_num_groups=original_model.norm_num_groups,
    norm_eps=original_model.norm_eps,
    cross_attention_dim=original_model.cross_attention_dim,
    attention_head_dim=original_model.attention_head_dim,
)

# Replace InflatedConv3d layers with DepthwiseSeparableInflatedConv3d layers
for name, module in optimized_model.named_modules():
    if isinstance(module, InflatedConv3d):
        inflated_conv3d_layer = getattr(original_model, name)
        depthwise_separable_conv3d_layer = DepthwiseSeparableInflatedConv3d(inflated_conv3d_layer)
        setattr(optimized_model, name, depthwise_separable_conv3d_layer)

# Save the checkpoint of the optimized model
torch.save(optimized_model.state_dict(), "path/to/optimized/checkpoint.pth")
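
For what it's worth, dotted module paths like "down_blocks.0.resnets.0.conv1" can't be read or assigned with plain getattr/setattr, so the replacement loop above won't reach nested layers. Here is a sketch of a version that walks nested names (my own suggestion, reusing the classes and models defined above):

```python
# Replace nested InflatedConv3d modules by resolving dotted names explicitly.
for name, module in list(optimized_model.named_modules()):
    if isinstance(module, InflatedConv3d):
        original_layer = original_model.get_submodule(name)
        replacement = DepthwiseSeparableInflatedConv3d(original_layer)
        parent_name, _, child_name = name.rpartition(".")
        parent = optimized_model.get_submodule(parent_name) if parent_name else optimized_model
        setattr(parent, child_name, replacement)
```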

UPDATE @neph1 I pushed this code, which attempts to load AnimateDiff from a checkpoint and save it with the optimized model. I also asked Claude to attempt to make the given modules (SpatialAttentionModule / TemporalAttentionModule) compatible with the new DepthwiseSeparableInflatedConv3d restored from the original weights: https://github.com/johndpope/animate/blob/main/animate-anyone/convert_depthwise.py

I've got some other fires I need to attend to, then I'll switch back.

UPDATE 6 - in an attempt to load the pretrained weights and hot-swap them for the DepthwiseSeparableInflatedConv3d, I pushed some code: https://github.com/johndpope/animate/blob/main/animate-anyone/convert_depthwise.py. The AI thinks it can swap the layer out if it can detect the InflatedConv3d layer, which makes total sense to me, but it fails. I'm pausing this investigation to return to some other work (recreating the EMO paper using ChatGPT / Claude 3): https://github.com/johndpope/Emote-hack/