showlab / Show-1

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
https://showlab.github.io/Show-1/

Questions about the 3D UNet architecture #26

Closed: Ground-A-Video closed this issue 10 months ago

Ground-A-Video commented 11 months ago

Hi, while reading the Show-1 paper and the code repository, I ran into some questions about the 3D UNet architecture, especially the first UNet (the one in the keyframe generation pipeline). It would be incredibly helpful if the authors could provide some clarification.

  1. First, in Section 3.2 of the paper (Turn Image UNet to Video), the middle of the paragraph says: "Additionally following each self and cross-attention layer, we implement a temporal attention layer." Does that mean the attention layers are ordered "Self-Attn -> Temporal Attn -> Cross-Attn -> Temporal Attn"?

  2. About Fig. 4 (a): which 3D UNet does this figure depict? Is it the 3D UNet of the interpolation and super-resolution pipelines? It does not precisely illustrate the 3D UNet of the keyframe generation pipeline, right?

  3. While reading the code, especially showone/models/unet_3d_blocks.py, I see that SimpleCrossAttnUpBlock3D and SimpleCrossAttnDownBlock3D are used to construct the 3D UNet (UNet3DConditionModel), right? Reading the forward method of those ~UpBlock3D classes, the logic seems to be 'ResnetBlock2D --> TemporalConvLayer --> Attention --> TransformerTemporalModel', where TransformerTemporalModel contains two sequential Attentions (see the sketch after this list). So in one (Down/Up/Mid) block there are three attentions in total; let's call them attn1, attn2, and attn3 in order. As far as I understand the code, attn1 is the cross-attention. Then what are attn2 and attn3? Which one is the spatial self-attention and which one is the temporal attention mentioned in the paper? Or is the order 'Cross-Attn --> Temporal Attn --> Temporal Attn'?
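
To make question 3 concrete, here is a minimal sketch of how I currently read the per-block forward pass. All submodule names are my own labels, stubbed with `nn.Identity` so the sketch runs; this is my reading of the code, not the actual implementation:

```python
import torch.nn as nn

class BlockSketch(nn.Module):
    """My pseudocode reading of one (Down/Up/Mid) block in
    showone/models/unet_3d_blocks.py; the attribute names here
    (resnet, temp_conv, attn, temp_attn) are hypothetical labels."""
    def __init__(self):
        super().__init__()
        self.resnet = nn.Identity()     # stands in for ResnetBlock2D (spatial)
        self.temp_conv = nn.Identity()  # stands in for TemporalConvLayer
        self.attn = nn.Identity()       # the single Attention ("attn1"?)
        self.temp_attn = nn.Identity()  # TransformerTemporalModel: contains two
                                        # sequential attentions, my "attn2"/"attn3"

    def forward(self, hidden_states):
        hidden_states = self.resnet(hidden_states)
        hidden_states = self.temp_conv(hidden_states)
        hidden_states = self.attn(hidden_states)
        hidden_states = self.temp_attn(hidden_states)
        return hidden_states
```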

Thank you in advance.

junhaozhang98 commented 11 months ago

Thanks for your question, and apologies for the late reply. I am currently busy with final exams and other important deadlines.

1) Self-Attn -> Cross-Attn -> Temporal Attn

2) The overall UNet architectures for the keyframe, interpolation, and super-resolution stages are almost the same, differing only in the number of layers, the conv_in channels, etc.

3) 1) SimpleCrossAttnDownBlock3D is used for the keyframe, interpolation, and first super-resolution UNets; CrossAttnDownBlock3D is used for the second super-resolution UNet. 2) ResnetBlock2D --> TemporalConvLayer --> Attention (attn1: self-attention, attn2: cross-attention) --> TransformerTemporalModel (all self-temporal attentions). A sketch pulling these together follows.
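
Summarizing the answers above as a minimal sketch (the key names and layout here are my own labels, not config keys from the repository; the real block definitions live in showone/models/unet_3d_blocks.py and the per-pipeline UNet configs):

```python
# Down-block type per pipeline stage (answer 3, point 1).
DOWN_BLOCK_TYPE = {
    "keyframes": "SimpleCrossAttnDownBlock3D",
    "interpolation": "SimpleCrossAttnDownBlock3D",
    "super_resolution_stage_1": "SimpleCrossAttnDownBlock3D",
    "super_resolution_stage_2": "CrossAttnDownBlock3D",
}

# Attention ordering in the keyframe UNet (answer 1):
#   Self-Attn -> Cross-Attn -> Temporal Attn
#
# Computation order inside one block (answer 3, point 2):
#   ResnetBlock2D -> TemporalConvLayer
#     -> Attention (attn1: self-attention, attn2: cross-attention)
#     -> TransformerTemporalModel (temporal self-attentions only)
```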

Ground-A-Video commented 11 months ago

Thank you for your answer, @junhaozhang98!

Regarding your answer to question 3, point 2): if I examine the structure of SimpleCrossAttnDownBlock3D as an example, the order is "ResnetBlock2D --> TemporalConvLayer --> Attention --> TransformerTemporalModel", as you answered. However, isn't the 'Attention' part a single Attention object? As far as I can tell, it has neither 'self.attn1' nor 'self.attn2'. The two attentions (attn1, attn2) live in the later TransformerTemporalModel, not in the Attention object. (Maybe I am reading the code wrong; the snippet below is how I would check.)
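
One way to settle this, assuming a diffusers-style module layout where attention submodules carry attribute names like `attn`, `attn1`, `attn2`, would be to list every attention-like submodule of an instantiated block. The helper name `list_attention_names` is mine:

```python
import torch.nn as nn

def list_attention_names(module: nn.Module) -> None:
    # Print the qualified name and class of every submodule whose attribute
    # name starts with "attn", so we can see whether attn1/attn2 sit inside
    # the Attention object or inside the TransformerTemporalModel.
    for name, sub in module.named_modules():
        leaf = name.rsplit(".", 1)[-1]
        if leaf.startswith("attn"):
            print(name, "->", type(sub).__name__)

# e.g. list_attention_names(down_block) on an instantiated
# SimpleCrossAttnDownBlock3D should show where attn1/attn2 actually live.
```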