sghong977 commented 3 months ago

Read LDM first: https://kimjy99.github.io/%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0/ldm/

https://bytez.com/docs/arxiv/2302.03011/paper

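I plan to separate context and structure, and I'm reading this paper because it uses depth to preserve structure
No mention of training time, inference time, or GPU cost
It's an industry paper, so there is no official code release, but since it is so widely cited, someone implemented it and published the code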

Video Editing

Guide: Image or Text

sghong977 commented 3 months ago

ε [14], v-parameterization [46]?

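[14] Denoising diffusion probabilistic models.
- Originally, the model was trained with a loss defined as the MSE in ε-space
[46] Progressive distillation for fast sampling of diffusion models.
- A paper about reducing the number of sampling steps
- Presents a procedure that distills the behavior of an N-step DDIM sampler of a pretrained diffusion model into a new N/2-step model with almost no loss in sample quality
- When the SNR alpha_t^2 / sigma_t^2 approaches 0, a small change in the model's ε output has a large effect on the implied x. To compensate, the paper considers predicting x directly, predicting both x and ε and interpolating between them, or predicting v = alpha_t * ε - sigma_t * x (the v-parameterization)
[13] Imagen video: High definition video generation with diffusion models; reports that v-parameterization improves color consistency
Explanation: https://kimjy99.github.io/%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0/imagen-video/
- Apparently it is used during distillation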

This paper also uses v-parameterization.
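To make the relation between x, ε and v concrete, here is a minimal scalar sketch. It assumes a variance-preserving schedule (alpha^2 + sigma^2 = 1) and the definition v = alpha * ε - sigma * x from [46]; the function names and the numbers are made up for illustration.

```python
import math

def v_target(x, eps, alpha, sigma):
    """v-parameterization target: v = alpha * eps - sigma * x."""
    return alpha * eps - sigma * x

def x_from_v(z, v, alpha, sigma):
    """Recover the clean sample from the noisy sample z and v."""
    return alpha * z - sigma * v

def eps_from_v(z, v, alpha, sigma):
    """Recover the noise from the noisy sample z and v."""
    return sigma * z + alpha * v

# Assumption: variance-preserving schedule, so alpha^2 + sigma^2 = 1.
phi = 1.2                      # arbitrary angle, illustrative only
alpha, sigma = math.cos(phi), math.sin(phi)

x, eps = 0.7, -0.3             # toy "clean sample" and "noise"
z = alpha * x + sigma * eps    # forward diffusion of x
v = v_target(x, eps, alpha, sigma)

x_rec = x_from_v(z, v, alpha, sigma)
eps_rec = eps_from_v(z, v, alpha, sigma)
```

Both recoveries only divide by nothing, so they stay well behaved even when alpha is close to 0, which is the low-SNR case where converting an ε prediction back to x becomes unstable.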

sghong977 commented 3 months ago

Model architecture

Basics

UNet and autoencoder (downsamples to 1/8 of the size): applied to each image individually
- Built from spatial transformer blocks
- Two kinds: spatial self-attention and a cross-attention block. The latter uses CLIP embeddings as k, v
latent diffusion model, v-parameterization

Spatio-temporal Latent diffusion

Temporal layers are added. The paper says they are used only when processing video
- It is not a 3D conv; like a separable conv, it is a 2D spatial conv + a 1D temporal conv
- Reportedly done in a way similar to this paper: https://kimjy99.github.io/%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0/make-a-video/
Learnable positional encodings of the frame index are fed into the temporal transformer blocks
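A rough way to see why a 2D spatial conv + 1D temporal conv is attractive is to count weights. The sketch below compares a full 3D conv with the factorized pair. The channel width 320 and kernel size 3 are my own illustrative choices, not numbers from the paper, and I assume the temporal conv keeps the channel count unchanged.

```python
def conv3d_params(c_in, c_out, k):
    """Weights + biases of a full k x k x k 3D convolution."""
    return c_in * c_out * k ** 3 + c_out

def factorized_params(c_in, c_out, k):
    """Weights + biases of a k x k spatial conv followed by a
    length-k temporal conv (assumed here to map c_out -> c_out)."""
    spatial = c_in * c_out * k * k + c_out
    temporal = c_out * c_out * k + c_out
    return spatial + temporal

# Hypothetical layer size, for illustration only.
full = conv3d_params(320, 320, 3)
split = factorized_params(320, 320, 3)
ratio = split / full
```

With these toy numbers the factorized version needs well under half the parameters, and the spatial half can be initialized from a pretrained image model, which matches how the training section below starts from Stable Diffusion.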

Structure, Contents

Contents: text
- During training: encode a random frame in each input video with CLIP => this is used in place of text
- For text-based editing at inference, a prior model is trained so that image embeddings can be sampled from text embeddings
Structure: depth
- Why edges are not used: they retain details such as texture. The paper says depth is used because it wants simpler structural information than that
- Training blurs randomly with ts in the range 0~Ts, and ts can be controlled at inference
- First, extract a depth map for every frame with the MiDaS DPT-Large model
- Then apply ts iterations of blurring and downsampling, and resample back to the resolution of the RGB frame
- After that, encode it with the network E
- The input zt and the depth latent computed above are concatenated and fed into the UNet / a sinusoidal embedding of ts is also turned into 4 channels and fed in
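The "ts iterations of blurring and downsampling, then resample" step can be sketched on a toy depth map as follows. The notes do not say which blur kernel, downsampling factor, or resampling filter is used, so the 3x3 box blur, factor 2, and nearest-neighbor resize here are all assumptions, as are the function names.

```python
def box_blur(img):
    """3x3 box blur with edge clamping (assumed kernel)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out

def downsample2(img):
    """Keep every second row and column (assumed factor 2)."""
    return [row[::2] for row in img[::2]]

def resize_nearest(img, h, w):
    """Nearest-neighbor resize back to h x w."""
    sh, sw = len(img), len(img[0])
    return [[img[y * sh // h][x * sw // w] for x in range(w)]
            for y in range(h)]

def degrade_depth(depth, t_s):
    """Apply t_s rounds of blur + downsample, then restore the size."""
    h, w = len(depth), len(depth[0])
    out = depth
    for _ in range(t_s):
        if len(out) < 2 or len(out[0]) < 2:
            break  # nothing left to downsample
        out = downsample2(box_blur(out))
    return resize_nearest(out, h, w)

# Toy 8x8 depth map with a vertical step edge.
depth = [[0.0 if x < 4 else 1.0 for x in range(8)] for y in range(8)]
sharp = degrade_depth(depth, 0)    # t_s = 0 keeps the map as it is
coarse = degrade_depth(depth, 3)   # t_s = 3 collapses it to one value
```

A larger ts throws away more structure, so at inference it acts as a knob for how closely the output follows the input video's geometry.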
Conditioning method
- Structure: it is defined per pixel location, so concat
- Content: text has no fixed spatial position, so cross-attention is used
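For the concat path, a small sketch of how a scalar ts could become 4 extra channels. The standard sinusoidal embedding is used, but the frequency base (10000), the broadcast to constant H x W maps, and the channel counts of 4 for zt and for the depth latent are assumptions on my part; the notes only state that the ts embedding has 4 channels.

```python
import math

def sinusoidal_embedding(t, dim=4, max_period=10000.0):
    """Standard sin/cos embedding of a scalar t into `dim` values."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def broadcast_to_maps(values, h, w):
    """Assumed: each embedding value becomes a constant h x w channel."""
    return [[[v] * w for _ in range(h)] for v in values]

def unet_in_channels(latent_ch=4, depth_latent_ch=4, ts_ch=4):
    """Channel count after concatenating zt, depth latent, ts maps."""
    return latent_ch + depth_latent_ch + ts_ch

emb = sinusoidal_embedding(3)            # embedding for ts = 3
ts_maps = broadcast_to_maps(emb, 2, 2)   # tiny 2x2 spatial size for demo
channels = unet_in_channels()
```

The content embedding does not appear in this count because it enters through cross-attention rather than through the input channels.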
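Details
- Earlier work on classifier-free diffusion guidance has the model also estimate the unconditional prediction, and introduces a guidance scale w to control how strongly the condition is applied.
- This paper uses the same method, but says it interprets the "condition" along the temporal axis.
- Honestly... it is better to look at an implementation. This is not a concept I know, so the paper doesn't click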
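Since the guidance part did not click from the text, here is a minimal sketch. `cfg` is the usual classifier-free guidance combination. `temporal_cfg` is only my reading of the "condition along the temporal axis" idea: the per-frame prediction (temporal layers off) is treated as the baseline and the video prediction is extrapolated away from it with a separate scale. That second function is an assumption to be checked against the paper, and all names are hypothetical.

```python
def cfg(uncond, cond, w):
    """Classifier-free guidance: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for u, c in zip(uncond, cond)]

def temporal_cfg(frame_uncond, video_uncond, video_cond, w_t, w):
    """Assumed temporal variant: start from the per-frame unconditional
    prediction, push toward the video model with scale w_t, then apply
    the usual condition guidance with scale w."""
    return [f + w_t * (vu - f) + w * (vc - vu)
            for f, vu, vc in zip(frame_uncond, video_uncond, video_cond)]

# Toy 2-element "predictions".
uncond, cond = [0.0, 1.0], [1.0, 3.0]
plain = cfg(uncond, cond, 1.0)      # w = 1 returns the conditional output
strong = cfg(uncond, cond, 2.0)     # w > 1 extrapolates past it

# With both scales at 1 the temporal variant reduces to the
# conditional video prediction.
same = temporal_cfg([0.0], [1.0], [2.0], 1.0, 1.0)
```

Under this reading, raising w_t above 1 would emphasize what the temporal layers add on top of the per-frame model, i.e. temporal consistency.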

Training

The amount of image and video data and the batch size are enormous, so I will probably never train this myself, but let's take a look

Initialize from the pretrained Stable Diffusion model
Finetune for 15000 iterations so that the condition is the CLIP image embedding instead of the CLIP text embedding. This stage trains on images only
Add the temporal connections and train on both images and video. 75000 iterations.
Add the structure condition as well and train with ts=0. 25000 iterations
Sample ts uniformly at random between 0 and 7 and train for 10000 iterations.
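The staged schedule above, written out as data so the total is easy to see. The stage labels are my own shorthand, and treating "between 0 and 7" as integers with both ends included is an assumption.

```python
import random

# (stage description, iterations) taken from the list above.
STAGES = [
    ("finetune on CLIP image embeddings, images only", 15000),
    ("add temporal connections, images + video", 75000),
    ("add structure condition, ts = 0", 25000),
    ("structure condition, ts sampled uniformly", 10000),
]

total_iters = sum(n for _, n in STAGES)

def sample_ts(rng, t_max=7):
    """Assumed: ts is an integer drawn uniformly from 0..t_max inclusive."""
    return rng.randint(0, t_max)

rng = random.Random(0)
samples = [sample_ts(rng) for _ in range(1000)]
```

Only the last 10000 of the 125000 iterations see blurred depth, so most of the budget goes into the temporal layers.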

At least the depth model is used as-is without any extra training, which is a relief..

sghong977 commented 3 months ago

Looked at the code, and it's not great.. it is not in a usable form

Let's skim some other papers too

VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation
https://lilianweng.github.io/posts/2024-04-12-diffusion-video/

sghong977 commented 3 months ago

ControlVideo

Info

Extends ControlNet to video, without any finetuning
Usable structure conditions: canny edge, depth
- Does it finetune at all? Can the conditions be used together?

Time

Generating 15 frames: 2 minutes
Generating 100 frames: 10 minutes
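Converting the two timings above to seconds per frame makes them easier to compare. This is plain arithmetic on the numbers in the notes; the helper name is made up.

```python
def seconds_per_frame(frames, minutes):
    """Average generation time per frame, in seconds."""
    return minutes * 60 / frames

short_clip = seconds_per_frame(15, 2)    # 15 frames in 2 minutes
long_clip = seconds_per_frame(100, 10)   # 100 frames in 10 minutes
```

That works out to 8 seconds per frame for the short clip and 6 seconds per frame for the long one, so the per-frame cost is roughly constant.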

sghong977 / Daily_AIML

[Paper Review] Structure and Content-Guided Video Synthesis with Diffusion Models #37
