sail-sg / MDT

Masked Diffusion Transformer is the SOTA for image synthesis. (ICCV 2023)
Apache License 2.0

Training problem #34

Closed CFOP-xyn closed 5 months ago

CFOP-xyn commented 5 months ago

Hello author, I would like to ask a basic question. I am currently in the learning stage, and due to GPU resource limitations I cannot put the VAE and MDT on the GPU at the same time. If I first encode the training images (3, 256, 256) with the VAE and save the resulting latents, and then train MDT on those latents, only MDT needs to be on the GPU. Is this approach feasible? Thanks!

gasvn commented 5 months ago

It's possible, but you will need extra storage to save the precomputed VAE features.
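The precompute-and-cache workflow discussed above can be sketched as follows. This is a minimal, hypothetical sketch: `ToyEncoder` is a stand-in for the real VAE encoder (not the actual model from this codebase), and the `scale` argument corresponds to the latent normalization factor mentioned later in this thread.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn


# Hypothetical stand-in for a real VAE encoder: maps (3, 256, 256)
# images to (4, 32, 32) latents with three strided convolutions.
class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1),   # -> 128x128
            nn.Conv2d(16, 16, 3, stride=2, padding=1),  # -> 64x64
            nn.Conv2d(16, 4, 3, stride=2, padding=1),   # -> 32x32
        )

    def forward(self, x):
        return self.net(x)


@torch.no_grad()
def precompute_latents(encoder, images, out_dir, scale=1.0):
    """Encode each image once and cache the scaled latent to disk,
    so only the diffusion model needs to sit on the GPU at train time."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    encoder.eval()
    for i, img in enumerate(images):
        z = encoder(img.unsqueeze(0)).squeeze(0) * scale
        torch.save(z, out_dir / f"latent_{i:06d}.pt")
    return sorted(out_dir.glob("latent_*.pt"))


enc = ToyEncoder()
imgs = torch.randn(4, 3, 256, 256)               # a toy "dataset"
paths = precompute_latents(enc, imgs, tempfile.mkdtemp())
z = torch.load(paths[0])                          # (4, 32, 32) latent
```

A `Dataset` that loads these `.pt` files then replaces the image dataset during MDT training.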

CFOP-xyn commented 5 months ago

Thank you for your response. I have a training question regarding the MDTv2_S_2 model. When I train directly in pixel space (3×256×256), the loss decreases very quickly (from ~4 to ~0.2); when I train in latent space (first encoding to a 4×32×32 latent with a VAE), the loss barely changes (staying around ~3). Since my dataset consists of remote sensing images, I trained my own VAE for encoding and decoding, and I use a normalization scale of approximately 0.8333 for the latent space. Could this be because the VAE is not trained well enough?
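For context on where a scale like ~0.8333 can come from: one common convention (the same one used to derive Stable Diffusion's 0.18215 factor) is to set the scale to the reciprocal of the standard deviation of the VAE latents, so that scaled latents have roughly unit variance. The sketch below assumes that convention and uses synthetic latents; it is not taken from this codebase.

```python
import torch


def latent_scale_factor(latents):
    # scale = 1 / std of the raw latents, so that scaled latents have
    # approximately unit variance (the Stable Diffusion convention).
    return 1.0 / latents.flatten().std().item()


torch.manual_seed(0)
# Pretend VAE outputs with std ~1.2, so the derived scale is ~1/1.2 ≈ 0.833.
fake_latents = torch.randn(1000, 4, 32, 32) * 1.2
scale = latent_scale_factor(fake_latents)
```

If your real latents' std times your scale is far from 1, the diffusion model sees inputs on a different scale than the noise schedule expects, which can stall training.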

CFOP-xyn commented 5 months ago

[t-SNE-20k] [t-SNE-100k] The two images above show the distributions of latent vectors (4×32×32, flattened to 4096-D) produced by my VAE, visualized with t-SNE dimensionality reduction: one for 20,000 vectors and the other for 100,000 vectors.
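A visualization like the one described can be produced by flattening the latents and running scikit-learn's t-SNE. This is a small self-contained sketch with random stand-in latents; the shapes and sample count are illustrative, not the poster's actual data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in latents: 200 fake samples of shape (4, 32, 32), flattened to 4096-D.
latents = rng.normal(size=(200, 4, 32, 32)).astype(np.float32)
flat = latents.reshape(len(latents), -1)

# Embed into 2-D for plotting; perplexity must stay below the sample count.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(flat)
```

The 2-D `emb` array can then be scatter-plotted to inspect the latent distribution.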

CFOP-xyn commented 5 months ago

Latest training results: the loss still hasn't dropped (it stays around 3), but the sampling results look okay. I suspect the model may simply be memorizing certain images. Is there any way to tell whether the model is memorizing images rather than generating new ones? Thank you! [infer-19-tif2]

gasvn commented 5 months ago

1) In this codebase we don't do that, so the overall loss is not a meaningful metric here. You can instead check the losses at different timesteps, which is more informative. 2) Because of the noise added at each diffusion step, you cannot expect the model to memorize and reproduce the exact same image.
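The first suggestion above, checking losses at different timesteps, can be sketched by bucketing per-sample losses by their sampled timestep. The helper below is a hypothetical sketch, not a function from the MDT codebase; it assumes per-sample losses and integer timesteps in `[0, T)`.

```python
import torch


def loss_by_timestep_bucket(losses, timesteps, num_buckets=10, T=1000):
    """Average per-sample diffusion losses within coarse timestep buckets,
    so you can see whether e.g. high-noise steps dominate the total loss."""
    buckets = torch.zeros(num_buckets)
    counts = torch.zeros(num_buckets)
    idx = (timesteps * num_buckets // T).clamp(max=num_buckets - 1)
    buckets.index_add_(0, idx, losses)
    counts.index_add_(0, idx, torch.ones_like(losses))
    return buckets / counts.clamp(min=1)


# Toy example: four per-sample losses and the timesteps they were drawn at.
losses = torch.tensor([1.0, 2.0, 3.0, 4.0])
ts = torch.tensor([50, 150, 850, 950])
avg = loss_by_timestep_bucket(losses, ts, num_buckets=10)
```

Logging `avg` periodically during training makes it easy to spot whether a flat total loss hides progress at some timesteps and stagnation at others.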

CFOP-xyn commented 5 months ago

Thank you very much, I will check it out.