Closed by CFOP-xyn 5 months ago
It's possible, but you will need more storage to save the precomputed VAE features.
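To get a rough idea of how much extra storage the precomputed features take, here is a minimal sketch. It assumes one uncompressed 4×32×32 latent per image (the shape mentioned in this thread); the function name `latent_cache_bytes` and the 100,000-image count are only illustrative, not part of this codebase.

```python
def latent_cache_bytes(num_images, shape=(4, 32, 32), bytes_per_value=4):
    """Bytes needed to store one uncompressed latent per image.

    bytes_per_value=4 corresponds to float32, 2 to float16.
    """
    values_per_latent = 1
    for dim in shape:
        values_per_latent *= dim
    return num_images * values_per_latent * bytes_per_value


# Illustrative dataset size (an assumption, not a measured number).
num_images = 100_000
fp32_gib = latent_cache_bytes(num_images) / 2**30
fp16_gib = latent_cache_bytes(num_images, bytes_per_value=2) / 2**30
print(f"float32: {fp32_gib:.2f} GiB, float16: {fp16_gib:.2f} GiB")
# prints "float32: 1.53 GiB, float16: 0.76 GiB"
```

This space is needed on top of the original images, since the cache does not replace the dataset.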
Thank you for your response. I have a training question regarding the MDTv2_S_2 model. When I train directly in pixel space (3×256×256), the loss decreases very quickly (from ~4 to ~0.2). When I train in latent space (first encoding each image to a 4×32×32 latent with a VAE), the loss barely changes (it stays around ~3). Since my dataset consists of remote sensing images, I trained the VAE myself for encoding and decoding, and I normalize the latents with a scale of approximately 0.8333. Could the problem be that the VAE is not trained well enough?
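For reference, one way such a scale can be obtained is as the reciprocal of the standard deviation of the encoded latents, so that the scaled latents have roughly unit variance. The thread does not say how 0.8333 was computed, so this is only a sketch under that assumption; `latent_scale_factor` is a hypothetical helper, and the toy values are chosen so the standard deviation is 1.2.

```python
import statistics


def latent_scale_factor(latent_values):
    """Return 1 / std of the flattened latent values.

    Assumption: the scale is chosen so that scaled latents have
    approximately unit standard deviation.
    """
    std = statistics.pstdev(latent_values)
    if std == 0:
        raise ValueError("latents have zero variance; cannot compute a scale")
    return 1.0 / std


# Toy values with standard deviation 1.2 (not real VAE outputs).
toy_latents = [-1.2, 1.2]
print(round(latent_scale_factor(toy_latents), 4))
```

If the scale in use differs a lot from a value computed this way over a large sample of latents, the normalization may be worth rechecking before blaming the VAE itself.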
The two images above show the distribution of the latent vectors (4×32×32, flattened to 4096 dimensions) produced by my VAE, reduced with t-SNE for visualization. One shows 20,000 vectors and the other shows 100,000 vectors.
Latest training results: the loss still hasn't dropped (it stays around 3), but the sampling results look OK. I suspect the model may simply be memorizing certain images. Is there any way to tell whether the model is memorizing images rather than generating new ones? Thank you!
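One common sanity check for memorization (a general technique, not something prescribed in this thread) is to look up the nearest training example for each generated sample and inspect the closest pairs. The sketch below assumes samples and training items are already flattened into equal-length vectors, for example latents or features; `nearest_training_match` is a hypothetical helper and the vectors are toy data.

```python
import math


def nearest_training_match(sample, training_set):
    """Return (index, distance) of the training vector closest to `sample`.

    Uses Euclidean distance; all vectors must have the same length.
    """
    best_index, best_dist = -1, math.inf
    for i, train_vec in enumerate(training_set):
        dist = math.dist(sample, train_vec)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist


# Toy 2-D vectors standing in for flattened images or latents.
training_set = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
generated = [1.1, 0.9]
index, distance = nearest_training_match(generated, training_set)
print(index, round(distance, 4))
```

If many generated samples sit much closer to a training item than training items sit to each other, that suggests copying; viewing each sample next to its nearest neighbor is usually more informative than the raw distance.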
1) In this codebase, we didn't do so, so the overall loss is not a meaningful metric here. Consider checking the loss at different timesteps, which is more meaningful.
2) Because of the diffusion process, you cannot expect the model to memorize exact images, since noise is added at each diffusion step.
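Checking the loss at different timesteps can be sketched roughly as follows. This assumes you can log a per-sample loss together with the timestep sampled for it during training; the function name, the 1000-step schedule, and the 10 buckets are illustrative assumptions, not this codebase's API.

```python
from collections import defaultdict


def mean_loss_by_timestep_bucket(timesteps, losses, num_timesteps=1000, num_buckets=10):
    """Average per-sample losses within equal-width timestep buckets.

    Returns {bucket_index: mean_loss} for buckets that received samples.
    """
    if len(timesteps) != len(losses):
        raise ValueError("timesteps and losses must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, loss in zip(timesteps, losses):
        # Map integer timestep t in [0, num_timesteps) to a bucket index.
        bucket = min(t * num_buckets // num_timesteps, num_buckets - 1)
        sums[bucket] += loss
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy values: two low-noise and two high-noise timesteps.
print(mean_loss_by_timestep_bucket([5, 50, 950, 999], [1.0, 0.5, 0.25, 0.25]))
# prints {0: 0.75, 9: 0.25}
```

Tracking these per-bucket averages over training shows whether the loss is improving at some noise levels even when the overall average looks flat.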
Thank you very much, I will check it out.
Hello author, I would like to ask a stupid question. I am still learning, and because of limited GPU resources I can't fit the VAE and MDT on the GPU at the same time. If I encode the training images (3, 256, 256) with the VAE in advance, save the latents, and then train MDT on the saved latents, only MDT needs to be on the GPU. Is this approach feasible? Thanks!