VAE with intermediate features takes up more GPU memory than original VAE

miccunifi / ladi-vton

[ACM MM 2023] - LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

VAE with intermediate features takes up more GPU memory than original VAE #48

Closed Ganzhi-e closed 5 months ago

Ganzhi-e commented 10 months ago

Hi, your work is wonderful! I have some questions.

I noticed that declaring val_pipe in the training code as an instance of StableDiffusionTryOnePipeline occupies a very large amount of GPU memory, and inference.py also occupies a large amount of GPU memory when running. It would be much better to replace the VAE with intermediate features with the original VAE. Have you noticed this? May I ask which GPU you use to run inference.py?

Thank you!

Aukture commented 5 months ago

Facing the same issue... any suggestions on how to solve it?

Ganzhi-e commented 5 months ago

Facing the same issue... any suggestions on how to solve it?

To solve the problem of losing details in faces and hands, you can use the repaint method from other works, i.e. x_result = x_generation * mask_tensor + (1.0 - mask_tensor) * x_source, where x_generation is the generated try-on result, x_source is the original person image, and mask_tensor is the inpaint mask.

In addition, directly using the inference code with EMASC occupies about 70 GB of GPU memory on an NVIDIA A100.

I hope this information is helpful to you.
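
The repaint step described above is an element-wise blend: the mask multiplies the generated image, and its complement multiplies the source image. Below is a minimal sketch, assuming the mask is 1 inside the inpainted region and 0 elsewhere, and that images are laid out as (batch, channels, height, width). NumPy is used here only for illustration; the same expression should work unchanged on PyTorch tensors. The function name `repaint` and the toy values are hypothetical, not taken from the repository.

```python
import numpy as np


def repaint(x_generation, x_source, mask_tensor):
    """Keep generated pixels where mask == 1 and source pixels where mask == 0."""
    # Clip so that a slightly out-of-range mask cannot over- or under-shoot the blend.
    mask = np.clip(mask_tensor, 0.0, 1.0)
    return x_generation * mask + (1.0 - mask) * x_source


# Toy example: one image, 3 channels, 2x2 pixels.
x_generation = np.full((1, 3, 2, 2), 0.8, dtype=np.float32)  # stand-in for the try-on result
x_source = np.full((1, 3, 2, 2), 0.2, dtype=np.float32)      # stand-in for the original person image

# Single-channel mask of shape (1, 1, 2, 2); it broadcasts over the 3 colour channels.
mask_tensor = np.array([[[[1.0, 0.0],
                          [0.0, 1.0]]]], dtype=np.float32)

x_result = repaint(x_generation, x_source, mask_tensor)
```

In this toy case the top-left and bottom-right pixels come from x_generation and the other two from x_source. Because the blend is applied once to the final image, it adds essentially no memory on top of the generation itself.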

omedivad commented 3 months ago

@Ganzhi-e Thank you for the suggestion. We actually tried the stitching method you proposed in the previous comment in the past.

The main problem with that stitching is that if the keypoint constraint doesn't work as expected (even if the final pose differs from the initial one by only a few pixels), the inpainted region will look copy-pasted onto the final image.

This is why we came up with the EMASC module.

I think the choice of stitching is a trade-off between image quality and computational constraints. It also depends on the data you want to apply the method to.

All training was performed on an A100.