miccunifi / ladi-vton

[ACM MM 2023] - LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

how to prepare text prompt #19

Closed Taited closed 1 year ago

Taited commented 1 year ago

Hi, thank you for your nice work. I would like to ask how to obtain the text prompts for training. It seems the VITON-HD dataset does not provide text prompts.

ABaldrati commented 1 year ago

Hi @Taited, thanks for your interest in our work!

As stated in #15:

In Figure 2 of the paper, you can see that the textual prompt q is a simple, predefined prompt like "a photo of a model wearing a dress," "a photo of a model wearing a lower body garment," or "a photo of a model wearing an upper body garment." This prompt serves as a starting point for the diffusion process. It is not tailored to each specific image in the dataset; rather, it provides a general direction for the model to follow during the virtual try-on task. We then use the textual inversion adapter $F_{\theta}$ to predict the pseudo-word embeddings associated with that specific garment. Finally, we condition the denoising network using the features extracted from the concatenation of the generic prompt plus the predicted pseudo-word embeddings.

However, in the 2nd row of Table 4, we also report an experiment that does not use the textual inversion technique and instead uses a textual description of the in-shop garment. You can find the textual description of each garment in the data/noun_chunks folder. To extract these textual descriptions, we follow the approach described in https://arxiv.org/abs/2304.02051
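
For readers who want to see the prompt construction described above in concrete form, here is a minimal sketch, not the repository's actual code. It only shows how a generic, category-level prompt could be combined with placeholder tokens that mark where the predicted pseudo-word embeddings would be substituted. The category keys, the `$` placeholder, the default number of pseudo-words, and the `build_prompt` helper are all assumptions made for illustration.

```python
# Hypothetical sketch: the category keys, the "$" placeholder and the number of
# pseudo-words are illustrative assumptions, not values taken from the repository.
GENERIC_PROMPTS = {
    "dresses": "a photo of a model wearing a dress",
    "lower_body": "a photo of a model wearing a lower body garment",
    "upper_body": "a photo of a model wearing an upper body garment",
}


def build_prompt(category: str, num_pseudo_words: int = 4, placeholder: str = "$") -> str:
    """Return the generic prompt for a category followed by placeholder tokens.

    The placeholders only mark positions in the prompt. In a textual-inversion
    setup, the embeddings at these positions would later be replaced by the
    pseudo-word embeddings predicted for the specific garment.
    """
    if category not in GENERIC_PROMPTS:
        raise ValueError(f"unknown garment category: {category!r}")
    if num_pseudo_words < 0:
        raise ValueError("num_pseudo_words must be non-negative")
    placeholders = " ".join([placeholder] * num_pseudo_words)
    # strip() removes the trailing space when no placeholders are requested
    return f"{GENERIC_PROMPTS[category]} {placeholders}".strip()


if __name__ == "__main__":
    print(build_prompt("upper_body", num_pseudo_words=2))
```

The point of the sketch is that the text part of the prompt depends only on the garment category, so no per-image caption is needed in the dataset. Everything garment-specific enters through the embeddings placed at the placeholder positions.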

Alberto