Hi, thank you for your nice work. I would like to ask how to obtain the text prompts for training, since the VITON-HD dataset does not seem to provide them.
Hi @Taited, thanks for your interest in our work!
As stated in #15:
In Figure 2 of the paper, you can see that the textual prompt $q$ is a simple, predefined prompt like "a photo of a model wearing a dress," "a photo of a model wearing a lower body garment," or "a photo of a model wearing an upper body garment." This prompt serves as a starting point for the diffusion process. It is not tailored to each specific image in the dataset; rather, it gives the model a general direction to follow during the virtual try-on task. We then use the textual inversion adapter $F_{\theta}$ to predict the pseudo-word embeddings associated with that specific garment. Finally, we condition the denoising network on the features extracted from the concatenation of the generic prompt and the predicted pseudo-word embeddings.
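To make this concrete, here is a minimal PyTorch sketch of the conditioning step. This is a simplified illustration, not the repository code: the adapter architecture, the number of pseudo-tokens, and names like `TextualInversionAdapter` and `garment_features` are placeholder assumptions.

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel


class TextualInversionAdapter(nn.Module):
    """Placeholder for F_theta: maps garment image features to n pseudo-word embeddings."""

    def __init__(self, feat_dim=768, token_dim=768, num_tokens=16):
        super().__init__()
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        self.proj = nn.Linear(feat_dim, num_tokens * token_dim)

    def forward(self, garment_features):              # (B, feat_dim)
        out = self.proj(garment_features)             # (B, num_tokens * token_dim)
        return out.view(-1, self.num_tokens, self.token_dim)


tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Embed the generic, predefined prompt q.
prompt = "a photo of a model wearing an upper body garment"
ids = tokenizer(prompt, return_tensors="pt").input_ids      # (1, T)
word_embeds = text_encoder.get_input_embeddings()(ids)      # (1, T, 768)

# 2) Predict pseudo-word embeddings for the specific garment.
adapter = TextualInversionAdapter()
garment_features = torch.randn(1, 768)  # stand-in for CLIP image features of the garment
pseudo_embeds = adapter(garment_features)                   # (1, 16, 768)

# 3) Concatenate the generic-prompt embeddings with the predicted
#    pseudo-word embeddings.
cond_embeds = torch.cat([word_embeds, pseudo_embeds], dim=1)  # (1, T + 16, 768)

# In the full pipeline, this concatenated sequence is passed through the CLIP
# text transformer (which also has to handle position embeddings and padding)
# and the result conditions the denoising network via cross-attention.
```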
However, the 2nd row of Table 4 also reports an experiment without the textual inversion technique, using instead a textual description of the in-shop garment. You can find the textual description of each garment in the data/noun_chunks folder.
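If you want to build prompts from those files, something along these lines should work; note that the filename and the JSON schema in this snippet are illustrative, so check the actual files in data/noun_chunks for the exact layout.

```python
# Illustrative loader for the noun-chunk files; the filename and the schema
# (a JSON dict: garment image name -> list of noun chunks) are assumed here.
import json

with open("data/noun_chunks/vitonhd_train.json") as f:  # hypothetical filename
    noun_chunks = json.load(f)

garment = "00001_00.jpg"  # hypothetical key following VITON-HD naming
prompt = "a photo of a model wearing " + ", ".join(noun_chunks[garment])
print(prompt)
```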
To extract these textual descriptions, we follow the approach described in https://arxiv.org/abs/2304.02051.
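As a rough sketch of what noun-chunk extraction looks like, an off-the-shelf parser such as spaCy can be used; whether this matches the exact pipeline is an assumption on the snippet's part, so please refer to the linked paper for details.

```python
# Sketch of noun-chunk extraction with spaCy; treating spaCy as the parser
# is an assumption here -- see the linked paper for the actual pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
caption = "a sleeveless red floral mini dress with a v-neck"
chunks = [chunk.text for chunk in nlp(caption).noun_chunks]
print(chunks)  # e.g. ['a sleeveless red floral mini dress', 'a v-neck']
```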
Alberto