miccunifi / ladi-vton

This is the official repository for the paper "LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On". ACM Multimedia 2023

Some questions about the training process #15

Open houjie8888 opened 1 year ago

houjie8888 commented 1 year ago

Thanks to the authors for such influential work.

  1. I would like to confirm whether the EMASC module is trained with a reconstruction loss, i.e. the $L_1$ and VGG losses mentioned in the paper. How is its input constructed: is it the model image $I$, or a mapping from $I$ to $\tilde{I}$?

  2. During the training of the enhanced Stable Diffusion pipeline, is the model image $I$ used as input? I ask because a sampling operation on the model image $I$ appears in Equation 3 and Equation 4.

  3. More broadly, I am curious how the training of diffusion-based models actually works. This is not limited to this work; I have the same confusion about other works, and I hope the authors can help. My understanding is that a diffusion model fits the distribution of the dataset, so what is the underlying principle when it is applied to the try-on task? What is it fitting in a given task? In other words, similar to the second question, how is the ground-truth picture, i.e. the model picture $I$, used effectively?

houjie8888 commented 1 year ago

Is $z_t$ obtained by adding noise to the model image $I$?

ABaldrati commented 1 year ago

Hi @houjie8888,

Thank you for your interest in our work!

To answer your questions:

  1. Yes, the EMASC module is trained using a reconstruction loss. The loss is computed between the reconstructed image ($\hat{I}$) and the original image ($I$). You can refer to Figure 3 for a detailed overview; a loss sketch also follows this list.

  2. During the training of the pipeline, the enhanced denoising UNet is provided with the following inputs: $\gamma=\left[z_t; m; \mathcal{E}(I_M) \right]$ and $\psi=\left[t; T_E(\hat{Y})\right]$. Here, $z_t$ is the encoded model image $\mathcal{E}(I)$ with added noise; see the sketch right after this list.

  3. In this case, the goal is to fit the distribution of the test set. However, it's important to note that the distribution is strongly conditioned by two factors: the in-shop cloth (achieved through textual inversion) and the warped cloth.
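
To make points 1 and 2 concrete, here is a minimal PyTorch-style sketch (not the repository's actual code: `emasc_loss`, `build_unet_inputs`, the `vgg_features` callable, and the default `DDPMScheduler` configuration are illustrative assumptions):

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler()  # standard DDPM noise schedule (illustrative default)

# Point 1 -- EMASC reconstruction objective (sketch):
# compare the EMASC-enhanced reconstruction I_hat against the original
# model image I with an L1 term plus a VGG-based perceptual term.
def emasc_loss(I_hat, I, vgg_features):
    # vgg_features: callable returning a list of VGG feature maps for an image
    l1 = F.l1_loss(I_hat, I)
    perceptual = sum(
        F.l1_loss(fh, f) for fh, f in zip(vgg_features(I_hat), vgg_features(I))
    )
    return l1 + perceptual

# Point 2 -- assembling the denoising UNet spatial input:
# gamma = [z_t ; m ; E(I_M)] is a channel-wise concatenation in latent space,
# while psi gathers the timestep t and the text features T_E(Y_hat), which the
# UNet receives through its timestep embedding and cross-attention.
def build_unet_inputs(vae, I, m, I_M, t):
    # I: model image, m: inpainting mask, I_M: masked model image,
    # t: tensor of sampled diffusion timesteps
    z0 = vae.encode(I).latent_dist.sample()            # E(I)
    noise = torch.randn_like(z0)
    z_t = scheduler.add_noise(z0, noise, t)            # z_t = noised E(I) at step t
    m_lat = F.interpolate(m, size=z0.shape[-2:])       # resize mask to latent resolution
    gamma = torch.cat([z_t, m_lat, vae.encode(I_M).latent_dist.sample()], dim=1)
    return gamma, noise                                # the UNet is trained to predict `noise`
```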

I hope this clarifies your doubts. If you have any more questions, feel free to ask.

Alberto

siarheidevel commented 1 year ago

Hello. How much garment information (texture, logos on the garment) is restored when training only the textual inversion stage? The image below is at 80,000 iteration steps, with 16 images per batch and 64 words; with fewer words the result is worse.

[image: textual inversion reconstruction results after 80,000 steps]

ABaldrati commented 1 year ago

@siarheidevel We did not use the classic iterative textual inversion approach. Instead, we trained a textual inversion adapter that performs the inversion in a single forward pass.
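
For intuition, here is a minimal sketch of what such a single-forward-pass adapter could look like (the class name, layer sizes, and dimensions are illustrative assumptions, not the repository's actual module):

```python
import torch
import torch.nn as nn

class TextualInversionAdapter(nn.Module):
    """Illustrative sketch of a single-forward-pass inversion adapter:
    it regresses n pseudo-token embeddings in the CLIP token-embedding
    space from the CLIP visual features of the in-shop garment."""

    def __init__(self, clip_dim=1024, token_dim=768, n_tokens=16):
        super().__init__()
        self.n_tokens = n_tokens
        self.token_dim = token_dim
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, n_tokens * token_dim),
        )

    def forward(self, clip_image_features):               # (B, clip_dim)
        v = self.net(clip_image_features)                 # (B, n_tokens * token_dim)
        return v.view(-1, self.n_tokens, self.token_dim)  # (B, n_tokens, token_dim)

# Usage: one forward pass per garment image, no per-image optimization loop.
adapter = TextualInversionAdapter()
pseudo_tokens = adapter(torch.randn(1, 1024))              # (1, 16, 768)
```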

gamingflexer commented 1 year ago

Thanks for the great work, guys! When can we expect the training code?

naivenaive commented 1 year ago

Thanks for your excellent work. I am confused when reading the following sentence from Section 3.2 (Textual Inversion) of the paper: "We first build a textual prompt $q$ that guides the diffusion process to perform the virtual try-on task, tokenize it and map each token into the token embedding space using the CLIP embedding lookup module, obtaining $V_q$." Does that mean your team manually makes a prompt ($q$) for each image in the dataset and uses it as a starting point in the following training? Please help to clarify. Many thanks.

ABaldrati commented 11 months ago

> Thanks for the great work, guys! When can we expect the training code?

We're totally planning to release the training code, but honestly, we don't have a schedule yet. So many things on our plate right now... Sorry about that!

ABaldrati commented 11 months ago

> Thanks for your excellent work. I am confused when reading the following sentence from Section 3.2 (Textual Inversion) of the paper: "We first build a textual prompt $q$ that guides the diffusion process to perform the virtual try-on task, tokenize it and map each token into the token embedding space using the CLIP embedding lookup module, obtaining $V_q$." Does that mean your team manually makes a prompt ($q$) for each image in the dataset and uses it as a starting point in the following training? Please help to clarify. Many thanks.

Hi @naivenaive, thank you for your interest in our work!

Regarding the sentence you mentioned in Section 3.2 of the paper, the process of building the textual prompt q is not done manually for each image in the dataset. Instead, it is a generic prompt that guides the diffusion process for the virtual try-on task.

In Figure 2 of the paper, you can see that the textual prompt q is a simple, predefined prompt like "a photo of a model wearing a dress," "a photo of a model wearing a lower body garment," or "a photo of a model wearing an upper body garment." This prompt serves as a starting point for the diffusion process. It is not tailored to each specific image in the dataset; rather, it provides a general direction for the model to follow during the virtual try-on task. We then use the textual inversion adapter $F_{\theta}$ to predict the pseudo-word embeddings associated with that specific garment. Finally, we condition the denoising network using the features extracted from the concatenation of the generic prompt plus the predicted pseudo-word embeddings.
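
In code terms, the flow is roughly the following (a sketch assuming the Hugging Face `transformers` CLIP classes; the 16 pseudo-tokens and the plain concatenation at the end are illustrative simplifications, not the exact implementation):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Generic, category-level prompt (identical for every upper-body garment).
prompt = "a photo of a model wearing an upper body garment"
ids = tokenizer(prompt, return_tensors="pt").input_ids          # (1, L)

# V_q: ordinary word embeddings of the prompt, via the CLIP embedding lookup.
v_q = text_encoder.get_input_embeddings()(ids)                  # (1, L, 768)

# Pseudo-word embeddings predicted by the inversion adapter F_theta in a
# single forward pass (random stand-in tensor here).
pseudo_tokens = torch.randn(1, 16, 768)

# Concatenate the generic prompt embeddings with the pseudo-word embeddings.
# The combined sequence is then run through the CLIP text transformer and its
# output conditions the denoising UNet via cross-attention (the exact merging
# in the actual code may differ from this simplified concatenation).
conditioning_input = torch.cat([v_q, pseudo_tokens], dim=1)     # (1, L+16, 768)
```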

I hope this clarifies any confusion. If you have any further questions, please feel free to ask.

Best regards, Alberto

gamingflexer commented 11 months ago

> Thanks for the great work, guys! When can we expect the training code?
>
> We're totally planning to release the training code, but honestly, we don't have a schedule yet. So many things on our plate right now... Sorry about that!

Sure, waiting for it. Take your time!

nazapip commented 11 months ago

> Thanks for your excellent work. I am confused when reading the following sentence from Section 3.2 (Textual Inversion) of the paper: "We first build a textual prompt $q$ that guides the diffusion process to perform the virtual try-on task, tokenize it and map each token into the token embedding space using the CLIP embedding lookup module, obtaining $V_q$." Does that mean your team manually makes a prompt ($q$) for each image in the dataset and uses it as a starting point in the following training? Please help to clarify. Many thanks.
>
> Hi @naivenaive, thank you for your interest in our work!
>
> Regarding the sentence you mentioned in Section 3.2 of the paper, the process of building the textual prompt q is not done manually for each image in the dataset. Instead, it is a generic prompt that guides the diffusion process for the virtual try-on task.
>
> In Figure 2 of the paper, you can see that the textual prompt q is a simple, predefined prompt like "a photo of a model wearing a dress," "a photo of a model wearing a lower body garment," or "a photo of a model wearing an upper body garment." This prompt serves as a starting point for the diffusion process. It is not tailored to each specific image in the dataset; rather, it provides a general direction for the model to follow during the virtual try-on task. We then use the textual inversion adapter $F_{\theta}$ to predict the pseudo-word embeddings associated with that specific garment. Finally, we condition the denoising network using the features extracted from the concatenation of the generic prompt plus the predicted pseudo-word embeddings.
>
> I hope this clarifies any confusion. If you have any further questions, please feel free to ask.
>
> Best regards, Alberto

I am still confused about how it works; please correct me if I am wrong.

So first you give a generic prompt for each target cloth image ("a photo of a model wearing a dress," "a photo of a model wearing a lower body garment," or "a photo of a model wearing an upper body garment", whichever suits that specific garment). Then the adapter $F_{\theta}$ predicts the visual appearance or texture details of the dress, the two are concatenated, and the denoising UNet is conditioned on the result? Is that what you meant?

Also, what does this architecture mean? Does the cloth image shown predict word embedding tokens along the lines of "A pink floral design on a white frock", or something similar?

I am assuming that V1, V2, ..., Vn* are this kind of word embedding, e.g. "A pink floral design on a white frock".

[image: textual inversion adapter architecture from the paper]

ABaldrati commented 10 months ago

@nazapip

The textual inversion adapter $F_{\theta}$ is trained to predict a set of pseudo-word tokens $V_1, \ldots, V_n$ that represent the in-shop image in the CLIP token embedding space.

> I am assuming that V1, V2, ..., Vn* are this kind of word embedding, e.g. "A pink floral design on a white frock".

It is likely that the meaning of the predicted embeddings is close to the embedding of the phrase "A pink floral design on a white frock". However, note that the predicted embeddings are more fine-grained, since they are continuous and not quantized (unlike the word embeddings of existing terms).
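
One way to see what "continuous, not quantized" means: a predicted pseudo-token lives in the same space as the CLIP word embeddings but is not restricted to any row of the vocabulary table, so its approximate meaning can only be inspected by looking at its nearest real words. A small sketch (the random `pseudo_token` is a stand-in for one predicted embedding, not an actual model output):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# The CLIP token-embedding table: one row per (quantized) vocabulary entry.
vocab_embeds = text_encoder.get_input_embeddings().weight.detach()  # (vocab_size, 768)

# Stand-in for one predicted pseudo-token V_i: a continuous vector in the
# same space, generally not equal to any row of the table.
pseudo_token = torch.randn(768)

# Inspect its approximate "meaning" via the nearest real word embeddings.
sims = F.cosine_similarity(pseudo_token.unsqueeze(0), vocab_embeds, dim=-1)
print(tokenizer.convert_ids_to_tokens(sims.topk(5).indices.tolist()))
```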