Question about ablation study

Hi, thanks for sharing your great work!

I'm very interested in exploring the application of LDM in virtual try-on and inspired by your work. But I'm confused by the second third row of Tab. 4 in your paper.

I notice the performance doesn't drop obviously with empty strings (row 1) or textual elements (row 2). How can I get textual elements? Maybe pass the garment images directly through VE to U-NET?

Moreover, why does the performance drop dramatically with f_theta? even much worse than empty strings?

Looking forward to your reply! Thank you again!

miccunifi / ladi-vton

Question about ablation study #46