vvictoryuki / FreeDoM

[ICCV 2023] Official PyTorch implementation for the paper "FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model"

Questions about qualitative Results #6

Open dixiyao opened 1 year ago

dixiyao commented 1 year ago

Hi, really awesome work! I have read your paper and found that in Table 1 you only compare your method with TediGAN. But as you mention in your related work, there are two other, stronger training-required methods: ControlNet and T2I-Adapter. How do the FID and CLIP scores compare against those two works? For T2I-Adapter on the COCO dataset with text+sketch conditions, the FID is 16.78. I also measured the FID of ControlNet on COCO (text+sketch) with only 1k images and got 6.09, yet Table 1 reports 70.97. So I'm a little confused, since your generation results look very good; judging from the generated figures, though, I think your dataset is not COCO. We are currently working on training efficiency and inference algorithms for ControlNet, and if such a training-required process could be replaced, there would be no motivation to keep working on training-efficiency algorithms for ControlNet. Thanks very much!
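For context, a measurement like this can be reproduced with a standard metric library. Below is a minimal sketch using `torchmetrics`, not my actual script; the batch shapes and random tensors are placeholders for illustration:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_images: torch.Tensor, fake_images: torch.Tensor) -> float:
    """FID between two uint8 image batches of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)  # InceptionV3 pool3 features
    fid.update(real_images, real=True)   # reference set, e.g. COCO images
    fid.update(fake_images, real=False)  # generated set, e.g. ControlNet outputs
    return fid.compute().item()

# Random tensors just to show the call pattern; real usage would load the
# 1k COCO images and the matching text+sketch generations instead.
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
print(compute_fid(real, fake))
```

One caveat: FID estimated from only 1k samples is known to be biased and noisy, so such numbers are best compared only when sample sizes and preprocessing match.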

vvictoryuki commented 1 year ago

@dixiyao Thank you for recognizing and paying attention to our work! With regard to the issue you raised, I will attempt to address it from the following perspectives:

(1) Firstly, in Table 1 we compare results on aligned face datasets (FFHQ and CelebA-HQ). The unconditional diffusion model we used was also trained on these datasets, whose size and diversity may not match those of datasets like COCO; this could partially explain why the FID values differ so much from those reported on COCO.

(2) We only presented examples of stylization and face-swapping with latent-diffusion-based models (such as Stable Diffusion and ControlNet) because we found those experimental settings (e.g., the learning rate) relatively easy to tune while still yielding satisfactory results. Handling fine-grained conditional information such as sketches in Stable Diffusion, however, is more difficult.

(3) We fully agree with your point: if we have control methods that do not require training, why use training-required methods like ControlNet at all? However, my current view is that although training-free methods have significant efficiency advantages, they also have significant limitations (such as the difficulty of using FreeDoM to control sketch conditions in Stable Diffusion, mentioned in (2)). Therefore, I believe the future trend may be a collaboration between training-free and training-required methods: for example, ControlNet could be used to precisely control sketch conditions while FreeDoM efficiently controls style information (we were pleasantly surprised by the experimental results in this respect), as sketched below. I think this is a reasonable expectation, and it also indicates that our efforts and attempts in FreeDoM are valuable.
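For concreteness, the kind of training-free guidance FreeDoM performs can be sketched roughly as follows. This is a simplified illustration rather than the repository's actual code: it assumes an ε-prediction `unet`, a cumulative noise schedule `alpha_bar`, and a differentiable `energy_fn` (e.g., a CLIP feature distance to a style image); the real implementation additionally uses time-dependent step sizes and a time-travel strategy.

```python
import torch

@torch.enable_grad()
def guided_ddim_step(x_t, t, t_prev, unet, energy_fn, alpha_bar, rho=0.1):
    """One energy-guided DDIM step (eta = 0): take the unconditional update,
    then subtract a scaled gradient of the energy measured on the predicted
    clean image x0_hat. `rho` is a hypothetical guidance scale."""
    x_t = x_t.detach().requires_grad_(True)
    eps = unet(x_t, t)                                      # predicted noise (assumed signature)
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_hat = (x_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean image
    x_prev = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps
    grad = torch.autograd.grad(energy_fn(x0_hat).sum(), x_t)[0]  # guidance direction
    return (x_prev - rho * grad).detach()
```

In the hybrid setup described above, `unet` would be the ControlNet-conditioned noise predictor handling the sketch, while `energy_fn` would inject the style condition with no additional training.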

Finally, I hope these answers can help address your questions to some extent!

dixiyao commented 1 year ago

Thanks very much for your patient answer! I completely agree with your view of combining training-free and training-required methods. I think both kinds of methods are task-specific to some extent. For example, for a task like face-swapping, if I want to train a condition on Elon Musk's face, I may need N training samples of his face. Although I have found that ControlNet can achieve relatively good performance even with 100 training samples, it is very hard to get 100 images of a single person's face. I'm not sure if my understanding is correct, but if that is the case, a training-free method should be the better choice, while, as you said, for some other tasks we should use training-required methods.