Closed: Bigfield77 closed this issue 3 months ago
Would a smaller model distilled from a larger model (e.g., an 8B distilled from a 33B) be as good as or better than a model trained natively at the smaller size?
We don't have much experience with this. But method-wise, the generator's architecture is not tied to the teacher's, so in theory we can use a smaller model for the generator. We could potentially also apply pruning or quantization. One thing to note is that the smaller model should be initialized in some way (either with the denoising loss or the ODE regression loss from our v1) to make training as smooth as in the current setting.
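To make the initialization point concrete, here is a minimal toy sketch of ODE-regression-style initialization: before any distribution-matching training, a (smaller) student is fit by regressing its outputs onto the teacher's outputs on shared noise inputs. Everything here is illustrative (linear maps standing in for real generators, names of my own choosing), not code from the repo.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy "teacher" one-step generator: a fixed linear map from noise z to samples x.
W_teacher = rng.normal(size=(dim, dim))

def teacher_generate(z):
    return z @ W_teacher.T

# Student initialization by regression onto teacher outputs (the
# "ODE regression" style init mentioned above), so the later
# distribution-matching phase starts close to the teacher.
z = rng.normal(size=(1024, dim))   # shared noise inputs
x_teacher = teacher_generate(z)    # teacher's one-step outputs

# Least-squares fit of the student map: minimize ||z W^T - x_teacher||^2
W_student_T, *_ = np.linalg.lstsq(z, x_teacher, rcond=None)
W_student = W_student_T.T

init_err = np.mean((z @ W_student.T - x_teacher) ** 2)
rand_err = np.mean((z @ rng.normal(size=(dim, dim)).T - x_teacher) ** 2)
# Regression init starts far closer to the teacher than a random init.
```

In this toy the teacher is exactly linear, so the regression recovers it almost perfectly; with real networks the fit is approximate, but the point is the same: the student starts near the teacher's output distribution rather than from scratch.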
Does distillation preserve compatibility with LoRAs and ControlNets?
This one is hit or miss. For instance, the 4-step SDXL model currently seems to work well with T2IAdapter but only okay with custom models. In real usage, I normally use a ControlNet as our teacher model and then distill multiple types of control into the distilled model using the current DMD loss. This works very well. At the end of the day, it seems some training still helps (instead of just plugging in the adapter).
Thanks for your answers!
Hi, thank you for this repository. Can you explain the ControlNet distillation in more detail?
> In real usage, I normally make control net as our teacher model and then distill multiple types of control into the distilled model using the current DMD loss.
How can I use a ControlNet as the teacher model and distill it into the distilled model? And how do you train the ControlNet? Is there any code for this?
We first train the ControlNet the standard way (with the original diffusion model). After that, we replace the generator / real_score / fake_score with ControlNet-augmented versions, then train the generator with the current DMD loss and the fake_score with the current denoising loss. All three networks get the extra conditioning input. It might be helpful to jointly train for both text-to-image generation and controlled generation for the best performance.
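The training loop described above can be sketched schematically. This is a toy numpy version under heavy assumptions: the three networks (generator / real_score / fake_score, named as in the answer) are stand-in linear maps, the control signal is concatenated as the extra conditioning input, and the DMD gradient is taken as the score difference (the sign convention is my assumption). It only illustrates the loop structure, not the repo's actual implementation, which is based on an internal model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

class ToyNet:
    """Stand-in for a diffusion UNet + ControlNet branch: a linear map
    over the sample concatenated with the control conditioning."""
    def __init__(self):
        self.W = rng.normal(size=(2 * dim, dim)) * 0.1

    def __call__(self, x, control):
        # Extra conditioning input: concatenate the control signal.
        return np.concatenate([x, control], axis=-1) @ self.W

generator = ToyNet()    # student being distilled
real_score = ToyNet()   # frozen teacher (diffusion model + ControlNet)
fake_score = ToyNet()   # score model for the student's own distribution

lr = 1e-2
for step in range(100):
    z = rng.normal(size=(32, dim))
    control = rng.normal(size=(32, dim))  # e.g. edge/depth-map features

    # 1) Generator update via the DMD gradient: the difference between
    #    the real and fake scores at the generated samples (assumed sign).
    x_g = generator(z, control)
    dmd_grad = real_score(x_g, control) - fake_score(x_g, control)
    gen_in = np.concatenate([z, control], axis=-1)
    generator.W -= lr * gen_in.T @ dmd_grad / len(z)

    # 2) fake_score update via a denoising-style regression on the
    #    student's current samples, so it tracks the student distribution.
    fs_in = np.concatenate([x_g, control], axis=-1)
    pred = fake_score(x_g, control)
    fake_score.W -= lr * fs_in.T @ (pred - x_g) / len(z)
```

The key structural points this preserves from the answer: real_score stays frozen, the generator is trained only through the score difference, the fake_score chases the generator's own outputs with a denoising-style loss, and all three networks receive the control conditioning.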
Unfortunately, it is based on an internal model that we can't release.
Hello, thanks for your time!