lllyasviel / ControlNet


promptless training #506

Closed gurkirt closed 1 year ago

gurkirt commented 1 year ago

I have pairs of source and target images which are a little different from each other. I want to train a ControlNet to condition target generation on a given source image, but I do not have prompts. Would promptless or null-prompt training work? E.g., generating a new frame of a video given an existing frame.

Any help would be greatly appreciated. Gurkirt
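For illustration, here is a minimal sketch of what a null-prompt setup could look like, assuming the `tutorial_dataset.py` / `tutorial_train.py` format from this repo; the `source/` and `target/` folder layout and the empty-string prompt are assumptions for this example, not something confirmed in this thread.

```python
# Hypothetical promptless dataset in the style of tutorial_dataset.py:
# every sample gets an empty prompt, so training has to rely on the
# conditioning image (the "source" frame) alone.
import os
import cv2
import numpy as np
from torch.utils.data import Dataset


class PromptlessPairDataset(Dataset):
    def __init__(self, root):
        # assumed layout: root/source/xxx.png (conditioning frame),
        #                 root/target/xxx.png (frame to generate)
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, 'source')))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        source = cv2.imread(os.path.join(self.root, 'source', name))
        target = cv2.imread(os.path.join(self.root, 'target', name))

        source = cv2.cvtColor(source, cv2.COLOR_BGR2RGB)
        target = cv2.cvtColor(target, cv2.COLOR_BGR2RGB)

        # same normalization as the tutorial dataset: hint in [0, 1], target in [-1, 1]
        source = source.astype(np.float32) / 255.0
        target = (target.astype(np.float32) / 127.5) - 1.0

        # empty prompt = "null prompt" training
        return dict(jpg=target, txt="", hint=source)
```

Whether an empty string or a generic fixed caption works better is something you would have to test; the discussion below suggests purely promptless training is hard.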

geroldmeisinger commented 1 year ago

Yes, they mention this in the original paper and the GitHub READMEs, although the explanation is not very detailed. I collected some information here: https://civitai.com/articles/2078#heading-3617 -> "Q: Why does a ControlNet need captions, or why would you drop captions?"

What you are looking for has already been done in video2video ControlNet work using optical flow, see https://arxiv.org/abs/2307.14073, and AnimateDiff might also be of interest.

geroldmeisinger commented 1 year ago

All duplicates about "dropping prompts":

https://github.com/lllyasviel/ControlNet/issues/93
https://github.com/lllyasviel/ControlNet/issues/160
https://github.com/lllyasviel/ControlNet/issues/246
https://github.com/lllyasviel/ControlNet/issues/422
https://github.com/lllyasviel/ControlNet/issues/506

gurkirt commented 1 year ago

@geroldmeisinger, if I understand the link you shared correctly, it says that you need prompts for at least 50% of the training data (for anime even more), and that classifier-based guidance can only be used at test time. Is that correct?

geroldmeisinger commented 1 year ago

You could also try BLIP to automatically add captions to your images (or look at how they did it in conceptual_captions). I didn't have any luck even with only a 50% drop on 360k images(!), see my report on training a canny alternative here https://github.com/lllyasviel/ControlNet/discussions/318 and the evaluation images in the .zip here https://huggingface.co/GeroldMeisinger/control-edgedrawing/tree/main . So I'd say it depends on your concept (see the "anime argument") and on how many images you have (360k wasn't enough for an edge-map-based ControlNet).
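As a rough illustration of the BLIP suggestion, here is a sketch using the Hugging Face `transformers` BLIP captioning model; the checkpoint name and the `target/` folder are assumptions, and the original conceptual_captions setup may have used a different captioning pipeline.

```python
# Sketch: auto-caption a folder of training images with BLIP via transformers.
# The checkpoint "Salesforce/blip-image-captioning-base" is an assumption; any
# BLIP captioning checkpoint with the same interface should work.
import os
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# print one caption per target image, e.g. to build a prompt.json-style file
for name in sorted(os.listdir("target")):
    print(name, caption_image(os.path.join("target", name)))
```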

The way I understand prompt dropping: if you are very explicit with your captions, you train the ControlNet to strictly follow your prompts, which conversely means there is little room left for it to freely interpret anything you didn't specify. If you drop some prompts, you train it to be more independent and to fill out ambiguous areas with meaningful content by itself.
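A minimal sketch of what such prompt dropping could look like in the training dataloader; the 50% rate and the placement in `__getitem__` are assumptions for illustration, and the repo's own training scripts may handle this differently.

```python
import random

# Sketch: drop the caption for a fraction of samples so the model learns to
# fill in unspecified details from the conditioning image alone.
DROP_RATE = 0.5  # assumed rate; the linked article discusses roughly 50%

def maybe_drop_prompt(prompt, drop_rate=DROP_RATE):
    # an empty string stands in for the "null" prompt
    return "" if random.random() < drop_rate else prompt
```

In a dataset like the one sketched earlier, you would call `maybe_drop_prompt(prompt)` when building the `txt` field of each sample.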

"classifier-based guidance can only be used at test time"

What do you mean by "test time"?

geroldmeisinger commented 1 year ago

You should also look into the "reference-only" ControlNet. It's not a model but an "algorithmic" ControlNet, so it didn't have any prompts during training, because there was no training at all. And it fits your task of "source and target image are similar". But from your initial post it appears you want to make a video ControlNet. Please do some research first, as there are already some good approaches. AnimateDiff is really hot right now (other keywords: VideoControlNet, TemporalNet, TemporalKit, WarpFusion, etc.). I also just came across "VideoDirectorGPT", but haven't looked into it yet.

gurkirt commented 1 year ago

@geroldmeisinger, thanks for your feedback. I understand there is relevant work there, and I will take a look at it. I have already looked into reference-only, but it doesn't fit my project; however, some of the other suggestions might. Thanks - gurkirt