GLIGEN: Open-Set Grounded Text-to-Image Generation

More of a thought than an issue #1

Shikamaru5 opened this issue 1 year ago

Shikamaru5 commented 1 year ago

If I'm understanding correctly, you're suggesting training a new model with an added layer or conditional NN bolted onto the back of a pretrained ancestor model. What I'm wondering is why use the pretrained model at all: if you're training a model anyway, with new or the same data, why not start fresh with just a bounding-box layer?
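For concreteness, my rough mental picture of that kind of added layer is something like the sketch below (PyTorch, entirely my own guess, not code from the paper or this repo): the pretrained blocks stay frozen, and a new attention layer lets the visual tokens attend to extra grounding tokens through a learnable gate.

```python
# My own sketch of an "added layer" on a frozen backbone; not the paper's code.
import torch
import torch.nn as nn

class GatedGroundingAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at zero, so at initialization this layer is a no-op
        # and the pretrained model's behavior is untouched.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, grounding_tokens: torch.Tensor) -> torch.Tensor:
        # x: visual tokens (B, N, dim); grounding_tokens: encoded boxes/phrases (B, M, dim)
        ctx = torch.cat([x, grounding_tokens], dim=1)
        out, _ = self.attn(self.norm(x), ctx, ctx)
        return x + torch.tanh(self.gate) * out
```

With the gate initialized at zero, the combined model starts out behaving exactly like the pretrained one, which I assume is what makes this kind of retrofit safe.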

Another thing I was wondering: do you think it may be possible to have a layer in which you give almost explicit general rules for things, such as "humans have only five fingers"? I had considered doing this, but instead of using bounding boxes I'd have explicitly stated rules such as "pay attention to nouns and adjectives" or "follow sentence structure to determine the directive of the prompt." I even wrote up a list of the types of rules that might be applied in such a model.

How many bounding boxes does it generally use? For example, could you get it to put bounding boxes around pretty much everything, down to individual fingers or eyes?

Or is that the reason for the pretrained model: it already has knowledge of what certain things are, so it can label the bounding boxes instead of someone having to do it manually for millions of images? If that's the case, it would be interesting to apply a similar technique to build a service that annotates images for any dataset, to help augment the training of other people's models.
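If that's roughly how it works, the annotation-service idea could be sketched with an off-the-shelf detector, something like the snippet below (the detector choice and score threshold are just my placeholders, nothing from the paper):

```python
# Sketch of the auto-annotation idea with a stock torchvision detector;
# the model choice and score threshold are illustrative placeholders.
import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def annotate(image_path: str, score_thresh: float = 0.7):
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        pred = detector([img])[0]          # dict with boxes, labels, scores
    keep = pred["scores"] > score_thresh
    # Pixel-space [x1, y1, x2, y2] boxes plus integer class ids, which could
    # then be paired with caption phrases as grounding input for training.
    return pred["boxes"][keep], pred["labels"][keep]
```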

This is really fascinating work, though. I'm excited to see where it can go, and thanks for letting me read about it in the paper and rant a little.

lychees commented 1 year ago

Is it possible to make this into an extension for the SD WebUI?

Shikamaru5 commented 1 year ago

Well, if I understood the paper correctly, it trains a new model using the text-to-image generator as a backbone, so if you were training a new model on top of an SD model (or any other) and then creating your own WebUI with it, I'd imagine so. Essentially you'd be inserting this technique into the training process so that the model learns to differentiate specific parts of an image, and so that when you prompt it, it won't ignore a great deal of the prompt.

The trouble is that training a good-quality model requires a great deal of compute or a specialized setup. I've been working with techniques such as DeepSpeed and ColossalAI to make better use of my GPU, because I only have the one. I hope that was a sufficient answer, or perhaps I'm missing details about what you're trying to accomplish?
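In code terms, the single-GPU setup I have in mind looks roughly like the sketch below: freeze the pretrained backbone, optimize only the new layers, and use mixed precision to keep memory down. The names `unet`, `grounding_layers`, and `loss_fn` are placeholders, not anything from this repo.

```python
# Rough sketch only: freeze the pretrained backbone, optimize just the new
# grounding layers, and use mixed precision to reduce single-GPU memory.
import torch

def train_grounding_layers(unet, grounding_layers, dataloader, loss_fn, lr=1e-4):
    for p in unet.parameters():
        p.requires_grad_(False)                    # backbone stays frozen

    optimizer = torch.optim.AdamW(grounding_layers.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler()

    for batch in dataloader:
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = loss_fn(unet, grounding_layers, batch)  # hypothetical loss
        scaler.scale(loss).backward()              # gradients only reach the new layers
        scaler.step(optimizer)
        scaler.update()
```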

ethansmith2000 commented 1 year ago

@Shikamaru5 I think starting from scratch would do well too. This paper is definitely one of the more impressive hacks I've seen of SD, and overall I think it's a step in the right direction for image gen, making use of grounding boxes for composition.

But I think part of what's impressive about the paper is that you could train a smaller layer and fuse it in. I would imagine the net compute for that is far lower than what was required for SD, and it's a pretty cool trick overall to extend a model's capabilities by that much without having to train from scratch.
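A quick back-of-the-envelope way to see the compute argument is to compare parameter counts, something like this (where `unet` and `grounding_layers` stand in for whatever modules you actually have):

```python
# Rough check on why the added-layer route is cheap: the trainable part is a
# small fraction of the frozen backbone. `unet` and `grounding_layers` are
# placeholders for whatever model objects you are working with.
def count_params(module) -> int:
    return sum(p.numel() for p in module.parameters())

# e.g. print(count_params(grounding_layers) / count_params(unet))
```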

Shikamaru5 commented 1 year ago

@ethansmith2000 I agree, it's definitely a really interesting approach. I feel like this sort of thing is where the more impactful parts of the field are headed.