[idea] Text Caption to Segmentation Map Generator

Great work on this! I hope it's ok to drop an idea here.

I'm wondering if any existing model can generate a labelled segmentation map from a prompt?

You mentioned MULAN and layerdiffusion under Regional Prompter (6), but maybe generating just a labelled segmentation mask from a prompt is possible today with minimal fine-tuning?

To be useful, the generated map would need to be more coherent, semantically correct, and plausible than the layout a text-to-image diffusion model arrives at on its own; otherwise it'd offer no advantage over just running that model.

Building on the work of Segment Anything and Grounding DINO (https://github.com/IDEA-Research/Grounded-Segment-Anything) as well as CogVLM, it should at least be possible to prepare such a dataset and train a model or a ControlNet model, although I realise that would be more expensive.

A hacky approach could be to pre-generate 10+ low-res images with minimal steps, run Grounded Segment Anything over them, then pick the best one to use as the guidance for the hi-res pass.
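
To make the hacky approach a bit more concrete, here is a rough sketch of the selection loop. Everything in it is an assumption on my side: `generate` and `segment` are hypothetical callables standing in for a low-step text-to-image call and a Grounded Segment Anything wrapper, the `Segmentation` container is made up for illustration, and the scoring heuristic (how many of the wanted labels were found, plus a small bonus for labelled pixel coverage) is just one possible way to define "best".

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple


@dataclass
class Segmentation:
    # Hypothetical container: label_map[y][x] is 0 for unlabelled pixels,
    # or i for the region named labels[i - 1].
    label_map: List[List[int]]
    labels: List[str]


def score_segmentation(seg: Segmentation, wanted: Sequence[str]) -> float:
    """Assumed heuristic: recall of wanted labels plus a small coverage bonus."""
    flat = [v for row in seg.label_map for v in row]
    present = {seg.labels[v - 1] for v in set(flat) if v > 0}
    wanted_set = set(wanted)
    recall = len(present & wanted_set) / max(len(wanted_set), 1)
    coverage = sum(1 for v in flat if v > 0) / max(len(flat), 1)
    return recall + 0.1 * coverage


def pick_best_layout(
    prompt: str,
    wanted: Sequence[str],
    generate: Callable[[str, int], Any],                  # (prompt, seed) -> low-res image
    segment: Callable[[Any, Sequence[str]], Segmentation],  # (image, labels) -> segmentation
    n_candidates: int = 10,
) -> Tuple[float, int, Segmentation]:
    """Generate n cheap candidates, segment each, return (score, seed, segmentation) of the best."""
    best: Optional[Tuple[float, int, Segmentation]] = None
    for seed in range(n_candidates):
        image = generate(prompt, seed)
        seg = segment(image, wanted)
        s = score_segmentation(seg, wanted)
        # Strict comparison keeps the earliest seed on ties.
        if best is None or s > best[0]:
            best = (s, seed, seg)
    if best is None:
        raise ValueError("n_candidates must be at least 1")
    return best
```

The returned segmentation (or the seed that produced it) would then be what gets passed on as guidance for the hi-res generation. The scoring function is the part I'd expect to need the most tuning.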

I'm sure you've already thought of this as you have already done the coloured depth map and listed 6 approaches!

lllyasviel / Omost
