KaruroChori opened this issue 1 year ago
Edit: this was a reply to a post later removed.
In theory it would be easy to automatically generate images for training. Blender has good Python API coverage, and after some initial setup every step can be automated. We could prepare a small but diverse set of scenes (the Blender Foundation makes many available). For each scene we set up a list of positions and orientations for the camera to be moved and aligned to, enable the two passes we are interested in plus the normal full render, and profit.
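A minimal sketch of what that automation could look like, assuming a .blend file where the passes we want are already enabled on the active view layer; the camera positions, target point and output path are placeholders:

```python
import bpy
from mathutils import Vector

scene = bpy.context.scene
scene.render.resolution_x = 512
scene.render.resolution_y = 512

# Multilayer EXR keeps the combined image plus every enabled pass
# (depth, cryptomatte, ...) together in a single file per render.
scene.render.image_settings.file_format = 'OPEN_EXR_MULTILAYER'

cam = scene.camera
target = Vector((0.0, 0.0, 0.0))  # point of interest in the scene (placeholder)

# Placeholder camera positions; in practice these would be sampled per scene.
camera_positions = [
    Vector((6.0, -6.0, 3.0)),
    Vector((-5.0, 4.0, 2.5)),
    Vector((0.0, -8.0, 1.5)),
]

for i, pos in enumerate(camera_positions):
    cam.location = pos
    # Aim the camera at the target point.
    direction = target - cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()

    scene.render.filepath = f"//renders/view_{i:04d}.exr"
    bpy.ops.render.render(write_still=True)
```

Run headless with something like `blender scene.blend --background --python render_views.py` to batch it over many scenes.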
All we need is 200k of these examples.
512x512? It is feasible, even more so with a few cards that support OptiX. Actually, Eevee got support for Cryptomatte a few years ago, so we could avoid Cycles and speed up the rendering process quite a bit.
The main concern would be tagging the final images.
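For reference, a sketch of the engine and pass settings that would go with the script above (property names as in recent Blender versions; Eevee's engine identifier has changed across releases, so treat it as an assumption):

```python
import bpy

scene = bpy.context.scene
view_layer = bpy.context.view_layer

# Eevee instead of Cycles for much faster renders.
scene.render.engine = 'BLENDER_EEVEE'

# Passes we want saved alongside the combined image.
view_layer.use_pass_z = True                     # depth
view_layer.use_pass_cryptomatte_object = True    # per-object mattes
view_layer.use_pass_cryptomatte_material = True  # per-material mattes
```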
Automated captioning by BLIP or whatever?
I don't have any experience with it, but it seems good from what I have seen.
Wouldn't it be possible to have a more detailed caption? "FF0000: gray car, 00FF00: glass, 0000FF: parking lot"
Basically a material list exported from Blender with at least the albedo and the material label? I do not have access to my main workstation at the moment, but next week I would like to see what is feasible in this respect. We also need to cope with the limitations of the text model used by Stable Diffusion, and I am not sure that will be easy.
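A rough sketch of how such a list could be pulled out of a scene with bpy: it reads each material's base colour (from the Principled BSDF when the material uses nodes) and emits one "HEXCOLOR: material name" entry. Note that these hex codes come from the albedo, not from Cryptomatte's internal IDs, so how they line up with the matte colours is left open here:

```python
import bpy

def material_caption(scene):
    """Build a caption like 'FF0000: car paint, 00FF00: glass, ...'
    from the materials used in the scene."""
    entries = []
    seen = set()
    for obj in scene.objects:
        for slot in obj.material_slots:
            mat = slot.material
            if mat is None or mat.name in seen:
                continue
            seen.add(mat.name)

            # Default to the viewport colour, fall back to the
            # Principled BSDF base colour when the material uses nodes.
            rgba = list(mat.diffuse_color)
            if mat.use_nodes:
                for node in mat.node_tree.nodes:
                    if node.type == 'BSDF_PRINCIPLED':
                        rgba = list(node.inputs['Base Color'].default_value)
                        break

            hex_code = "".join(f"{int(round(c * 255)):02X}" for c in rgba[:3])
            entries.append(f"{hex_code}: {mat.name}")
    return ", ".join(entries)

print(material_caption(bpy.context.scene))
```

The material names would still need to be mapped to human-readable labels ("gray car", "glass", ...), and long lists will quickly hit the 77-token limit of the CLIP text encoder, which is the Stable Diffusion limitation mentioned above.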
Also see the "double control" discussion here: https://github.com/lllyasviel/ControlNet/discussions/30
It would be great if we could use Cryptomatte and depth passes generated by a rendering engine (e.g. Blender) and use their combined information to inform the final "rendering" via ControlNet. This would be somewhat similar to combining the depth and segmentation maps as they are currently implemented.
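For the "combined information" part, the multi-ControlNet support in Hugging Face diffusers (not this repo's code) already lets a depth map and a segmentation-style map condition the same generation; a sketch, assuming the Blender passes have been converted to plain image files (paths and prompt are placeholders):

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# One ControlNet per conditioning signal.
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16),
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# Depth and segmentation maps exported from Blender (placeholder paths).
depth_map = load_image("renders/view_0000_depth.png")
seg_map = load_image("renders/view_0000_crypto.png")

image = pipe(
    "a gray car in a parking lot",
    image=[depth_map, seg_map],
    controlnet_conditioning_scale=[1.0, 0.8],
    num_inference_steps=30,
).images[0]
image.save("controlled_render.png")
```

The off-the-shelf seg model expects ADE20K-style colour coding rather than arbitrary cryptomatte colours, so a ControlNet trained on the Blender-generated maps discussed above would still be the real goal; this only shows that combining the two conditions at inference time is straightforward.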