lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch
MIT License

how to train on varied aspect ratios or images smaller than Unet #147

Open Birch-san opened 2 years ago

Birch-san commented 2 years ago

Hey, thanks as always for your work on imagen-pytorch. 🙂

What's a good way to handle aspect ratio? The Booru-chars dataset I'd like to train on has full-body images in a variety of aspect ratios.

One idea is to zoom in (for example, on the face) until the image fits a 1:1 aspect ratio.
This works if you're okay with losing legs and with running face detection.
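The crop idea could be sketched like this, using PIL (the face-detection part is omitted, so this is just a centered crop as a stand-in):

```python
from PIL import Image

def center_crop_square(img: Image.Image) -> Image.Image:
    """Crop the largest centered square from an image.

    A stand-in for a smarter face-centered crop, which would need a
    face detector to pick the crop origin instead of the center.
    """
    w, h = img.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    return img.crop((left, top, left + side, top + side))
```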

Another idea is to treat aspect ratio as an "augmentation", and just ensure that the text caption describes it.
This works if the text encoder supports the variety of aspect ratios you're expecting. Booru-chars is sharded by approximate aspect ratio, so this is sort of possible.
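The caption-augmentation idea could look something like this (the bucket names and thresholds are hypothetical, not anything Booru-chars or imagen-pytorch defines):

```python
def tag_aspect_ratio(caption: str, width: int, height: int) -> str:
    """Append a coarse aspect-ratio tag to a caption so the model can
    condition on it. Bucket names/thresholds here are made up."""
    ratio = width / height
    if ratio > 1.2:
        tag = "wide"
    elif ratio < 0.8:
        tag = "tall"
    else:
        tag = "square"
    return f"{caption}, {tag} aspect ratio"
```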

Other options are padding and stretching, with some worry about what the model would learn from that.
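A padding approach could also produce the mask asked about below, marking which pixels are real. A minimal PyTorch sketch (this helper is not part of imagen-pytorch):

```python
import torch
import torch.nn.functional as F

def pad_to_square(img: torch.Tensor):
    """Zero-pad a (C, H, W) image tensor to a square, and return a
    (1, side, side) mask that is 1 over real pixels, 0 over padding."""
    _, h, w = img.shape
    side = max(h, w)
    pad_h, pad_w = side - h, side - w
    # F.pad's 4-tuple pads the last two dims: (left, right, top, bottom)
    padded = F.pad(img, (0, pad_w, 0, pad_h))
    mask = F.pad(torch.ones(1, h, w), (0, pad_w, 0, pad_h))
    return padded, mask
```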

But I wondered whether there's a way to simply tell imagen-pytorch not to learn from certain pixels, by passing in a mask?

This would also be useful for supporting data smaller than the largest Unet. For example, it'd be nice to teach it frames from Walfie gifs, but those are (for example) 312x312, so they could be smaller than a 512x512 Unet. It's easy to pad them, but training would be cheaper if I could tell the model that padding pixels can be ignored.
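For reference, the kind of masked loss being asked for might look like this in plain PyTorch. This is only a sketch of the idea; imagen-pytorch does not expose such a mask argument in its loss as far as I know:

```python
import torch

def masked_mse_loss(pred: torch.Tensor,
                    target: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    """MSE averaged over only the valid (mask == 1) pixels.

    mask is broadcastable to pred/target, e.g. (B, 1, H, W) with
    1 for real pixels and 0 for padding, so padding contributes
    nothing to the gradient.
    """
    diff = (pred - target) ** 2 * mask
    return diff.sum() / mask.sum().clamp(min=1)
```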

Breeze-Zero commented 2 years ago

I also have this requirement. At present I'm working with a dataset of 1536×320 images, and each image's size varies around that ratio. Because these are not natural images, they can't simply be resized. Before I can use Imagen, I first have to solve the aspect-ratio inconsistency and the fact that the images are larger than 1024. Hopefully that will be solved.