openai / glide-text2im

GLIDE: a diffusion-based text-conditional image synthesis model
MIT License

Higher Resolution #13

Closed jamahun closed 2 years ago

jamahun commented 2 years ago

Is there a way to upsize the outputs to something closer to 1024px? I've noticed a few people on twitter that have been able to do so with this model but after trying to change the image size to a higher value I get this error for anything over 256 -

/usr/local/lib/python3.7/dist-packages/glide_text2im/model_creation.py in create_model(image_size, num_channels, num_res_blocks, channel_mult, attention_resolutions, num_heads, num_head_channels, num_heads_upsample, use_scale_shift_norm, dropout, text_ctx, xf_width, xf_layers, xf_heads, xf_final_ln, xf_padding, resblock_updown, use_fp16, cache_text_emb, inpaint, super_res)
    140             channel_mult = (1, 2, 3, 4)
    141         else:
--> 142             raise ValueError(f"unsupported image size: {image_size}")
    143     else:
    144         channel_mult = tuple(int(ch_mult) for ch_mult in channel_mult.split(","))

ValueError: unsupported image size: 1024

woctezuma commented 2 years ago

The error happens at line 142 because channel_mult is "" and image_size is not one of 256, 128, 64.

https://github.com/openai/glide-text2im/blob/9cc8e563851bd38f5ddb3e305127192cb0f02f5c/glide_text2im/model_creation.py#L134-L145

Maybe try to set channel_mult to a non-empty string.
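For context, here is a minimal sketch of what that branch of create_model does (a paraphrase of the linked code, not a verbatim copy; the preset tuples for 256 and 128 are my recollection of the guided-diffusion defaults, so treat them as assumptions):

```python
# Sketch of the channel_mult resolution logic in create_model.
# Paraphrased from glide_text2im/model_creation.py; the 256/128 presets
# are assumed from guided-diffusion defaults, not verified against GLIDE.
def resolve_channel_mult(image_size, channel_mult=""):
    if channel_mult == "":
        # Only a few image sizes have built-in presets.
        if image_size == 256:
            return (1, 1, 2, 2, 4, 4)
        elif image_size == 128:
            return (1, 1, 2, 3, 4)
        elif image_size == 64:
            return (1, 2, 3, 4)
        else:
            raise ValueError(f"unsupported image size: {image_size}")
    # A non-empty string bypasses the presets entirely.
    return tuple(int(ch_mult) for ch_mult in channel_mult.split(","))
```

The point is that a non-empty channel_mult string never reaches the raise at line 142; it goes straight to the split/parse branch.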

jamahun commented 2 years ago

Thanks so much for your response @woctezuma. I did see this part of the script but didn't really know how I could change it. I've got a very basic understanding of Python and programming languages in general. Any chance you could give me an example of how to change channel_mult to a non-empty string?

woctezuma commented 2 years ago

I don't know for sure.

However, I can make a few remarks about values which would work with the else statement and pass the assert check.

First, the string should be a sequence of integers separated by commas. For instance, channel_mult = "1,2,3,4" would be correctly parsed and transformed into (1, 2, 3, 4).

Second, the length of the string should be equal to log2(image_size) - 2. For instance, for an image resolution of 64, the length of the tuple is log2(64) - 2 = 6 - 2 = 4. This is consistent with the tuple mentioned above, i.e. (1, 2, 3, 4).

For an image resolution of 1024, the length of the tuple should be log2(1024) - 2 = 10 - 2 = 8.
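The length rule above can be checked with a couple of lines of standard-library Python:

```python
import math

# Required channel_mult length for a given resolution: log2(image_size) - 2.
def required_length(image_size):
    return int(math.log2(image_size)) - 2

print(required_length(64))    # 4, matching (1, 2, 3, 4)
print(required_length(256))   # 6
print(required_length(1024))  # 8
```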

Personally, based on the examples mentioned above, I would try one of the following tuples:

That would require an input string:
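(The candidate tuples and input strings from the original comment did not survive here. Purely as an illustration of the shape such a string would take, and not the commenter's actual suggestion, a length-8 value could look like this:)

```python
# Hypothetical example only: any comma-separated string of 8 integers
# satisfies the length requirement for image_size=1024; whether the
# resulting model would train or sample well is a separate question.
candidate = "1,1,2,2,3,3,4,4"

parsed = tuple(int(ch_mult) for ch_mult in candidate.split(","))
assert len(parsed) == 8  # log2(1024) - 2
print(parsed)  # (1, 1, 2, 2, 3, 3, 4, 4)
```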

unixpickle commented 2 years ago

We haven't trained an upsampler for higher resolutions. You can't just change the code and load the old upsampler model; it won't work, because it was trained for 256x256.

People on Twitter have been using third-party upsamplers / image super resolution models.

dqj5182 commented 3 months ago

Any recommendations on third-party upsamplers / image super resolution models?

woctezuma commented 3 months ago

> Any recommendations on third-party upsamplers / image super resolution models?

Maybe https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler from last year.