keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

How to create a Variational Autoencoder (VAE) for images? #5084

Closed bernardohenz closed 7 years ago

bernardohenz commented 7 years ago

Hello, I am trying to create a Variational Autoencoder to work on images. The example in the repository treats an image as a one-dimensional array; how can I modify the example to work, for instance, with images of shape (None, 3, 64, 64)? I've tried to do so without success, particularly on the Lambda layer:


def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(batch_size, 3, 64, 64), mean=0.,
                              std=epsilon_std)
    return z_mean + K.exp(z_log_var / 2) * epsilon

z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])

What must the output_shape of the Lambda layer be?

Thanks guys

joelthchao commented 7 years ago

Maybe I am wrong, but I don't think the hidden layers need to stay 2D. You can still input 2D images, use convolution and flatten layers to produce a 1D representation (as any CNN does), model z there, then use reshape and deconvolution layers to decode back to an output image.
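As a rough sketch of the shape bookkeeping that pipeline implies (pure size arithmetic, not Keras code; all kernel, stride, and channel sizes below are made-up examples):

```python
def conv_out(size, kernel, stride):
    # 'valid' convolution output size: floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

def deconv_out(size, kernel, stride):
    # transposed ('de')convolution inverts the formula above
    return (size - 1) * stride + kernel

# Encoder: 64x64 image -> two stride-2 convolutions -> flatten -> z
h = conv_out(64, 2, 2)    # 32
h = conv_out(h, 2, 2)     # 16
flat = h * h * 32         # flatten 16x16 maps with 32 channels -> 8192 units
# ... Dense layers would map `flat` to z_mean / z_log_var and back ...

# Decoder: reshape to 16x16x32, then two stride-2 deconvolutions
h = deconv_out(16, 2, 2)  # 32
h = deconv_out(h, 2, 2)   # 64 -- back to the input resolution
```

Note the Flatten/Dense step in the middle is exactly what ties the model to one input resolution.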

bernardohenz commented 7 years ago

Yeah, that should be my way out of this. I wanted to avoid Dense, Flatten, and Reshape layers, since they constrain the size of the input images. What I am saying is: if I rely only on convolution layers, I can input (3,64,64) images as well as (3,128,128) images, and the model would still work. But I don't know whether it is possible to build a VAE this way =/

joelthchao commented 7 years ago

The Lambda layer's output_shape can be a function of the input shape, so it should be possible to make this work in your situation.

patyork commented 7 years ago

You can train a conv model that works on a certain image size (say, 32x32) and take subsamples/crops/strides of size 32x32 from larger images to feed through. E.g. run 4 32x32 crops of a 64x64 image (its four quadrants), which gives you 4 outputs that you can stitch back together.
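The crop-and-stitch idea can be sketched with plain NumPy (the identity stands in for the trained model here, purely for illustration):

```python
import numpy as np

# Split a 64x64 image into four 32x32 quadrants, run each through the
# "model" (identity stand-in), then reassemble the four outputs.
image = np.arange(64 * 64 * 3).reshape(64, 64, 3)

crops = [image[r:r + 32, c:c + 32]          # the four quadrants
         for r in (0, 32) for c in (0, 32)]

outputs = [crop for crop in crops]          # model's prediction would go here

top = np.concatenate(outputs[0:2], axis=1)  # left | right of the upper half
bottom = np.concatenate(outputs[2:4], axis=1)
stitched = np.concatenate([top, bottom], axis=0)
```

With an identity "model" the stitched result round-trips to the original image, which checks the crop indexing.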

In addition, the deconvolution layer might be something to play around with for image VAEs; it can be seen as a special application of convolution that increases the output size relative to the input size (e.g. a 12x12x64 input from a convolution layer can be deconv'd into something like a 16x16x32 output).
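For intuition, a stride-1, no-padding transposed convolution can be written as scatter-adds of the kernel; with an (assumed) 5x5 kernel it turns a 12x12 feature map into a 16x16 one, consistent with the sizes mentioned above:

```python
import numpy as np

# One channel of a 12x12 feature map and a 5x5 kernel (the kernel size
# is an assumption chosen so that 12 -> 16).
x = np.random.rand(12, 12)
k = np.random.rand(5, 5)

# Transposed convolution: each input pixel scatters a scaled copy of
# the kernel into the output. Output size = (12 - 1) * 1 + 5 = 16.
out = np.zeros((16, 16))
for i in range(12):
    for j in range(12):
        out[i:i + 5, j:j + 5] += x[i, j] * k
```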

bstriner commented 7 years ago

The shape of your noise is exactly the shape of your z_mean and z_sigma, so you can use shape=K.shape(z_mean). For output_shape, just return the shape of the first input (or second input, it should be the same). I typically use a merge instead of a Lambda for this layer, but either should work.

Something like this should work for any input shape:

def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=K.shape(z_mean), mean=0., std=epsilon_std)
    return z_mean + K.exp(z_log_var / 2) * epsilon

z = Lambda(sampling, output_shape=lambda arg: arg[0])([z_mean, z_log_var])

Cheers