ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper

WaveNet implementation for image generation (based on this repository) #129

Open Zeta36 opened 8 years ago

Zeta36 commented 8 years ago

As I mentioned in #125, I've made a revision of this project where I feed the model the gray-scale channel of a dataset of images (similar, but not equal, to what DeepMind did with PixelCNN).

I modified the TextReader I made for the WaveNet text generation version (#117). Specifically, I made an ImageReader. This reader can load image files (using Pillow: PIL), rescale them to, for example, 64x64 (because I have no GPU and no time to waste :P) via pic.resize((32,24), Image.ANTIALIAS), and finally extract the 8-bit grayscale channel (using convert('L')).
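For illustration, here is a minimal sketch of what such a reader does (the names below are mine, not necessarily the ones in the linked repo):

```python
import numpy as np
from PIL import Image

def load_grayscale_sequence(path, size=(64, 64)):
    """Load an image, downscale it, and return its 8-bit grayscale
    pixels as a flat 1-D sequence, one WaveNet 'sample' per pixel."""
    pic = Image.open(path)
    pic = pic.resize(size, Image.ANTIALIAS)  # small images keep training cheap
    gray = pic.convert('L')                  # 8-bit grayscale, values 0..255
    return np.asarray(gray, dtype=np.int32).flatten()
```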

I really love the theoretical idea behind the generative model using dilated causal convolution layers!! The results are wonderful and easily extrapolated.

I ran the following test with this image implementation:

1) I used a small dataset of 12 similar (but different) images of a person posing in different ways (the woman in the pictures is my mother).

[image: 20140211_202000]

Above, one of the original pictures; below, the scaled gray-scale image fed in at training time:

[image: example_train_file]

2) I trained the model using SAMPLE_SIZE = 4096 and a learning rate of 0.001.

3) The loss dropped gradually, and after 4k steps it was around ~0.09. I then trained again with learning rate = 0.0001, and after another 4k steps the loss was around ~0.01.

4) I always generated the images using WINDOW = 4096 and always 4096 samples (64x64 pixels, so I can save a full image).

Well:

5) If I leave the first sample random: waveform = np.random.randint(quantization_channels, size=(1,)).tolist(), or if I let the image grow by sampling from the probability distribution instead of taking the argmax: sample = np.random.choice(np.arange(quantization_channels), p=prediction)

I just get ugly noise:

[images: mam3000_06, mam3500_02]

But, and this is interesting:

6) If I set the first sample to the first pixel of an image from the dataset, for example waveform = [169], and use argmax to select every sample: sample = np.argmax(prediction) (a sketch contrasting both strategies follows below the list)

I can generate each and every one of the posing images from the dataset!!

[images: mam9000 series, seven generated samples]
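For clarity, here is a small sketch contrasting the two sampling strategies from points 5) and 6), plus a helper to save the 4096 generated samples back as a 64x64 picture. The function names are illustrative, and I assume prediction is the network's softmax output over quantization_channels:

```python
import numpy as np
from PIL import Image

quantization_channels = 256  # 8-bit grayscale

def next_sample_stochastic(prediction):
    # Point 5): sample from the predicted distribution. On this tiny
    # dataset this just produces noise.
    return np.random.choice(np.arange(quantization_channels), p=prediction)

def next_sample_greedy(prediction):
    # Point 6): always take the most likely value. Deterministic, so
    # seeding with a known first pixel replays a memorized image.
    return np.argmax(prediction)

def save_samples_as_image(samples, path, size=(64, 64)):
    # Reshape the flat sequence of generated 8-bit samples back into
    # a 2-D grayscale picture.
    pixels = np.asarray(samples, dtype=np.uint8).reshape(size)
    Image.fromarray(pixels, mode='L').save(path)
```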

The model is wonderful, and it is able to memorize a lot of information in its nodes!! I'm pretty sure something similar happens in our memory (in the brain). When we see an image, a "training" process probably takes place in some mini neural network in our brain, and later, when we need to remember that image (or text, or sound, ...), another sub-network of our brain must generate the memorized information and pass the data to another subnet of the brain.

Well. I hope this can help in some way with the project we are trying to build here.

I think I will try to implement the globally conditioned WaveNet model, so that I can select the image I want to recover just by passing the ID of the dataset image.
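As a rough idea of what that conditioning could look like, here is my own sketch of the paper's gated activation with a global condition, z = tanh(W_f*x + V_f*h) * sigmoid(W_g*x + V_g*h), where h is a learned per-image embedding. None of these names exist in the repo; it is a sketch under those assumptions:

```python
import tensorflow as tf

num_images = 12      # one learnable embedding per training image
embedding_size = 16  # assumed embedding width

def add_global_condition(filter_out, gate_out, image_id, channels):
    """Add a per-image condition inside one gated activation unit.
    `filter_out`/`gate_out` are the dilated causal convolution outputs
    of a layer, shape [batch, time, channels]; `image_id` is an int
    tensor of shape [batch]."""
    embeddings = tf.get_variable('image_embeddings',
                                 [num_images, embedding_size])
    h = tf.nn.embedding_lookup(embeddings, image_id)  # [batch, embedding_size]
    # Project the embedding to the layer width (V_f*h and V_g*h) and
    # insert a time axis so it broadcasts across all time steps.
    v_f = tf.expand_dims(tf.layers.dense(h, channels, use_bias=False), 1)
    v_g = tf.expand_dims(tf.layers.dense(h, channels, use_bias=False), 1)
    return tf.tanh(filter_out + v_f) * tf.sigmoid(gate_out + v_g)
```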

If somebody wants to play with this, you can get the code here: https://github.com/Zeta36/tensorflow-image-wavenet

Regards, Samu.

nakosung commented 8 years ago

Doesn't it seem to be overfitted? 12 x 4096 is only ~49K values, and WaveNet has many more parameters than that to fit them in. If you can get a similar result with hundreds of photos, it would seem meaningful.
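A back-of-the-envelope version of that argument (all numbers here are illustrative assumptions, not the repo's actual configuration):

```python
# The whole dataset, quantized to 8 bits:
dataset_values = 12 * 64 * 64              # = 49,152 pixel values

# A plausible small WaveNet: 30 dilated layers, 32 channels,
# filter width 2, with separate filter and gate convolutions.
layers, channels, width = 30, 32, 2
params_per_layer = 2 * (width * channels * channels)  # filter + gate
total_params = layers * params_per_layer              # = 122,880 weights

# More parameters than data points: memorization is unsurprising.
print(dataset_values, total_params)
```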