As I mentioned in #125, I've made a revision of this project where I feed the model the grayscale channel of a dataset of images (similar, but not equal, to what DeepMind did with PixelCNN).
I modified the TextReader I made for the WaveNet text generation version (#117) into an ImageReader. This reader loads image files (using Pillow: PIL), rescales them to, for example, 64x64 (because I have no GPU and no time to waste :P) with pic.resize((64, 64), Image.ANTIALIAS), and finally takes the 8-bit grayscale channel with convert('L').
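To make the idea concrete, here is a minimal sketch of what such a reader boils down to (the names are only illustrative, not necessarily the ones in my ImageReader): open the file with Pillow, downscale it, take the 8-bit grayscale channel, and flatten it into a 1-D sequence of values in 0..255 that the network consumes exactly like a quantized audio waveform.

import numpy as np
from PIL import Image

def load_grayscale_sequence(path, size=(64, 64)):
    """Turn one image file into a 1-D 'waveform' of 8-bit pixel values."""
    pic = Image.open(path)
    pic = pic.resize(size, Image.ANTIALIAS)  # downscale: no GPU, no time to waste :P
    gray = pic.convert('L')                  # 8-bit grayscale channel
    # Flatten row by row: each pixel plays the role of one audio sample,
    # already quantized into 256 channels (0..255).
    return np.asarray(gray, dtype=np.int32).reshape(-1)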
I really love the theoretical idea behind a generative model built from dilated causal convolution layers!! The results are wonderful and easily extrapolated to other domains.
I ran the following test with this image implementation:
1) I used a small dataset with 12 similar (but different) images of a person (the woman in the picture is my mother). She is posing in different ways.
Above is one of the original pictures, and below is the scaled grayscale image that is fed in at training time:
2) I trained the model with SAMPLE_SIZE = 4096 and a learning rate of 0.001.
3) The loss dropped gradually, and after only 4k steps it was around ~0.09. I then kept training with a learning rate of 0.0001, and after another 4k steps the loss was around ~0.01.
4) I always generated images with WINDOW = 4096 and exactly 4096 samples (64x64 pixels, so I can save a full image; see the generation sketch after this list).
Well:
5) If I leave the first sample random:
waveform = np.random.randint(quantization_channels, size=(1,)).tolist()
or if I let the image grow by sampling from the predicted probability distribution instead of taking the argmax:
sample = np.random.choice(np.arange(quantization_channels), p=prediction)
I just get ugly noise:
But, and this is interesting:
6) If I set the first sample to the first pixel of one of the images in the dataset, for example:
waveform = [169]
and if I use argmax to select every sample:
sample = np.argmax(prediction)
I can regenerate each and every one of the posing images from the dataset!!
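For clarity, here is a minimal sketch of the generation loop I'm describing; it is not the exact code of the generation script, and predict_next is a stand-in for running the trained network on the current pixel window. Seed the waveform with a known first pixel, pick each next value either by sampling from the predicted distribution or by argmax, and reshape the 4096 generated values into a 64x64 image.

import numpy as np
from PIL import Image

def generate_image(predict_next, first_pixel=169, n_samples=4096,
                   window=4096, quantization_channels=256, use_argmax=True):
    # predict_next(window_of_pixels) -> probability vector over the 256 levels
    waveform = [first_pixel]                  # seed with a known first pixel
    while len(waveform) < n_samples:
        prediction = predict_next(waveform[-window:])
        if use_argmax:
            sample = np.argmax(prediction)    # deterministic: recalls a memorized image
        else:
            sample = np.random.choice(np.arange(quantization_channels), p=prediction)
        waveform.append(int(sample))
    pixels = np.array(waveform, dtype=np.uint8).reshape(64, 64)
    return Image.fromarray(pixels, mode='L')  # save it with .save('generated.png')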
The model is wonderful, and it is able to memorize a lot of information in its nodes!! I'm pretty sure something similar happens in our own memory (in the brain). When we see an image, a kind of "training" process probably takes place in some mini neural network in our brain, and later, when we need to remember that image (or text, or sound, ...), another sub-network of the brain has to regenerate the memorized information and pass it on to other sub-networks.
Well, I hope this can help in some way with the project we are trying to build here.
I think I will try to implement the globally conditioned WaveNet model next, so that I can select the image I want to recover just by passing in its ID from the dataset.
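Just to sketch what I mean by global conditioning (this follows the recipe in the WaveNet paper, not code from my repo, and it is written with tf.keras only for brevity; gated_block and its arguments are illustrative names): the image ID would be embedded and added to the filter and gate of every dilated causal convolution, so the same network can be steered toward one specific memorized image.

import tensorflow as tf

def gated_block(x, image_id_embedding, filters, dilation):
    # x: [batch, time, channels]; image_id_embedding: [batch, 1, embed_dim]
    # z = tanh(Wf*x + Vf*h) * sigmoid(Wg*x + Vg*h), where h is the global condition
    f = tf.keras.layers.Conv1D(filters, 2, dilation_rate=dilation, padding='causal')(x)
    g = tf.keras.layers.Conv1D(filters, 2, dilation_rate=dilation, padding='causal')(x)
    vf = tf.keras.layers.Dense(filters)(image_id_embedding)  # broadcast over time steps
    vg = tf.keras.layers.Dense(filters)(image_id_embedding)
    return tf.tanh(f + vf) * tf.sigmoid(g + vg)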
If somebody wants to play with this, here is where you can get the code: https://github.com/Zeta36/tensorflow-image-wavenet

Regards, Samu.

Doesn't it seem to be overfitted? 12 x 4096 pixels is only 48K values, and WaveNet has many more parameters than that to fit them. If you can get a similar result with hundreds of photos, it would be more meaningful.