kweonwooj opened this issue 6 years ago
Pretty cool notes.
I think to train it you just need two kinds of losses:
1) the normal auto-encoder (reconstruction) loss, and 2) a clustering loss, i.e. embed the input as z(x), find the closest code vector e, and minimize |z(x) - e|.
In the paper they use two separate losses for 2): one moves e toward z(x), and the other moves z(x) toward e:
|sg[z(x)] - e|
and
|z(x) - sg[e]|
where you can think of the stop-gradient "sg" as turning its argument into a constant.
It should be straightforward in PyTorch.
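For what it's worth, here is a minimal PyTorch sketch of that idea. The class name, the dimensions, and the use of `.detach()` as the stop-gradient are my own illustrative choices, not code from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Hypothetical minimal VQ layer: nearest-code lookup plus the two losses above."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)   # e ~ R^(K x D)
        self.beta = beta

    def forward(self, z_e):
        # z_e: (batch, code_dim), the encoder output z(x)
        dists = torch.cdist(z_e, self.codebook.weight)       # (batch, num_codes) L2 distances
        idx = dists.argmin(dim=1)                            # index of the nearest code vector
        e = self.codebook(idx)                               # (batch, code_dim) chosen codes

        codebook_loss = F.mse_loss(e, z_e.detach())          # |sg[z(x)] - e|^2 : moves e toward z(x)
        commitment_loss = F.mse_loss(z_e, e.detach())        # |z(x) - sg[e]|^2 : moves z(x) toward e
        vq_loss = codebook_loss + self.beta * commitment_loss

        # straight-through estimator: copy decoder gradients straight to the encoder
        z_q = z_e + (e - z_e).detach()
        return z_q, vq_loss
```

The normal auto-encoder loss (e.g. `F.mse_loss(decoder(z_q), x)`) is computed outside this layer and simply added to `vq_loss`.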
I'm trying to implement this in TensorFlow 2.0, but I still have some doubts about the training phase. How can the latent space (also called the codebook in later papers) be learned if it is supposed to be discrete? They also propose using an exponential moving average, which seems like it would produce a latent space that is no longer discrete. I would like to see some code examples of this work.
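Regarding the EMA variant mentioned above: the codebook entries themselves are ordinary continuous vectors in R^(K x D); only the assignment of each encoder output to its nearest entry is discrete, so the code vectors can still be updated smoothly. A rough sketch of that exponential-moving-average update, in the same hypothetical PyTorch style as above (variable names, shapes, and the decay value are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def ema_codebook_update(codebook, ema_count, ema_sum, z_e, idx, decay=0.99, eps=1e-5):
    """Hypothetical EMA update on a plain (K, D) codebook tensor: nudge each code
    vector toward the running mean of the encoder outputs z_e assigned to it."""
    K, D = codebook.shape
    one_hot = F.one_hot(idx, K).type_as(z_e)      # (batch, K) assignment matrix

    # per-batch statistics
    count = one_hot.sum(dim=0)                    # (K,)   how many z_e chose each code
    total = one_hot.t() @ z_e                     # (K, D) sum of z_e assigned to each code

    # exponential moving averages of the statistics
    ema_count.mul_(decay).add_(count, alpha=1 - decay)
    ema_sum.mul_(decay).add_(total, alpha=1 - decay)

    # Laplace smoothing so rarely used codes do not divide by ~zero
    n = ema_count.sum()
    stable_count = (ema_count + eps) / (n + K * eps) * n

    codebook.copy_(ema_sum / stable_count.unsqueeze(1))
```

The decoder still only ever sees one of the K code vectors per latent position; it is the values of those K vectors that drift continuously during training, so the EMA update does not break the discreteness of the latent space.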
Isn't the loss function
Abstract
Details
Introduction
Related Work
VQ-VAE
- e ~ R^(K x D), where K is the size of the discrete latent space and D is the dimensionality of each latent embedding vector e_i
- the posterior q(z | x), where x is an input and z is a latent variable, is defined as one-hot: the code e is chosen via the discretization bottleneck
- an L2 error moves the embedding vector e_i toward the encoder output z_e(x)
- sg stands for stop-gradient: the forward pass is the identity and the backward pass is zero
- beta = 0.25, but values between 0.1 and 2.0 have no big impact
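To make the pieces above concrete, the posterior and the full training objective can be written out as follows (my transcription, using the same notation as the notes):

```latex
q(z = k \mid x) =
\begin{cases}
1 & \text{for } k = \arg\min_j \lVert z_e(x) - e_j \rVert_2 \\
0 & \text{otherwise}
\end{cases}

L = \log p\big(x \mid z_q(x)\big)
  + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2
  + \beta \, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2
```

The first term is the reconstruction loss, the second moves the selected code e toward the encoder output z_e(x), and the third is the commitment term weighted by beta.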
Experiments
- 128 x 128 x 3 image into a 32 x 32 x 1 discrete latent space
- 84 x 84 x 3 image -> 21 x 21 x 1 latent space
- x64 compression compared to the original sound wave
Personal Thoughts
Link : https://arxiv.org/pdf/1711.00937.pdf
Authors : van den Oord et al., 2017