This paper introduces the Vector Quantised-Variational AutoEncoder (VQ-VAE).
It uses ideas from vector quantization to learn discrete latent representations.
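To make the vector quantization idea concrete, here is a minimal sketch of the lookup step only: each continuous encoder output is replaced by the index of its nearest codebook vector, and that index is the discrete latent code. The names `squared_distance` and `quantize`, the toy codebook, and the toy encoder outputs are assumptions made for illustration and do not come from the paper's code. Training details such as the codebook and commitment losses and the straight-through gradient are deliberately left out.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (index, vector) of the codebook entry nearest to z_e.

    The index is the discrete code; the vector is what a decoder
    would receive in place of the continuous encoder output.
    """
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(z_e, codebook[k]))
    return index, list(codebook[index])


# Hypothetical toy values: 3 codebook vectors in a 2-dimensional latent space.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_outputs = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]

codes = [quantize(z, codebook)[0] for z in encoder_outputs]
print(codes)  # [1, 2, 0]
```

Because every latent is one of a finite set of indices, a separate prior (autoregressive, in the paper) can then be fit over these index sequences.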
Abstract
Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of “posterior collapse” - where the latents are ignored when they are paired with a powerful autoregressive decoder - typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
Keywords
VAE, Autoencoder, unsupervised learning
Paper link
https://arxiv.org/abs/1711.00937
Presentation link
https://github.com/isingmodel/TIL/blob/main/2022/05_07_Neural_Discrete_Representation_Learning/Neural%20Discrete%20Representation%20Learning.pdf
Video link
https://youtu.be/tF1WSN-11PQ