Code: https://mael-zys.github.io/T2M-GPT/ paper link: https://arxiv.org/abs/2301.06052
Architecture
Motion VQ-VAE: learns a mapping between motion data and discrete code sequences.
T2M-GPT: generates code indices conditioned on the text description.
The decoder of the Motion VQ-VAE recovers the motion from the generated code indices.
CLIP: extracts the text embedding c.
Text-to-motion generation can thus be formulated as autoregressive next-index prediction.
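A minimal sketch of this two-stage inference pipeline; `clip_model`, `gpt`, `vq_decoder`, and the End-token index `end_idx` are hypothetical stand-ins for the trained components, not the authors' API:

```python
import torch

@torch.no_grad()
def generate_motion(text, clip_model, gpt, vq_decoder, max_len=50, end_idx=512):
    # Stage 1 condition: text embedding c from CLIP (hypothetical interface)
    c = clip_model.encode_text(text)
    indices = torch.empty(0, dtype=torch.long)        # code indices generated so far
    for _ in range(max_len):
        logits = gpt(c, indices)                      # logits over the codebook (+ End token)
        next_idx = logits[-1].argmax().view(1)        # greedy pick; sampling also works
        if next_idx.item() == end_idx:                # a dedicated End token stops generation
            break
        indices = torch.cat([indices, next_idx])
    # Stage 2: the Motion VQ-VAE decoder maps code indices back to a motion sequence
    return vq_decoder(indices)
```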
Causal attention with mask: the current token is not allowed to attend to future positions.
e.g. mask = torch.tril(torch.ones(size, size)); mask = mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, 0.0)
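A runnable sketch of the mask in use, added to the attention logits before the softmax (single head, no projections; shapes assumed for illustration):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    # x: (batch, seq_len, dim)
    B, T, D = x.shape
    scores = x @ x.transpose(-2, -1) / D ** 0.5       # (B, T, T) attention logits
    mask = torch.tril(torch.ones(T, T))               # 1 on/below the diagonal
    mask = mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, 0.0)
    attn = F.softmax(scores + mask, dim=-1)           # row t attends only to positions <= t
    return attn @ x

out = causal_self_attention(torch.randn(2, 5, 8))
```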
Loss
sg: the stop-gradient operator blocks gradients from flowing through its argument during backpropagation, stabilizing codebook learning in the VQ-VAE.
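A tiny demonstration of sg via PyTorch's `.detach()` on toy tensors (not from the paper's code):

```python
import torch

z = torch.tensor([1.0, 2.0], requires_grad=True)   # stand-in for the encoder output
q = torch.tensor([1.5, 1.5], requires_grad=True)   # stand-in for the nearest codebook vector

loss = torch.mean((z - q.detach()) ** 2)           # sg(q): q is treated as a constant
loss.backward()
print(z.grad)   # tensor([-0.5000,  0.5000]) -- the encoder still receives gradients
print(q.grad)   # None -- sg blocked the gradient to the codebook
```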
Reconstruction loss:
loss_part1 = smooth_l1_loss(X, X_re)        # motion reconstruction (torch.nn.functional.smooth_l1_loss)
loss_part2 = smooth_l1_loss(V(X), V(X_re))  # velocity term; V(X) denotes the velocity of X
commitment_loss = torch.mean((encoded - quantized.detach()) ** 2)    # sg on the codebook side; updates the encoder
quantization_loss = torch.mean((encoded.detach() - quantized) ** 2)  # sg on the encoder side; updates the codebook
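Putting the pieces together, a minimal sketch of the full VQ-VAE objective described above; the velocity/quantize helpers and the alpha/beta weights are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def velocity(X):
    # V(X): first-order temporal differences along the time axis
    return X[:, 1:] - X[:, :-1]

def quantize(encoded, codebook):
    # encoded: (B, T, D); codebook: (K, D). Nearest-neighbour code lookup.
    dists = (encoded.pow(2).sum(-1, keepdim=True)
             - 2 * encoded @ codebook.t()
             + codebook.pow(2).sum(-1))                  # (B, T, K) squared distances
    indices = dists.argmin(dim=-1)                       # (B, T) discrete code indices
    quantized = codebook[indices]                        # (B, T, D)
    # straight-through estimator: the decoder's gradient bypasses the argmin
    return encoded + (quantized - encoded).detach(), quantized, indices

def vqvae_loss(X, X_re, encoded, quantized, alpha=0.5, beta=0.25):
    recon = (F.smooth_l1_loss(X_re, X)
             + alpha * F.smooth_l1_loss(velocity(X_re), velocity(X)))
    commitment = torch.mean((encoded - quantized.detach()) ** 2)     # updates the encoder
    codebook_loss = torch.mean((encoded.detach() - quantized) ** 2)  # updates the codebook
    return recon + codebook_loss + beta * commitment

# usage: encoded = encoder(X); st, quantized, _ = quantize(encoded, codebook)
#        X_re = decoder(st);   loss = vqvae_loss(X, X_re, encoded, quantized)
```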