Code: https://mael-zys.github.io/T2M-GPT/ paper link: https://arxiv.org/abs/2301.06052
Architecture
Motion VQ-VAE: learns a mapping between motion data and discrete code sequences.
T2M-GPT: generates code indices conditioned on the text description.
The decoder of the Motion VQ-VAE recovers the motion from the generated code indices.
CLIP: extracts the text embedding c.
Text-to-motion generation can thus be formulated as autoregressive next-index prediction.
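A minimal sketch of this two-stage inference pipeline; `clip_model`, `gpt`, `vq_decoder`, and the End-token index `end_idx` are hypothetical stand-ins for the trained components, not the authors' API:

```python
import torch

@torch.no_grad()
def generate_motion(text, clip_model, gpt, vq_decoder, max_len=50, end_idx=512):
    # Stage 1 condition: text embedding c from CLIP (hypothetical interface)
    c = clip_model.encode_text(text)
    indices = torch.empty(0, dtype=torch.long)        # code indices generated so far
    for _ in range(max_len):
        logits = gpt(c, indices)                      # logits over the codebook (+ End token)
        next_idx = logits[-1].argmax().view(1)        # greedy pick; sampling also works
        if next_idx.item() == end_idx:                # a dedicated End token stops generation
            break
        indices = torch.cat([indices, next_idx])
    # Stage 2: the Motion VQ-VAE decoder maps code indices back to a motion sequence
    return vq_decoder(indices)
```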
Causal attention with mask: the current token is not allowed to attend to future positions.
e.g. mask = torch.tril(torch.ones(size, size)); mask = mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, 0.0)
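A runnable sketch of the mask in use, added to the attention logits before the softmax (single head, no projections; shapes assumed for illustration):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    # x: (batch, seq_len, dim)
    B, T, D = x.shape
    scores = x @ x.transpose(-2, -1) / D ** 0.5       # (B, T, T) attention logits
    mask = torch.tril(torch.ones(T, T))               # 1 on/below the diagonal
    mask = mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, 0.0)
    attn = F.softmax(scores + mask, dim=-1)           # row t attends only to positions <= t
    return attn @ x

out = causal_self_attention(torch.randn(2, 5, 8))
```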
Loss
sg: the stop-gradient operator blocks gradients from flowing through its argument during backpropagation, stabilizing codebook learning in the VQ-VAE.
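A tiny demonstration of sg via PyTorch's `.detach()` on toy tensors (not from the paper's code):

```python
import torch

z = torch.tensor([1.0, 2.0], requires_grad=True)   # stand-in for the encoder output
q = torch.tensor([1.5, 1.5], requires_grad=True)   # stand-in for the nearest codebook vector

loss = torch.mean((z - q.detach()) ** 2)           # sg(q): q is treated as a constant
loss.backward()
print(z.grad)   # tensor([-0.5000,  0.5000]) -- the encoder still receives gradients
print(q.grad)   # None -- sg blocked the gradient to the codebook
```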
Reconstruction loss:
loss_part1 = smooth_l1_loss(X, X_re)        # motion reconstruction (torch.nn.functional.smooth_l1_loss)
loss_part2 = smooth_l1_loss(V(X), V(X_re))  # velocity term; V(X) denotes the velocity of X
commitment_loss = torch.mean((encoded - quantized.detach()) ** 2)    # sg on the codebook side; updates the encoder
quantization_loss = torch.mean((encoded.detach() - quantized) ** 2)  # sg on the encoder side; updates the codebook
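Putting the pieces together, a minimal sketch of the full VQ-VAE objective described above; the velocity/quantize helpers and the alpha/beta weights are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def velocity(X):
    # V(X): first-order temporal differences along the time axis
    return X[:, 1:] - X[:, :-1]

def quantize(encoded, codebook):
    # encoded: (B, T, D); codebook: (K, D). Nearest-neighbour code lookup.
    dists = (encoded.pow(2).sum(-1, keepdim=True)
             - 2 * encoded @ codebook.t()
             + codebook.pow(2).sum(-1))                  # (B, T, K) squared distances
    indices = dists.argmin(dim=-1)                       # (B, T) discrete code indices
    quantized = codebook[indices]                        # (B, T, D)
    # straight-through estimator: the decoder's gradient bypasses the argmin
    return encoded + (quantized - encoded).detach(), quantized, indices

def vqvae_loss(X, X_re, encoded, quantized, alpha=0.5, beta=0.25):
    recon = (F.smooth_l1_loss(X_re, X)
             + alpha * F.smooth_l1_loss(velocity(X_re), velocity(X)))
    commitment = torch.mean((encoded - quantized.detach()) ** 2)     # updates the encoder
    codebook_loss = torch.mean((encoded.detach() - quantized) ** 2)  # updates the codebook
    return recon + codebook_loss + beta * commitment

# usage: encoded = encoder(X); st, quantized, _ = quantize(encoded, codebook)
#        X_re = decoder(st);   loss = vqvae_loss(X, X_re, encoded, quantized)
```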