Pengfei Wei<sup>1</sup>, Lingdong Kong<sup>1,2</sup>, Xinghua Qu<sup>1</sup>, Yi Ren<sup>1</sup>, Zhiqiang Xu<sup>3</sup>, Jing Jiang<sup>4</sup>, Xiang Yin<sup>1</sup>

<sup>1</sup>ByteDance AI Lab &nbsp; <sup>2</sup>National University of Singapore &nbsp; <sup>3</sup>MBZUAI &nbsp; <sup>4</sup>University of Technology Sydney
TranSVAE is a disentanglement framework designed for unsupervised video domain adaptation. It aims to disentangle the domain information from the data during the adaptation process. We consider the generation of cross-domain videos from two sets of latent factors: one encoding the static, domain-related information and the other encoding the temporal, semantic-related information. Objectives are imposed to constrain these latent factors, achieving domain disentanglement and transfer.
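The two-factor generative view above can be illustrated with a toy sketch: a clip is encoded into a single static code shared across all frames and a sequence of per-frame dynamic codes, and each frame is reconstructed from the pair. This is a minimal NumPy sketch with made-up names and random linear maps, not the actual TranSVAE encoders or objectives:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, W_d, W_t):
    """Toy encoders: a static code z_d shared across the clip
    (mean-pooled over time) and per-frame dynamic codes z_t."""
    z_t = np.tanh(frames @ W_t)          # (T, k) time-varying codes
    z_d = np.tanh(frames.mean(0) @ W_d)  # (k,) time-invariant code
    return z_d, z_t

def decode(z_d, z_t, W_out):
    """Reconstruct each frame from the concatenation [z_d ; z_t[i]]."""
    T = z_t.shape[0]
    z = np.concatenate([np.tile(z_d, (T, 1)), z_t], axis=1)  # (T, 2k)
    return z @ W_out

T, D, k = 8, 16, 4                       # frames, feature dim, latent dim (toy)
W_d = rng.normal(size=(D, k))
W_t = rng.normal(size=(D, k))
W_out = rng.normal(size=(2 * k, D))

clip = rng.normal(size=(T, D))           # stand-in for frame features
z_d, z_t = encode(clip, W_d, W_t)
recon = decode(z_d, z_t, W_out)
print(z_d.shape, z_t.shape, recon.shape)  # (4,) (8, 4) (8, 16)
```

In the actual framework, additional objectives constrain `z_d` to carry only domain information and `z_t` to carry the domain-invariant semantics; the sketch only shows the factorized encode/decode structure.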
Col1: Original sequences ("Human" $\mathcal{D}=\mathbf{P}_1$ and "Alien" $\mathcal{D}=\mathbf{P}_2$); Col2: Sequence reconstructions; Col3: Reconstructed sequences using $z_1^{\mathcal{D}},...,z_T^{\mathcal{D}}$; Col4: Domain transferred sequences with exchanged $z_d^{\mathcal{D}}$.
Visit our project page to explore more details of this work. :paw_prints:
| Conceptual Comparison | Graphical Model | Framework Overview |
|:-:|:-:|:-:|
Please refer to INSTALL.md for the installation details.
Please refer to DATA_PREPARE.md for the details to prepare the UCF101, HMDB51, Jester, Epic-Kitchens, and Sprites datasets.
Please refer to GET_STARTED.md to learn more about using this codebase.
| Task | Source-only | DANN | ADDA | TA3N | CoMix | TranSVAE (Ours) | Oracle |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| JS → JT | 51.5 | 55.4 | 52.3 | 55.5 | 64.7 | 66.1 | 95.6 |
UCF101 → HMDB51
HMDB51 → UCF101
Domain Transfer Example
| Source (Original) | Target (Original) | Source (Original) | Target (Original) |
|:-:|:-:|:-:|:-:|
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) |
| Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) |
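The last row of the table, where the static code of one domain is paired with the dynamic codes of the other, can be sketched as follows. This is a hypothetical linear encoder/decoder with invented names, not the paper's networks; it only demonstrates the swapping mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
D, k = 16, 4                      # frame-feature and latent sizes (toy values)
W_d = rng.normal(size=(D, k))
W_t = rng.normal(size=(D, k))
W_out = rng.normal(size=(2 * k, D))

def split_latents(clip):
    """Toy split: static z_d (pooled over time) vs. per-frame z_t."""
    return np.tanh(clip.mean(0) @ W_d), np.tanh(clip @ W_t)

def reconstruct(z_d, z_t):
    """Decode each frame from the static code paired with its dynamic code."""
    z = np.concatenate([np.tile(z_d, (len(z_t), 1)), z_t], axis=1)
    return z @ W_out

src = rng.normal(size=(8, D))     # "source" clip, 8 frames
tgt = rng.normal(size=(12, D))    # "target" clip, 12 frames
zd_s, zt_s = split_latents(src)
zd_t, zt_t = split_latents(tgt)

# Domain transfer: source statics (z_d^S) driven by target dynamics (z_t^T).
transfer = reconstruct(zd_s, zt_t)
print(transfer.shape)             # (12, 16): target's frame count, source's statics
```

The transferred sequence inherits the frame count (and, with trained networks, the motion content) of the dynamics donor, while the statics donor determines the domain-specific appearance.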
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
We acknowledge the use of the following public resources during the course of this work: UCF101, HMDB51, Jester, Epic-Kitchens, Sprites, I3D, and TRN.
If you find this work helpful, please kindly consider citing our paper:
```bibtex
@inproceedings{wei2023transvae,
  title     = {Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective},
  author    = {Wei, Pengfei and Kong, Lingdong and Qu, Xinghua and Ren, Yi and Xu, Zhiqiang and Jiang, Jing and Yin, Xiang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
}
```