ldkong1205 / TranSVAE

[NeurIPS 2023] Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective

Unsupervised Video Domain Adaptation for Action Recognition:
A Disentanglement Perspective

Pengfei Wei$^{1}$   Lingdong Kong$^{1,2}$   Xinghua Qu$^{1}$   Yi Ren$^{1}$   Zhiqiang Xu$^{3}$   Jing Jiang$^{4}$   Xiang Yin$^{1}$
$^{1}$ByteDance AI Lab   $^{2}$National University of Singapore   $^{3}$MBZUAI   $^{4}$University of Technology Sydney

NeurIPS 2023


TranSVAE is a disentanglement framework designed for unsupervised video domain adaptation. It aims to disentangle domain information from the data during adaptation. We model the generation of cross-domain videos from two sets of latent factors: one encoding static, domain-related information and the other encoding temporal, semantic-related information. Training objectives constrain these latent factors to achieve domain disentanglement and transfer.
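
As a rough illustration of the two-factor generation described above, one way to write it is the standard sequential-VAE factorization, with a static latent $z_d^{\mathcal{D}}$ shared by all frames and one dynamic latent $z_t^{\mathcal{D}}$ per frame. This form is an assumption made for illustration; the exact conditioning used in the paper may differ.

$$
p\left(x_{1:T}^{\mathcal{D}}, z_{1:T}^{\mathcal{D}}, z_d^{\mathcal{D}}\right) = p\left(z_d^{\mathcal{D}}\right) \prod_{t=1}^{T} p\left(z_t^{\mathcal{D}} \mid z_{<t}^{\mathcal{D}}\right) p\left(x_t^{\mathcal{D}} \mid z_d^{\mathcal{D}}, z_t^{\mathcal{D}}\right)
$$

Here $x_{1:T}^{\mathcal{D}}$ is a $T$-frame video from domain $\mathcal{D}$. Under this reading, exchanging $z_d^{\mathcal{D}}$ between two videos while keeping $z_{1:T}^{\mathcal{D}}$ fixed changes the domain appearance but not the motion.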

Col1: Original sequences ("Human" $\mathcal{D}=\mathbf{P}_1$ and "Alien" $\mathcal{D}=\mathbf{P}_2$); Col2: Sequence reconstructions; Col3: Reconstructed sequences using $z_1^{\mathcal{D}},...,z_T^{\mathcal{D}}$; Col4: Domain transferred sequences with exchanged $z_d^{\mathcal{D}}$.

Visit our project page for more details.




Figures: Conceptual Comparison, Graphical Model, Framework Overview


Installation

Please refer to INSTALL.md for installation details.

Data Preparation

Please refer to DATA_PREPARE.md for details on preparing the UCF101, HMDB51, Jester, Epic-Kitchens, and Sprites datasets.

Getting Started

Please refer to GET_STARTED.md to learn how to use this codebase.

Main Results

UCF101 - HMDB51

| Method | Backbone | U101 → H51 | H51 → U101 | Average |
| --- | --- | --- | --- | --- |
| DANN (JMLR'16) | ResNet-101 | 75.28 | 76.36 | 75.82 |
| JAN (ICML'17) | ResNet-101 | 74.72 | 76.69 | 75.71 |
| AdaBN (PR'18) | ResNet-101 | 72.22 | 77.41 | 74.82 |
| MCD (CVPR'18) | ResNet-101 | 73.89 | 79.34 | 76.62 |
| TA3N (ICCV'19) | ResNet-101 | 78.33 | 81.79 | 80.06 |
| ABG (MM'20) | ResNet-101 | 79.17 | 85.11 | 82.14 |
| TCoN (AAAI'20) | ResNet-101 | 87.22 | 89.14 | 88.18 |
| MA2L-TD (WACV'22) | ResNet-101 | 85.00 | 86.59 | 85.80 |
| Source-only | I3D | 80.27 | 88.79 | 84.53 |
| DANN (JMLR'16) | I3D | 80.83 | 88.09 | 84.46 |
| ADDA (CVPR'17) | I3D | 79.17 | 88.44 | 83.81 |
| TA3N (ICCV'19) | I3D | 81.38 | 90.54 | 85.96 |
| SAVA (ECCV'20) | I3D | 82.22 | 91.24 | 86.73 |
| CoMix (NeurIPS'21) | I3D | 86.66 | 93.87 | 90.22 |
| CO2A (WACV'22) | I3D | 87.78 | 95.79 | 91.79 |
| TranSVAE (Ours) | I3D | 87.78 | 98.95 | 93.37 |
| Oracle | I3D | 95.00 | 96.85 | 95.93 |


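Jester

| Task | Source-only | DANN | ADDA | TA3N | CoMix | TranSVAE (Ours) | Oracle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\mathbf{J}_S$ → $\mathbf{J}_T$ | 51.5 | 55.4 | 52.3 | 55.5 | 64.7 | 66.1 | 95.6 |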


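Epic-Kitchens

| Task | Source-only | DANN | ADDA | TA3N | CoMix | TranSVAE (Ours) | Oracle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| D1 → D2 | 32.8 | 37.7 | 35.4 | 34.2 | 42.9 | 50.5 | 64.0 |
| D1 → D3 | 34.1 | 36.6 | 34.9 | 37.4 | 40.9 | 50.3 | 63.7 |
| D2 → D1 | 35.4 | 38.3 | 36.3 | 40.9 | 38.6 | 50.3 | 57.0 |
| D2 → D3 | 39.1 | 41.9 | 40.8 | 42.8 | 45.2 | 58.6 | 63.7 |
| D3 → D1 | 34.6 | 38.8 | 36.1 | 39.9 | 42.3 | 48.0 | 57.0 |
| D3 → D2 | 35.8 | 42.1 | 41.4 | 44.2 | 49.2 | 58.0 | 64.0 |
| Average | 35.3 | 39.2 | 37.4 | 39.9 | 43.2 | 52.6 | 61.5 |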
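
The Average row can be sanity-checked with a few lines of Python. The snippet below is only a convenience sketch, not part of the codebase: it copies the Source-only and TranSVAE columns from the six-task table above, assumes the Average row is the plain mean over the six tasks, and uses variable names that are purely illustrative.

```python
# Per-task accuracies (%) copied from the table above, ordered as
# D1->D2, D1->D3, D2->D1, D2->D3, D3->D1, D3->D2.
source_only = [32.8, 34.1, 35.4, 39.1, 34.6, 35.8]
transvae = [50.5, 50.3, 50.3, 58.6, 48.0, 58.0]


def average(values):
    """Plain arithmetic mean, rounded to one decimal as in the table."""
    return round(sum(values) / len(values), 1)


source_only_avg = average(source_only)
transvae_avg = average(transvae)

# Absolute improvement of TranSVAE over the source-only baseline, in points.
gain = round(transvae_avg - source_only_avg, 1)

print(source_only_avg, transvae_avg, gain)
```

Both computed means agree with the reported Average row for these two columns.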

Ablation Study



Domain Transfer Example
| Source (Original) | Target (Original) | Source (Original) | Target (Original) |
| --- | --- | --- | --- |
| src_original | tar_original | src_original | tar_original |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}}$ + $\mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}}$ + $\mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}}$ + $\mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}}$ + $\mathbf{z}_t^{\mathcal{T}}$) |
| src_recon | tar_recon | src_recon | tar_recon |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) |
| recon_srcZf | recon_tarZf | recon_srcZf | recon_tarZf |
| Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) |
| recon_srcZt | recon_tarZt | recon_srcZt | recon_tarZt |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) |
| recon_srcZf_tarZt | recon_tarZf_srcZt | recon_srcZf_tarZt | recon_tarZf_srcZt |
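
The latent combinations listed above can be sketched in plain Python. This is a toy illustration, not the repository's code: the helper `compose_latents`, the latent sizes, and the numbers are all hypothetical, and it assumes the "+" in the labels means concatenating the static factor with each per-frame dynamic factor before decoding.

```python
from typing import List, Sequence

Vector = List[float]


def compose_latents(z_d: Sequence[float],
                    z_t_seq: Sequence[Sequence[float]],
                    use_static: bool = True,
                    use_dynamic: bool = True) -> List[Vector]:
    """Build one decoder input per frame by concatenating z_d with each z_t.

    A disabled factor is replaced by a zero vector of the same size,
    mirroring the "+ 0" settings listed above. (Hypothetical helper.)
    """
    static = list(z_d) if use_static else [0.0] * len(z_d)
    frames = []
    for z_t in z_t_seq:
        dynamic = list(z_t) if use_dynamic else [0.0] * len(z_t)
        frames.append(static + dynamic)
    return frames


# Made-up latents: a 2-D static factor and three frames of 2-D dynamic factors.
src_z_d, src_z_t = [1.0, 1.0], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tar_z_d, tar_z_t = [9.0, 9.0], [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]

recon_src = compose_latents(src_z_d, src_z_t)                       # z_d^S + z_t^S
static_only = compose_latents(src_z_d, src_z_t, use_dynamic=False)  # z_d^S + 0
dynamic_only = compose_latents(src_z_d, src_z_t, use_static=False)  # 0 + z_t^S
transferred = compose_latents(src_z_d, tar_z_t)                     # z_d^S + z_t^T
```

In a real model each composed vector would be passed to the decoder to render one frame; the transferred case keeps the source's static factor while following the target's dynamics.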



License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Acknowledgement

We acknowledge the use of the following public resources during the course of this work: UCF101, HMDB51, Jester, Epic-Kitchens, Sprites, I3D, and TRN.


Citation

If you find this work helpful, please consider citing our paper:

@inproceedings{wei2023transvae,
  title = {Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective},
  author = {Wei, Pengfei and Kong, Lingdong and Qu, Xinghua and Ren, Yi and Xu, Zhiqiang and Jiang, Jing and Yin, Xiang},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2023},
}