Pengfei Wei<sup>1</sup>, Lingdong Kong<sup>1,2</sup>, Xinghua Qu<sup>1</sup>, Yi Ren<sup>1</sup>, Zhiqiang Xu<sup>3</sup>, Jing Jiang<sup>4</sup>, Xiang Yin<sup>1</sup>

<sup>1</sup>ByteDance AI Lab &nbsp; <sup>2</sup>National University of Singapore &nbsp; <sup>3</sup>MBZUAI &nbsp; <sup>4</sup>University of Technology Sydney
TranSVAE is a disentanglement framework designed for unsupervised video domain adaptation. It aims to disentangle the domain information from the data during the adaptation process. We consider the generation of cross-domain videos from two sets of latent factors: one encoding the static, domain-related information and the other encoding the temporal, semantic-related information. Objectives are imposed to constrain these latent factors, achieving domain disentanglement and transfer.
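The two-factor generative view above can be illustrated with a toy sketch: a clip is encoded into a single static code shared across all frames and a sequence of per-frame dynamic codes, and each frame is reconstructed from the pair. This is a minimal NumPy sketch with made-up names and random linear maps, not the actual TranSVAE encoders or objectives:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, W_d, W_t):
    """Toy encoders: a static code z_d shared across the clip
    (mean-pooled over time) and per-frame dynamic codes z_t."""
    z_t = np.tanh(frames @ W_t)          # (T, k) time-varying codes
    z_d = np.tanh(frames.mean(0) @ W_d)  # (k,) time-invariant code
    return z_d, z_t

def decode(z_d, z_t, W_out):
    """Reconstruct each frame from the concatenation [z_d ; z_t[i]]."""
    T = z_t.shape[0]
    z = np.concatenate([np.tile(z_d, (T, 1)), z_t], axis=1)  # (T, 2k)
    return z @ W_out

T, D, k = 8, 16, 4                       # frames, feature dim, latent dim (toy)
W_d = rng.normal(size=(D, k))
W_t = rng.normal(size=(D, k))
W_out = rng.normal(size=(2 * k, D))

clip = rng.normal(size=(T, D))           # stand-in for frame features
z_d, z_t = encode(clip, W_d, W_t)
recon = decode(z_d, z_t, W_out)
print(z_d.shape, z_t.shape, recon.shape)  # (4,) (8, 4) (8, 16)
```

In the actual framework, additional objectives constrain `z_d` to carry only domain information and `z_t` to carry the domain-invariant semantics; the sketch only shows the factorized encode/decode structure.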
Col1: Original sequences ("Human" $\mathcal{D}=\mathbf{P}_1$ and "Alien" $\mathcal{D}=\mathbf{P}_2$); Col2: Sequence reconstructions; Col3: Reconstructed sequences using $z_1^{\mathcal{D}},...,z_T^{\mathcal{D}}$; Col4: Domain transferred sequences with exchanged $z_d^{\mathcal{D}}$.
Visit our project page to explore more details of this work. :paw_prints:
| Conceptual Comparison | Graphical Model | Framework Overview |
|:-:|:-:|:-:|
Please refer to INSTALL.md for the installation details.
Please refer to DATA_PREPARE.md for the details to prepare the UCF101, HMDB51, Jester, Epic-Kitchens, and Sprites datasets.
Please refer to GET_STARTED.md to learn more about using this codebase.
| Task | Source-only | DANN | ADDA | TA3N | CoMix | TranSVAE (Ours) | Oracle |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| JS → JT | 51.5 | 55.4 | 52.3 | 55.5 | 64.7 | 66.1 | 95.6 |
UCF101 → HMDB51
HMDB51 → UCF101
Domain Transfer Example
| Source (Original) | Target (Original) | Source (Original) | Target (Original) |
|:-:|:-:|:-:|:-:|
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) |
| Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) |
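The last row of the table, where the static code of one domain is paired with the dynamic codes of the other, can be sketched as follows. This is a hypothetical linear encoder/decoder with invented names, not the paper's networks; it only demonstrates the swapping mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
D, k = 16, 4                      # frame-feature and latent sizes (toy values)
W_d = rng.normal(size=(D, k))
W_t = rng.normal(size=(D, k))
W_out = rng.normal(size=(2 * k, D))

def split_latents(clip):
    """Toy split: static z_d (pooled over time) vs. per-frame z_t."""
    return np.tanh(clip.mean(0) @ W_d), np.tanh(clip @ W_t)

def reconstruct(z_d, z_t):
    """Decode each frame from the static code paired with its dynamic code."""
    z = np.concatenate([np.tile(z_d, (len(z_t), 1)), z_t], axis=1)
    return z @ W_out

src = rng.normal(size=(8, D))     # "source" clip, 8 frames
tgt = rng.normal(size=(12, D))    # "target" clip, 12 frames
zd_s, zt_s = split_latents(src)
zd_t, zt_t = split_latents(tgt)

# Domain transfer: source statics (z_d^S) driven by target dynamics (z_t^T).
transfer = reconstruct(zd_s, zt_t)
print(transfer.shape)             # (12, 16): target's frame count, source's statics
```

The transferred sequence inherits the frame count (and, with trained networks, the motion content) of the dynamics donor, while the statics donor determines the domain-specific appearance.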
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
We acknowledge the use of the following public resources during the course of this work: UCF101, HMDB51, Jester, Epic-Kitchens, Sprites, I3D, and TRN.
If you find this work helpful, please kindly consider citing our paper:
```bibtex
@inproceedings{wei2023transvae,
  title     = {Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective},
  author    = {Wei, Pengfei and Kong, Lingdong and Qu, Xinghua and Ren, Yi and Xu, Zhiqiang and Jiang, Jing and Yin, Xiang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
}
```