[16] Out-Of-Distribution Representation Learning for Time Series Classification

Concept to review before reading the paper: Domain Generalization

Domain Generalization in images

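First, "domain" here means that the images come from different distributions. Usually we assume that the training data and the test data follow the same distribution. But there are cases, such as four pictures drawn in four different styles, where every image carries the same label, "dog". If a model is trained on only one style and then asked to predict dogs drawn in another style, it performs poorly. Learning a model that works well across multiple domains, including ones not seen during training, is called domain generalization.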

What is domain generalization in time series?

In time series, a domain means a latent distribution.
A time series can contain many unknown latent distributions (domains). For example, sensor data collected from three different people can follow different distributions; this is called spatial distribution shift.
Even data from a single person can change over time; this is called temporal distribution shift.

Abstract
In this paper, we propose to view time series classification from the distribution perspective.
We argue that the temporal complexity of a time series dataset could attribute to unknown latent distributions that need characterize.
To this end, we propose DIVERSIFY for out-of-distribution (OOD) representation learning on dynamic distributions of times series.

Introduction

To handle dynamically changing distributions, the paper proposes an OOD representation learning algorithm built from the distribution perspective. DIVERSIFY (the name of the algorithm) uses a min-max adversarial game. The max part learns to split the time series data into several latent sub-domains by maximizing the segment-wise distribution gap while preserving diversity; in other words, it learns the worst case. The min part, in contrast, learns domain-invariant representations by minimizing the distribution divergence between the latent domains.

What are domain and distribution shift in time series?

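For example, sensor data measured from three different people can follow different distributions because of the dissimilarities between them; this is called spatial distribution shift. Temporal distribution shift can exist even in data measured from a single person. (The paper also shows this empirically.)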

OOD generalization requires latent domain characterization

Time series are non-stationary, yet naive approaches treat the time series data as a single distribution. This ignores the diversity inside the dataset, so such approaches fail to capture domain-invariant (OOD) features. See Fig. 1(c) of the paper.

A brief formulation of latent domain characterization

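Time series data can contain K unknown latent domains rather than one fixed domain. The goal is to learn the worst-case distribution scenario, the one in which the gap between latent domains is maximized, because this direction preserves as much of the diverse information across the latent domains as possible.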

Overall process

The method can be summarized in four main steps.

1. Pre-processing: apply a sliding window to cut the entire training data into fixed-size windows. Each window is treated as the smallest domain unit.
2. Fine-grained feature update: update the feature extractor, using pseudo domain-class labels as supervision.
3. Latent distribution characterization (Maximization): learn to distinguish domain labels. The distribution gap is maximized in order to maximize diversity.
4. Domain-invariant representation learning (Minimization): use the pseudo domain labels obtained in step 3 to learn domain-invariant representations and train a generalizable model.
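The pre-processing step can be illustrated with a minimal sketch. The helper name `sliding_windows` and the `window` and `stride` parameters are assumptions made for this example; the summary above does not state the window size or stride the paper uses.

```python
def sliding_windows(series, window, stride):
    """Cut a 1-D sequence into fixed-size windows (hypothetical helper).

    Each returned window plays the role of the "smallest domain unit".
    Trailing samples that do not fill a whole window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(series) - window
    return [series[i:i + window] for i in range(0, last_start + 1, stride)]


# Toy signal of 10 samples, cut into windows of 4 samples with stride 3.
signal = list(range(10))
windows = sliding_windows(signal, window=4, stride=3)
for w in windows:
    print(w)
```

With a stride smaller than the window size the windows overlap, as in this toy run; a stride equal to the window size would give non-overlapping segments.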

In more detail, starting from step 2:

Fine-grained Feature Update

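The paper introduces a new concept, the pseudo domain-class label, which combines domain-level and class-level knowledge and serves as supervision for the feature extractor. Using both together is more fine-grained than using only the domain label or only the class label. (The ablation study later in the paper confirms this.)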

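In the first iteration, all domain labels d' are initialized to 0. Each (domain, category) pair is then treated as a new class, giving S = K × C classes in total. (K is a hyperparameter: the pre-defined number of latent distributions, i.e. the number of domains fixed in advance. C is the number of class labels.) The pseudo domain-class assignment is carried out with s = d' × C + y as the supervision signal, where y is the class label.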
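The assignment s = d' × C + y can be checked with a small sketch. The function names below are hypothetical and exist only for this illustration; the values of K and C are arbitrary.

```python
def pseudo_domain_class_label(domain, label, num_classes):
    """Combine a pseudo domain label d' and a class label y into s = d' * C + y."""
    return domain * num_classes + label


def split_pseudo_label(s, num_classes):
    """Recover (d', y) from s; the inverse of the mapping above."""
    return divmod(s, num_classes)


K, C = 3, 4   # assumed toy values: 3 latent domains, 4 classes
S = K * C     # number of pseudo domain-class labels

# Every (domain, class) pair maps to a distinct label in range(S).
labels = [pseudo_domain_class_label(d, y, C) for d in range(K) for y in range(C)]
print(labels)

# In the first iteration every d' is 0, so s reduces to the class label y.
print([pseudo_domain_class_label(0, y, C) for y in range(C)])
```

Because the mapping is one-to-one, a classifier over the S labels has to separate samples by class and by latent domain at the same time, which is what makes the supervision fine-grained.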

Latent Distribution Characterization

Adversarial training is used to disentangle the domain labels from the class labels. Since no domain labels exist at the start, they are obtained with a self-supervised pseudo-labeling strategy. The self-supervised approach borrows from the DeepClustering (ECCV'18) methodology.
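The idea of obtaining pseudo domain labels by clustering features can be sketched as follows. This is a plain k-means on toy 2-D features, written only to illustrate clustering-based pseudo-labeling; it is not the paper's exact procedure. The function names, the distance measure, and the choice of the first k points as initial centroids are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(features, k, iters=10):
    """Assign each feature vector a pseudo domain label in range(k).

    Assumption: the first k feature vectors serve as initial centroids,
    which keeps the sketch deterministic.
    """
    centroids = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid becomes the pseudo domain label.
        labels = [min(range(k), key=lambda j: sq_dist(f, centroids[j]))
                  for f in features]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy features forming two well-separated groups.
features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
pseudo_domains = kmeans_pseudo_labels(features, k=2)
print(pseudo_domains)
```

In a full pipeline the features would come from the network being trained, and the labels would be recomputed as the features change.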

Domain-invariant Representation Learning

Once the latent distributions are obtained, domain-invariant representations are learned for generalization. Borrowing the idea from DANN, the classification loss and the domain classifier loss are updated together through adversarial training with a Gradient Reversal Layer (GRL).
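What a Gradient Reversal Layer does can be shown with a framework-free sketch: it is the identity in the forward pass and flips the sign of the gradient (scaled by a coefficient) in the backward pass. The class below and its `lam` parameter are illustrative assumptions; in a deep learning framework this would normally be written as a custom autograd operation.

```python
class GradientReversal:
    """Toy gradient reversal layer with manually written forward/backward."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed scaling coefficient for the reversed gradient

    def forward(self, x):
        # Forward pass: features go to the domain classifier unchanged.
        return list(x)

    def backward(self, grad_output):
        # Backward pass: the gradient from the domain classifier is negated,
        # so the feature extractor is pushed to make domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=2.0)
features = [1.0, -2.0]
out = grl.forward(features)
grad_to_extractor = grl.backward([0.5, -1.0])
print(out, grad_to_extractor)
```

The domain classifier above the layer still minimizes its own loss, while the feature extractor below it receives the reversed gradient, which is how a single backward pass trains both sides of the adversarial game.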

Training, Inference, Complexity

The difference from ordinary training is that the last two steps optimize only the last few layers. Inference is performed in the last step.
