[31] POP: Prompt Of Prompts for Continual Learning

Abstract

The goal of continual learning is to learn without catastrophic forgetting, but existing methods still tend to suffer semantic drift in the learned feature space.
Recent work shows that prompt tuning can solve specific tasks without harming the generality of the representation.
An open question remains: how to learn both task-specific and global prompts (i.e., capture cross-task information).
This paper proposes POP, a model that progressively learns a group of task-specific prompts together with global prompts.

Introduction

Existing continual learning methods remain vulnerable to forgetting, their memory and time complexity can grow without bound as the number of tasks increases, and they can show a substantial gap relative to joint training on all tasks.
With prompt tuning, a foundation model can adapt efficiently to specific tasks because only a small number of parameters need to be trained. This keeps a generalizable feature representation with great potential for few-shot learning.
This paper proposes POP, which consists of two kinds of prompts.
- The first is the task prompts P_t: one set is learned per task t and then frozen, so that the foundation model can discriminate local information among the classes of that task.
- The second is learned continually across all tasks, so that the model can discriminate global information among all classes of all tasks.

Related work

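Knowledge transfer methods can forget previous-task knowledge, because what was learned on earlier tasks gets overwritten while learning the new task.
Among existing continual learning methods, network expansion and parameter isolation have the drawback that the model size grows quickly every time a task is added.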

Method

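figure description
- The embedding E maps patches to tokens and is complemented by the prompt sets as they are learned.
- At step t, only P_t and POP are updated.
- These tokens are fed to the foundation model to obtain R_Pt and R_POP (the per-task representation and the representation across all tasks).
- Finally, the prompts are fused by averaging R_POP and concatenating the result with R_Pt, which gives both cross-task and task-specific features.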
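
The update rule in the figure can be sketched as a small bookkeeping example. This is only an illustration under assumptions: the function names `trainable_groups` and `token_layout`, the number of prompts per group, and the ordering of the sequence (patches, then task prompts, then POP) are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def trainable_groups(step: int) -> Dict[str, bool]:
    """Map every prompt group that exists at `step` (1-indexed) to a trainable flag."""
    # Earlier task prompts P_i (i < step) are frozen; only P_step is updated.
    groups = {f"P_{i}": i == step for i in range(1, step + 1)}
    # POP is learned continually, so it stays trainable at every step.
    groups["POP"] = True
    return groups


def token_layout(num_patches: int, step: int,
                 prompts_per_task: int, num_pop: int) -> List[str]:
    """Label each position of the sequence fed to the frozen foundation model."""
    layout = ["patch"] * num_patches
    for i in range(1, step + 1):
        layout += [f"P_{i}"] * prompts_per_task
    # Assumed ordering: patches first, then task prompts, then POP.
    layout += ["POP"] * num_pop
    return layout


if __name__ == "__main__":
    print(trainable_groups(3))
    print(token_layout(num_patches=4, step=2, prompts_per_task=2, num_pop=1))
```

The sequence grows by one prompt group per task, while the backbone itself is never updated.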
build-up
- Shallow Prompt Tuning: prompts are placed only at the input layer of the model.
- Deep Prompt Tuning: prompts are placed at multiple layers of the model.
- The biggest advantage of prompt tuning: if a model trained on A is fine-tuned on B to adapt it to B, the representation learned on A is destroyed, and both robustness and generality drop. Prompt tuning, in contrast, can find a better trade-off between model size and adaptation.
- The foundational feature concatenation approach learns a prompt set for each task, stores it in a memory buffer, and combines these prompts to adapt the transformer to all tasks. However, when knowledge is transferred from previous tasks to a new task, each prompt set is trained mainly on the data of one specific task, so there can be substantial redundancy among the representations of different tasks.
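
The difference between shallow and deep prompt tuning can be illustrated with a toy forward pass. This is a minimal sketch, not the paper's implementation: tokens are plain floats, `mean_mix` is a hypothetical stand-in for an attention layer, and the deep variant assumes that each layer gets its own fresh prompts and that the prompt outputs of the previous layer are discarded.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def mean_mix(seq: List[float]) -> List[float]:
    """Toy layer: every token is shifted by the sequence mean (crude token mixing)."""
    mean = sum(seq) / len(seq)
    return [x + mean for x in seq]


def shallow_forward(tokens: List[float], prompts: List[float],
                    layers: List[Layer]) -> List[float]:
    """Prompts are prepended once, at the input layer only."""
    seq = prompts + tokens
    for layer in layers:
        seq = layer(seq)
    return seq


def deep_forward(tokens: List[float], prompts_per_layer: List[List[float]],
                 layers: List[Layer]) -> List[float]:
    """Each layer receives its own prompts; prompt outputs are dropped afterwards."""
    seq = list(tokens)
    for layer, prompts in zip(layers, prompts_per_layer):
        seq = layer(prompts + seq)[len(prompts):]
    return seq


if __name__ == "__main__":
    layers = [mean_mix, mean_mix]
    print(shallow_forward([1.0, 2.0], [0.0], layers))
    print(deep_forward([1.0, 2.0], [[0.0], [0.5]], layers))
```

In both variants the layers stay fixed and only the prompt values would be trained, which is why the pretrained representation is left intact.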
proposal
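- The paper proposes learning the prompt sets P_t sequentially, so that step t takes into account not only the current task but also the prompt sets learned for earlier tasks i < t.
- At step t, the prompt sets for i < t are frozen, P_t is the learnable part, and all of them are fed to the model.
- With this design, the overall model adaptation grows over time. R_Pt can reuse all features learned on earlier tasks through Transformer attention, so it can capture what is novel in the new task and specialize the adaptation.
- How, then, are the representations learned on different tasks combined?
- When the tokens are fed to the model, the attention layers have already accounted for the relationship between x and P_t, so combining the previous-task representations again would mix them a second time.
- To avoid this, the paper proposes a way to integrate information across tasks.
- An additional group of prompts is learned on top of the prompts of all tasks; this is called the prompt of prompts, or POP.
- The POP sets are learned continually over all tasks and integrate the information.
- Finally, the POP outputs are averaged into a single feature, which is then concatenated with the task-specific representation.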
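
The final fusion step (averaging -> concat) is simple enough to write out directly. A minimal sketch, assuming each representation is a plain list of floats; the function name `fuse`, the choice to concatenate all task representations, and the ordering (POP mean first) are assumptions for illustration.

```python
from typing import List

Vector = List[float]


def fuse(r_pop: List[Vector], r_tasks: List[Vector]) -> Vector:
    """Average the POP output vectors, then concatenate the task-specific vectors."""
    dim = len(r_pop[0])
    # Element-wise mean over all POP outputs gives one cross-task feature.
    pop_mean = [sum(vec[d] for vec in r_pop) / len(r_pop) for d in range(dim)]
    fused = list(pop_mean)
    # Task-specific representations are appended unchanged.
    for r_t in r_tasks:
        fused.extend(r_t)
    return fused


if __name__ == "__main__":
    r_pop = [[1.0, 3.0], [3.0, 5.0]]   # two POP outputs
    r_tasks = [[10.0, 20.0]]           # one task-specific output
    print(fuse(r_pop, r_tasks))
```

The fused vector has length dim + (number of tasks) x dim under these assumptions, so the classifier input grows with the number of tasks while the cross-task part stays fixed in size.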

Objective Loss

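A CIL loss makes R_POP discriminate all classes of all tasks, and an auxiliary loss makes each task's R_Pt discriminate the classes of task t. For the auxiliary loss, the prompt set P_t learns to separate the classes of task t while assigning every class outside task t to a single "none of the above" category (effectively a reject option). The third term is a task/class identity loss. The final loss is a weighted average of the three losses.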
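
The label handling for the auxiliary loss and the final combination can be sketched as follows. This is an illustration under assumptions: the helper names, the use of softmax cross-entropy for every term, and the default weights are hypothetical, and the review does not state the actual weight values.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one sample, computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def task_local_label(global_label: int, task_classes: List[int]) -> int:
    """Map a global class id into the (len(task_classes) + 1)-way space of one task."""
    if global_label in task_classes:
        return task_classes.index(global_label)
    # Every class outside this task shares one extra "none of the above" index.
    return len(task_classes)


def total_loss(l_cil: float, l_aux: float, l_id: float,
               weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted average of the three loss terms (weights are placeholders)."""
    terms = (l_cil, l_aux, l_id)
    return sum(w * l for w, l in zip(weights, terms)) / sum(weights)


if __name__ == "__main__":
    task_classes = [5, 6, 7]
    print(task_local_label(7, task_classes), task_local_label(1, task_classes))
    print(total_loss(cross_entropy([2.0, 0.5, 0.1], 0), 0.3, 0.2))
```

With this mapping, the head for task t has one more output than the task has classes, and samples from other tasks act as negatives for it.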

Strength

To address the limitations of existing continual learning work, the paper proposes learning both task-specific and global prompts.

Weakness

The concept of keeping separate task-specific and global prompts was already proposed in DualPrompt, so it does not look very novel.
What differs is the feature fusion, but the fusion itself is extremely simple (averaging -> concat). The experiments do compare several fusion methods, yet the method is so simple that novelty still seems hard to claim.
