gwleeee opened 11 months ago
https://arxiv.org/abs/2305.15542 https://github.com/bfshi/TOAST
UC Berkeley, Microsoft Research. arXiv preprint, under review
Top-Down Attention Steering
Abstract
+ Other transfer learning methods
LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022, Microsoft)
VPT: Visual Prompt Tuning (ECCV 2022)
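Of these baselines, LoRA is the easiest to illustrate: it freezes the pretrained weight and learns only a low-rank additive update. A minimal sketch (class name, rank, and scaling are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: frozen base weight W
    plus a trainable low-rank update (B @ A) scaled by alpha / r."""
    def __init__(self, in_features, out_features, r=4, alpha=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False      # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # base output + low-rank correction; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because `B` is zero-initialized, the adapted layer starts out identical to the frozen base layer, and only the small `A`/`B` matrices are trained.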
TOAST: Top-Down Attention Steering
Preliminary: Transformer With Top-Down Attention
Step (i): Bottom-up transformer (feed-forward backbone)
Step (ii): Feature selection
Step (iii): Feedback path
Step (iv): Self-attention with top-down input
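The four steps above can be sketched as a single forward pass. This is a hedged simplification: module names, the sigmoid-based feature selection, and adding the top-down signal to the block input (rather than injecting it into the self-attention value, as the paper describes) are my assumptions for readability:

```python
import torch
import torch.nn as nn

class TopDownViTSketch(nn.Module):
    """Illustrative sketch of steps (i)-(iv); shapes and modules are simplified."""
    def __init__(self, dim=64, depth=2, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)]
        )
        self.xi = nn.Parameter(torch.randn(dim))  # task-specific query for feature selection
        self.feedback = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        # Step (i): plain bottom-up feed-forward pass through the backbone
        h = x
        for blk in self.blocks:
            h = blk(h)
        # Step (ii): feature selection - reweight output tokens by
        # similarity to the task query (sigmoid gate is an assumption)
        gate = torch.sigmoid(h @ self.xi)          # (batch, tokens)
        td = h * gate.unsqueeze(-1)
        # Step (iii): feedback path carries the top-down signal back down
        for fb in reversed(self.feedback):
            td = fb(td)
        # Step (iv): second bottom-up pass conditioned on the top-down input
        # (summed with the input here; the paper feeds it into self-attention)
        h2 = x + td
        for blk in self.blocks:
            h2 = blk(h2)
        return h2
```

The key point the sketch shows: the backbone runs twice, and only the second pass sees the top-down signal derived from the first pass.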
Top-down Attention Steering
For TOAST, the model is trained in two stages.
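As I understand the recipe, the pretrained bottom-up backbone stays frozen throughout, and only the top-down components (feedback path, task query, head) are tuned; a first stage pre-tunes them on a general public dataset before the second stage adapts them to the downstream task. A hedged sketch of the freezing logic (the name-matching keywords are illustrative):

```python
import torch.nn as nn

def set_trainable(model, train_top_down_only=True):
    """Freeze the bottom-up backbone and leave only top-down modules
    (and the task head) trainable. Keyword matching is an assumption."""
    for name, p in model.named_parameters():
        if train_top_down_only:
            p.requires_grad = any(k in name for k in ("feedback", "xi", "head"))
        else:
            p.requires_grad = True  # full fine-tuning baseline
```

Both training stages would then call `set_trainable(model)` before optimization, differing only in the dataset used.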
Experiments
Visual classification
Language generation
Different model architectures and tasks
Parameter-Efficient TOAST
Limitations of TOAST