Recent Advances in LM Fine-tuning

Notes from reading Ruder's blog post (https://ruder.io/recent-advances-lm-fine-tuning/)

Adaptive finetuning
- = Take a PLM and fine-tune it on domain-specific data (unlabelled).
- It can even be done on the task data, as multi-task learning.
- Adapting to data of the target domain and adapting to data of the target task are complementary.
- Adapting to the task's unlabelled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining.
- Reportedly useful when the model has to be applied to several tasks within a single domain.
- general PLM -> domain-adaptive pretraining -> task-adaptive pretraining -> task training
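
The staged recipe in the last bullet can be written down as an explicit schedule. This is only a minimal sketch: the `Stage` type, the function name, and the choice of MLM as the objective for the unlabelled stages are assumptions made for illustration, not something the blog post prescribes.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    name: str
    data: str
    labelled: bool
    objective: str  # "MLM" is an assumed pretraining objective, used for illustration


def adaptive_finetuning_schedule(domain_adaptive: bool = True,
                                 task_adaptive: bool = True) -> List[Stage]:
    """List the training stages in order, from the general PLM to task training."""
    stages = [Stage("general pretraining", "general-domain corpus", False, "MLM")]
    if domain_adaptive:
        # Continue pretraining on unlabelled text from the target domain.
        stages.append(Stage("domain-adaptive pretraining", "target-domain corpus", False, "MLM"))
    if task_adaptive:
        # Continue pretraining on the task's own inputs, ignoring the labels.
        stages.append(Stage("task-adaptive pretraining", "task inputs without labels", False, "MLM"))
    # Only the final stage uses labelled data.
    stages.append(Stage("task training", "task dataset", True, "supervised loss"))
    return stages


for stage in adaptive_finetuning_schedule():
    print(f"{stage.name}: {stage.data} (labelled={stage.labelled}, objective={stage.objective})")
```

Because the two adaptive stages are complementary, both flags default to `True`; switching one off gives the shorter pipelines for comparison.
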
Behavioural finetuning
- = Fine-tuning on an intermediate task (labelled data).
- MLM may provide useful information for learning $P(Y|X)$ but likely does not contain every signal important for the task. For example, models pre-trained with MLM struggle with modelling negations, numbers, or named entities.
Parameter-efficient fine-tuning
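- = Keeping a separately fine-tuned model for every task is too expensive, so freeze most of the model parameters and fine-tune only a small number of parameters per task.
- adapter = a small bottleneck layer inserted between the layers of the PLM (the PLM itself stays frozen).
- Instead of modifying the PLM parameters directly, write $\theta_{finetune} = \theta_{pretrained} + \theta_{task}$ and represent $\theta_{task}$ more efficiently (e.g. as a sparse vector).
- Modify only a subset of the PLM parameters: fine-tuning only the last layer, as is done in vision, is somewhat less effective in NLP.
- There are also reports that fine-tuning only the model's bias parameters works well.
- Prune the PLM parameters during fine-tuning. The last few layers of the PLM in particular reportedly have limited use and can be removed entirely or re-initialised.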
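
Two of the ideas above, the bottleneck adapter and the sparse task vector, can be sketched in a few lines of plain Python. The function names, the ReLU non-linearity, the residual connection, and all toy numbers are assumptions chosen for illustration; real implementations operate on tensors inside a transformer layer.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def adapter_forward(h: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Bottleneck adapter with a skip connection: h + W_up * relu(W_down * h)."""
    z = [max(0.0, a) for a in matvec(w_down, h)]  # project down to the bottleneck, then ReLU
    return [x + d for x, d in zip(h, matvec(w_up, z))]  # project back up and add the input


def apply_task_diff(theta_pretrained: Vector, theta_task: Dict[int, float]) -> Vector:
    """theta_finetune = theta_pretrained + theta_task, with theta_task stored sparsely."""
    theta = list(theta_pretrained)  # copy, so the shared pretrained weights stay untouched
    for index, delta in theta_task.items():
        theta[index] += delta
    return theta


# Hidden size 4, bottleneck size 2 (toy sizes).
h = [1.0, -2.0, 0.5, 3.0]
w_down = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]
# With a zero up-projection the adapter is the identity map, so inserting it
# does not change the frozen PLM's output before training starts.
unchanged = adapter_forward(h, w_down, w_up_zero)

theta_pretrained = [0.5, -1.0, 2.0, 0.0, 1.5]
theta_task = {1: 0.25, 4: -0.5}  # only 2 of the 5 entries are stored for this task
theta_finetune = apply_task_diff(theta_pretrained, theta_task)
```

In both cases the per-task storage is small: the adapter weights in the first case, the non-zero entries of `theta_task` in the second.
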
Text-to-text fine-tuning
- Learning without parameter updates, as in GPT-3; prompt engineering.
- Such models are robust to fine-tuning on small datasets, but they suffer from instabilities in the few-shot setting and are sensitive to the prompt and the few-shot examples.
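
To make the prompt sensitivity concrete, here is a minimal sketch of how a few-shot prompt might be assembled as plain text. The `Input:`/`Output:` template and the function name are hypothetical; they are not a format required by any particular model.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: List[Tuple[str, str]],
                          query: str) -> str:
    """Render a task as text: instruction, solved examples, then the open query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)


examples = [("great movie", "positive"), ("boring plot", "negative")]
prompt = build_few_shot_prompt("Classify the sentiment.", examples, "loved it")
# Reordering the demonstrations yields a different prompt for the same query,
# which is one of the choices the model's answer can be sensitive to.
reordered = build_few_shot_prompt("Classify the sentiment.", examples[::-1], "loved it")
print(prompt)
```

The wording of the instruction, the choice of examples, and their order are all free parameters here, which is why results can vary between seemingly equivalent prompts.
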
Mitigating finetuning instabilities
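- Especially when fine-tuning on small datasets, results can differ from run to run.
- The outcome is reportedly affected by the weight initialisation of the output layer and by the order of the training data.
- If a run does not look good early on, just stop it.
- When fine-tuning BERT, a small learning rate and a larger number of epochs are recommended.
- More recently, adversarial and trust-region-based methods have been proposed.
- As covered in the previous sections, trying adaptive/behavioural fine-tuning is also recommended.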
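
The "several seeds, stop the bad ones early" advice can be sketched as a small loop. Everything here is an assumption for illustration: `train_and_eval(seed, steps)` is a hypothetical callback that returns a validation score after training for `steps` steps, and the scores in `toy_curves` are made-up numbers, not results from any experiment.

```python
from typing import Callable, Dict, List


def finetune_with_restarts(train_and_eval: Callable[[int, int], float],
                           seeds: List[int],
                           probe_steps: int,
                           total_steps: int,
                           min_probe_score: float) -> Dict[int, float]:
    """Run one trial per seed, abandoning trials that look bad after `probe_steps`."""
    final_scores: Dict[int, float] = {}
    for seed in seeds:
        probe_score = train_and_eval(seed, probe_steps)
        if probe_score < min_probe_score:
            continue  # unpromising start: stop this run and spend the budget elsewhere
        # Simplification: a real loop would resume from the probe checkpoint.
        final_scores[seed] = train_and_eval(seed, total_steps)
    return final_scores


# Made-up (early score, final score) pairs standing in for real training runs.
toy_curves = {0: (0.52, 0.71), 1: (0.70, 0.88), 2: (0.49, 0.50)}


def toy_train_and_eval(seed: int, steps: int) -> float:
    early, final = toy_curves[seed]
    return early if steps <= 100 else final


scores = finetune_with_restarts(toy_train_and_eval, seeds=[0, 1, 2],
                                probe_steps=100, total_steps=1000,
                                min_probe_score=0.6)
best_seed = max(scores, key=scores.get)
```

The seed controls both the output-layer initialisation and the data order in this picture, which are the two sources of variance named above.
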

