Improving Language Understanding by Generative Pre-Training (GPT 1)

jinmang2 commented 2 years ago

Jiphyeonjeon intermediate-level study group

Sunday, April 17, 2022, 9:00
Presenters: 김택현, 신원지, 한나연, 한다솜
Paper link: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

Abstract

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

jinmang2 commented 2 years ago

What kind of task is Aux LM?
Why were 12 decoder layers stacked?

HanNayeoniee commented 2 years ago

The abstract says that, unlike earlier models, only small changes are applied to the model architecture during fine-tuning. What changes does this refer to?

In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

+) Is fine-tuning GPT easier than fine-tuning BERT? (BERT was published after GPT, so we should first check whether the two can be compared at all.)

dobbytk commented 2 years ago

What kind of task is Aux LM?

Why were 12 decoder layers stacked?

The abstract says that, unlike earlier models, only small changes are applied to the model architecture during fine-tuning. What changes does this refer to?

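In this paper, Aux LM refers to the objective function of the pre-trained language model. It is used as an auxiliary objective during fine-tuning: $L_1(C)$ is multiplied by a weight ( $\lambda$ ) and added to the supervised objective, so the quantity being optimized is $L_3(C) = L_2(C) + \lambda \cdot L_1(C)$. This improves generalization and speeds up convergence. cf) $L_2(C)$ is the objective function for supervised fine-tuning.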
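To make the weighted sum concrete, here is a minimal sketch of the combined fine-tuning objective in plain Python. The function names, the toy probabilities, and the default `lam=0.5` are illustrative assumptions, not code from the paper; both terms are written as log-likelihoods to be maximized.

```python
import math

def lm_log_likelihood(token_probs):
    """L1: sum of log-probabilities the LM assigns to each observed next token."""
    return sum(math.log(p) for p in token_probs)

def classification_log_likelihood(label_prob):
    """L2: log-probability the classifier head assigns to the correct label."""
    return math.log(label_prob)

def fine_tuning_objective(token_probs, label_prob, lam=0.5):
    """L3 = L2 + lam * L1, the combined objective (higher is better)."""
    l1 = lm_log_likelihood(token_probs)
    l2 = classification_log_likelihood(label_prob)
    return l2 + lam * l1

# Hypothetical probabilities for a single training example.
token_probs = [0.5, 0.25, 0.5]   # P(token_i | previous tokens)
label_prob = 0.5                 # P(correct label | input)

l3 = fine_tuning_objective(token_probs, label_prob, lam=0.5)
l3_without_aux = fine_tuning_objective(token_probs, label_prob, lam=0.0)
```

Setting `lam=0.0` recovers plain supervised fine-tuning, which is a simple way to see what the auxiliary term contributes.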

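The left plot of Figure 2 shows that performance on specific datasets improves as more decoder blocks are used (the plot varies the number of pre-trained layers transferred to the target task). I can't say for certain why exactly 12 were used, but as I understand it, the Transformer base model uses 6 blocks because stacking more than 6 actually hurts performance. My guess is that GPT-1 settled on 12 decoder blocks for a similar reason. If anyone knows the answer, I'd appreciate a comment.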
In the abstract, 'minimal changes' refers to the following: earlier pre-trained models had the drawback that the model structure had to be modified for fine-tuning, whereas GPT-1 leaves the model structure untouched and only needs a simple linear layer added at the end.
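The "task-aware input transformations" mentioned in the abstract can be sketched as follows: structured inputs are flattened into a single token sequence so that the same pre-trained model can read them. This is a minimal sketch; the token strings `<s>`, `$`, `<e>` and the function names are illustrative assumptions, not the paper's actual vocabulary or code.

```python
# Special tokens (hypothetical string forms) for start, delimiter, and extract.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    """Entailment: premise and hypothesis joined by a delimiter token."""
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    """Similarity has no inherent ordering, so both orderings are produced."""
    return [entailment_input(text_a, text_b), entailment_input(text_b, text_a)]

def multiple_choice_inputs(context, answers):
    """QA / commonsense reasoning: one sequence per candidate answer."""
    return [[START] + context + [DELIM] + answer + [EXTRACT] for answer in answers]

premise = ["a", "man", "sleeps"]
hypothesis = ["he", "is", "awake"]
sequence = entailment_input(premise, hypothesis)
pair = similarity_inputs(premise, hypothesis)
choices = multiple_choice_inputs(["who", "slept", "?"], [["a", "man"], ["a", "dog"]])
```

In this scheme the only new parameters are the added linear layer, applied to the model's final representation of each sequence, plus the embeddings for the special tokens.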

jinmang2 commented 2 years ago

Presentation link: https://www.youtube.com/watch?v=GlfC-9Cajus&list=PL2tRglRS_GqjCcBVM26gYNdoOnX12lLIh&index=5

