What is this paper about? 👋

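A model fine-tuned on a suitably large dataset can learn to introduce engaging talking points and maintain a consistent persona in conversation. Unlikelihood training and retrieve-and-refine models help avoid responses that are repetitive, dull, or vague about knowledge.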

Abstract 🕵🏻‍♂️

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

What can we learn from reading this paper? 🤔

1. Model

1-1. Retriever

We employ the poly-encoder architecture of (Humeau et al., 2019). Poly-encoders encode global features of the context using multiple representations (n codes, where n is a hyperparameter), which are attended to by each possible candidate response

Idea

Given the dialogue history (context) as input, the retrieval system scores a large set of candidate responses and outputs the highest-scoring one. In other words, it performs multi-sentence scoring to select the next dialogue utterance.

Poly-Encoder

Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring(Humeau et al., 2019)
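Combines the strengths of the Bi-Encoder and the Cross-Encoder
- At inference it is faster than a Cross-Encoder and more accurate than a Bi-Encoder
- Three attention steps are performed
  - Self-attention among the token embeddings of the input context
  - Attention between the codes (queries) and the outputs of that self-attention, which learns m codes that capture global features
  - Attention between the candidate embedding and the m learned global features
- The similarity score is computed with a dot product
Pretrained on Reddit data
Fine-tuned on ConvAI2 data
Two Poly-Encoder sizes are considered: 256M and 622M parameters
- Both use 64 codes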
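
The attention steps listed above can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the context token vectors stand in for the output of the Transformer self-attention step, and the names `attend`, `poly_encoder_score`, and the toy dimensions are hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys):
    """Return the attention-weighted average of `keys` for one query vector."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)]

def poly_encoder_score(ctx_tokens, codes, cand_emb):
    """Score one candidate response against one context."""
    # Each code (query) attends over the context tokens -> m global features.
    global_feats = [attend(code, ctx_tokens) for code in codes]
    # The candidate attends over the m global features -> one context vector.
    ctx_emb = attend(cand_emb, global_feats)
    # The final similarity is a dot product.
    return dot(ctx_emb, cand_emb)

def rand_vec(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

random.seed(0)
ctx_tokens = [rand_vec(8) for _ in range(6)]   # 6 context tokens, toy hidden size 8
codes = [rand_vec(8) for _ in range(4)]        # m = 4 here; the notes mention 64 codes
candidates = [rand_vec(8) for _ in range(5)]   # 5 candidate responses

scores = [poly_encoder_score(ctx_tokens, codes, c) for c in candidates]
best = max(range(len(candidates)), key=scores.__getitem__)  # retrieved response index
```

Because the candidate embedding only enters in the last two steps, candidate embeddings can be precomputed and cached, which is where the speed advantage over a Cross-Encoder comes from.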

1-2. Generator

We employ a standard Seq2Seq Transformer architecture to generate responses rather than retrieve them from a fixed set.

Uses a standard Seq2Seq Transformer with a BPE tokenizer

Three model sizes are considered: 90M, 2.7B, and 9.4B parameters

cf) Meena has 2.6B parameters
The 2.7B model has 2 encoder layers, 24 decoder layers, 2560-dimensional embeddings, and 32 attention heads, roughly matching Meena
The 9.4B model has 4 encoder layers, 32 decoder layers, 4096-dimensional embeddings, and 32 attention heads

1-3. Retrieve and Refine

One approach to try to alleviate these problems is to combine a retrieval step before generation, referred to as a retrieve and refine model (Weston et al., 2018). We consider two variants for the retrieval step: dialogue retrieval and knowledge retrieval.

paper: Retrieve and Refine: Improved Sequence Generation Models For Dialogue

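Does it actually improve results? ⇒ Checked in the experiments

An architecture that combines the Retriever and the Generator described above

One approach to the Generator's inability to answer about external data is to add a retrieval step before generation, known as a retrieve-and-refine model (Weston et al., 2018)

⇒ The hope is that the generative model learns to copy words or phrases from the retrieved result when appropriate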

Dialogue retrieval

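input context + [SEP] + retrieved next utterance
Instead of returning the retrieved next utterance as-is, the Generator produces the response conditioned on it
Expectations
- Compared with retrieval alone, the generated response should read more naturally
- Compared with generation alone, the retrieved next utterance is human-written, so the model should pick up more specific words and expressions through the Retriever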

Knowledge Retrieval

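Using the IR system proposed for the Wizard of Wikipedia task (TF-IDF), retrieve the knowledge candidates that best match the input context, then pass the single best one to the generator
- The IR system searches for candidate knowledge documents using the input context (for example, wiki articles)
- The input context and the candidate documents are fed to the Poly-Encoder, which picks the best document
- The best document is fed to the generative model, which produces the response
In dialogue retrieval the Generator input is input context + [SEP] + retrieved response, whereas knowledge retrieval uses only the best knowledge candidate. The reason is that the fine-tuning data has a clear correspondence between knowledge and response, because only the gold knowledge is used as input during training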
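
The three knowledge-retrieval steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the TF-IDF weighting is a simple smoothed variant written for the example (not the Wizard of Wikipedia IR system), `overlap_score` is a hypothetical stand-in for the Poly-Encoder scorer, and the three documents are made-up toy data.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    """Compute smoothed IDF weights and one TF-IDF vector (dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + len(docs)) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return idf, vectors

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def retrieve_candidates(context, docs, top_n=2):
    """Step 1: TF-IDF search for the documents closest to the context."""
    idf, vectors = build_index(docs)
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(context)).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return [docs[i] for i in ranked[:top_n]]

def overlap_score(context, doc):
    """Hypothetical stand-in for the Poly-Encoder: counts shared words."""
    return len(set(tokenize(context)) & set(tokenize(doc)))

def pick_best(context, candidates, score_fn):
    """Step 2: rerank the candidates and keep the single best one."""
    return max(candidates, key=lambda doc: score_fn(context, doc))

docs = [
    "The guitar is a fretted musical instrument with six strings.",
    "Paris is the capital of France.",
    "A piano is a keyboard instrument.",
]
context = "I love playing the guitar"

candidates = retrieve_candidates(context, docs)
best_knowledge = pick_best(context, candidates, overlap_score)
generator_input = best_knowledge  # Step 3: only the best document is passed on
```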

2. Training Objectives

2-1. Ranking for Retrieval

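To train the retrieval model, a cross-entropy loss is minimized over the logits $y_{cand_1}, \dots, y_{cand_n}$, where $y_{cand_1}$ is the score of the correct response and the rest are sampled negatives. Training uses batches of 512.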
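
A minimal sketch of this ranking loss, assuming the candidate scores are already computed. The function names are hypothetical, and treating the other responses in the batch as the negatives (as `in_batch_loss` does) is one common way to obtain sampled negatives, shown here as an assumption.

```python
import math

def ranking_loss(logits, gold_index=0):
    """Cross-entropy over candidate scores: -log softmax(logits)[gold_index]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_z - logits[gold_index]

def in_batch_loss(score_matrix):
    """Mean loss when row i's correct candidate is column i.

    score_matrix[i][j] is the score of context i against response j, so the
    other responses in the batch act as negatives for context i.
    """
    losses = [ranking_loss(row, i) for i, row in enumerate(score_matrix)]
    return sum(losses) / len(losses)

good = in_batch_loss([[3.0, 0.0], [0.0, 3.0]])  # gold responses score highest
bad = in_batch_loss([[0.0, 3.0], [3.0, 0.0]])   # negatives score highest
```

With in-batch negatives, a batch of 512 gives each context 511 negatives without extra encoding cost.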

2-2. Likelihood Training for Generation

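The standard maximum likelihood estimation (MLE) approach is used.

Given a dataset $D = \{(x^{(i)}, y^{(i)})\}$, minimize:

$$\mathcal{L}_{MLE}^{(i)}(p_\theta, x^{(i)}, y^{(i)}) = -\sum_{t=1}^{|y^{(i)}|} \log p_\theta(y^{(i)}_t \mid x^{(i)}, y^{(i)}_{<t})$$

where $x^{(i)}$ is a gold input context and $y^{(i)}$ is a gold next-utterance, and $y^{(i)}_t$ is the $t$-th token of $y^{(i)}$.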
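
As a small worked example, the MLE loss for one gold response is the sum of the negative log-probabilities the model assigns to each gold token. The function name and the probability values below are made up for illustration.

```python
import math

def mle_loss(gold_token_probs):
    """Negative log-likelihood of one gold response.

    gold_token_probs[t] is the model probability of the t-th gold token,
    given the input context and the preceding gold tokens.
    """
    return -sum(math.log(p) for p in gold_token_probs)

confident = mle_loss([0.9, 0.8, 0.95])  # model agrees with the gold tokens
uncertain = mle_loss([0.3, 0.2, 0.5])   # model assigns them low probability
```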

2-3. α-blending for Retrieve and Refine(Dialogue retrieval)

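Because the relationship between the gold label and the retrieved dialogue utterance is not necessarily clear, the trained model often simply learns to ignore the retrieved utterance. To make sure the retrieved utterance is actually used, the retrieved response is replaced with the gold response α% of the time during training.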
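
A minimal sketch of α-blending when building a training input. The function name and the exact `[SEP]` string are assumptions, and `alpha` is expressed here as a fraction between 0 and 1 rather than a percentage.

```python
import random

SEP = "[SEP]"  # separator token; the exact string is an assumption

def build_training_input(context, retrieved, gold, alpha, rng=random):
    """Append the gold response instead of the retrieved one with probability alpha."""
    appended = gold if rng.random() < alpha else retrieved
    return f"{context} {SEP} {appended}"

rng = random.Random(0)
example = build_training_input(
    context="Do you have any pets?",
    retrieved="I have a dog named Max.",
    gold="Yes, two cats. How about you?",
    alpha=0.5,
    rng=rng,
)
```

Since the appended text sometimes equals the target, the generator has an incentive to attend to it instead of ignoring everything after the separator.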

2-4. Unlikelihood training for generation

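The goal is to lower the probability the language model assigns to tokens that it uses more often than the human-written data distribution does, so that they are used at an appropriate rate

$$\mathcal{L}_{UL} = -\sum_{c \in \mathcal{C}_t} \log\left(1 - p_\theta(c \mid x_{<t})\right)$$

$p_\theta$ is the language model, $x_{<t}$ are the preceding tokens, and $\mathcal{C}_t$ is the set of negative candidate tokens $c$ at step $t$
The goal is to maximize the term in parentheses inside the log of the unlikelihood loss, $1 - p_\theta(c \mid x_{<t})$, which lowers the probability of $c$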
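
A minimal sketch of the unlikelihood penalty, assuming the per-step model probabilities and the negative candidates are given. Each negative candidate token contributes a penalty of -log(1 - p), so the loss shrinks as the model assigns it less probability. The function name and the way negatives are passed in are hypothetical.

```python
import math

def unlikelihood_loss(step_probs, negative_candidates):
    """Sum of -log(1 - p(c | prefix)) over the negative candidates at each step.

    step_probs[t]          : dict mapping token -> model probability at step t
    negative_candidates[t] : tokens the model should use less often at step t
    """
    loss = 0.0
    for probs, negatives in zip(step_probs, negative_candidates):
        for c in negatives:
            p = min(probs.get(c, 0.0), 1.0 - 1e-12)  # clip to avoid log(0)
            loss -= math.log(1.0 - p)
    return loss

overused = unlikelihood_loss([{"the": 0.9}], [["the"]])   # large penalty
rare = unlikelihood_loss([{"the": 0.1}], [["the"]])       # small penalty
```

In practice this term is typically added to the likelihood loss with a weighting coefficient.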

3. Decoding

Beam search
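Top-k sampling: at each step i, the next word is sampled from the k most likely tokens under the model distribution
Minimum length: a minimum response length is enforced during generation; forcing longer responses tends to make the answers more specific
Predictive length: the length each response needs is predicted (10, 20, 30, ...) and used as the minimum length for generation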
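
Top-k sampling and the minimum-length constraint can be sketched together as below. This is an illustration only: `toy_model` is a made-up fixed distribution standing in for a real generator, the `<eos>` token name is an assumption, and the same end-token masking idea applies to beam search.

```python
import random

def top_k_sample(probs, k, rng):
    """Sample one token from the k most probable entries of `probs`."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return rng.choices(tokens, weights=weights, k=1)[0]

def enforce_min_length(probs, step, min_length, eos="<eos>"):
    """Forbid the end token until `min_length` tokens have been generated."""
    if step < min_length:
        return {t: p for t, p in probs.items() if t != eos}
    return probs

def generate(next_token_probs, k=2, min_length=3, max_length=10, seed=0, eos="<eos>"):
    rng = random.Random(seed)
    output = []
    for step in range(max_length):
        probs = enforce_min_length(next_token_probs(output), step, min_length, eos)
        token = top_k_sample(probs, k, rng)
        if token == eos:
            break
        output.append(token)
    return output

def toy_model(prefix):
    """Hypothetical model that always prefers to stop immediately."""
    return {"<eos>": 0.6, "hello": 0.25, "there": 0.1, "friend": 0.05}

response = generate(toy_model, k=2, min_length=3)
```

Without the constraint this toy model would often stop after zero tokens; with it, every response has at least three.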

4. Training data

Pretraining

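Pretraining uses Reddit comments obtained from PushShift through July 2019, amounting to 1.5B training examples

A comment and all of its subsequent child comments are removed if any of the following conditions is met.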

The author is a known bot.
It comes from a known non-English subreddit.
The comment is marked as removed / deleted.
It is longer than 2048 characters and does not contain spaces.
It is longer than 128 BPE tokens.
It is shorter than 5 characters.
It contains a URL.
It starts with a non-ASCII character.
It is further than depth 7 in the thread.
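
The filtering rules above can be sketched as a single predicate. Everything named here is an assumption made for illustration: the `Comment` fields, the two example lists, the substring-based URL check, and the whitespace token count used in place of a real BPE tokenizer. Removing the child comments of a dropped comment is not shown.

```python
from dataclasses import dataclass

@dataclass
class Comment:
    author: str
    subreddit: str
    text: str
    depth: int
    removed: bool = False

KNOWN_BOTS = {"AutoModerator"}      # hypothetical list of bot accounts
NON_ENGLISH_SUBREDDITS = {"de"}     # hypothetical list of subreddits

def whitespace_token_count(text):
    """Stand-in for a BPE token count (assumption)."""
    return len(text.split())

def should_drop(comment, count_bpe_tokens=whitespace_token_count):
    """Return True if any of the filtering conditions applies."""
    text = comment.text
    return (
        comment.author in KNOWN_BOTS
        or comment.subreddit in NON_ENGLISH_SUBREDDITS
        or comment.removed
        or (len(text) > 2048 and " " not in text)
        or count_bpe_tokens(text) > 128
        or len(text) < 5
        or "http://" in text
        or "https://" in text
        or (text != "" and not text[0].isascii())
        or comment.depth > 7
    )

kept = Comment("alice", "AskReddit", "I really like hiking in the mountains.", depth=2)
dropped = Comment("alice", "AskReddit", "see https://example.com for details", depth=2)
```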

Our final dataset contains 1.50B comments totaling 56.8B label BPE tokens and 88.8B context tokens.


Fine-tuning

The ConvAI2 dataset (Zhang et al., 2018) focuses on personality
Empathetic Dialogues (Rashkin et al., 2019) focuses on empathy
Wizard of Wikipedia (Dinan et al., 2019c) focuses on knowledge
Finally, Blended Skill Talk (Smith et al., 2020) is a dataset that focuses on blending these skills
"BST tasks" refers to training on all four of the tasks above

5. Evaluation Methods

ACUTE-Eval

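Human evaluators are shown a pair of dialogues and choose between them according to the questions below

Engagingness question: "Who would you prefer to talk to for a long conversation?"
Humanness question: "Which speaker sounds more human?"

This reduces annotator bias and can address the problem of responses that contradict earlier turns of the conversation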

Self-Chat ACUTE-Eval

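Instead of having a person hold a full conversation with the bot before running ACUTE-Eval, the bots talk to each other through self-chat and people then evaluate those conversations. This saves human resources and can be run repeatedly during parameter tuning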

Are there any related articles or issues worth reading alongside this?

(Humeau 2019) Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring

(Adiwardana 2020) Towards a human-like open-domain chatbot

(Zhang 2019) DialoGPT: Large-scale generative pre-training for conversational response generation

(Weston 2018) Retrieve and Refine: Improved Sequence Generation Models For Dialogue

(Zhang 2018) Personalizing dialogue agents: I have a dog, do you have pets too?

Please share the reference URLs! 🔗

https://parl.ai/projects/recipes/
https://medium.com/dair-ai/recipes-for-building-an-open-domain-chatbot-488e98f658a7
https://towardsdatascience.com/blender-bot-part-3-the-many-architectures-a6ebff0d75a6

modulabs / beyondBERT

Recipes for building an open-domain chatbot #16