modulabs / beyondBERT

This repository collects the paper-discussion notes from beyondBERT season 11.5.

A Simple Language Model for Task-Oriented Dialogue #18

Closed. seopbo closed this issue 4 years ago.

seopbo commented 4 years ago

What is this paper about? 👋

It proposes SimpleTOD, an end-to-end task-oriented dialogue (TOD) model.

Abstract 🕵🏻‍♂️

Task-oriented dialogue is often decomposed into three tasks: understanding user input, deciding actions, and generating a response. While such decomposition might suggest a dedicated model for each sub-task, we find a simple, unified approach leads to state-of-the-art performance on the MultiWOZ dataset. SimpleTOD is a simple approach to task-oriented dialogue that uses a single, causal language model trained on all sub-tasks recast as a single sequence prediction problem. This allows SimpleTOD to fully leverage transfer learning from pre-trained, open domain, causal language models such as GPT-2. SimpleTOD improves over the prior state-of-the-art in joint goal accuracy for dialogue state tracking, and our analysis reveals robustness to noisy annotations in this setting. SimpleTOD also improves the main metrics used to evaluate action decisions and response generation in an end-to-end setting: inform rate by 8.1 points, success rate by 9.7 points, and combined score by 7.2 points.

What can we learn from reading this paper? 🤔

1) Concepts behind traditional dialogue systems

Open-Domain Dialogue Systems (chit-chat) vs. Task-Oriented Dialogue Systems

Components of a TOD system

  1. Understanding user input
  2. Deciding actions
  3. Generating a response

Traditionally, each TOD component has been trained independently: the NLU module targets domain and intent labels, the DM targets dialogue belief and dialogue act labels, and the NLG module targets templatized or natural responses.

Problem: dependencies between components in the pipeline can propagate errors. For example, many systems do not consider the dialogue history at every turn, and instead rely on the NLU module to pass belief states on to the subsequent stages.

Solution: end-to-end task-oriented dialogue plus unsupervised pre-training.

2) SimpleTOD architecture

SimpleTOD generates all outputs with a single, general-purpose language model, conditioned on the dialogue context and DB search results.

See the figure on the blog: https://blog.einstein.ai/simpletod/

3.1 Task-Oriented Dialogue

3.2 Causal Language Modeling

A standard causal language model is used. A single training sequence is the concatenation x^t = [C_t; B_t; D_t; A_t; S_t] (dialogue context, belief state, DB results, action, and system response for turn t). Training minimizes the negative log-likelihood over the dataset D = {x^1, ..., x^{|D|}}.
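A minimal sketch of this objective, assuming illustrative delimiter tokens (the exact special tokens live in the official salesforce/simpletod repo):

```python
# A minimal sketch, not the official SimpleTOD code: flatten one dialogue
# turn into the single training sequence x_t = [C_t; B_t; D_t; A_t; S_t]
# and compute the causal LM negative log-likelihood.
import torch.nn.functional as F

def build_sequence(context, belief, db_results, action, response):
    """Concatenate all sub-task segments into one flat training sequence.
    Delimiter token names here are illustrative assumptions."""
    return (
        f"<|context|> {context} <|endofcontext|> "
        f"<|belief|> {belief} <|endofbelief|> "
        f"<|db|> {db_results} <|endofdb|> "
        f"<|action|> {action} <|endofaction|> "
        f"<|response|> {response} <|endofresponse|>"
    )

def nll_loss(logits, input_ids):
    """Next-token cross-entropy: predict token i+1 from tokens 0..i."""
    shift_logits = logits[:, :-1, :]   # scores for positions 0..n-2
    shift_labels = input_ids[:, 1:]    # targets: the following tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```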

3.3 Architecture

Variant of the Transformer

A sequence containing n tokens is embedded as a sequence of n vectors in Rd. Each vector is the sum of a learned token embedding and a sinusoidal positional embedding. The sequence of vectors is stacked into a matrix X0 ∈ Rn×d and processed by l attention layers. The ith layer consists of two blocks, each preserving model dimension d. The first block uses multi-head attention with k heads. A causal mask precludes attending to future tokens:
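Following the standard causal-attention formulation (the per-head projections $W_j^1, W_j^2, W_j^3$ and the output projection $W_o$ are assumed from context):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{\mathrm{mask}(QK^\top)}{\sqrt{d}}\right)V
$$

$$
\mathrm{MultiHead}(X_i, k) = [h_1; \cdots; h_k]\,W_o, \qquad h_j = \mathrm{Attention}(X_i W_j^1,\, X_i W_j^2,\, X_i W_j^3)
$$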

The second block uses a feedforward network with ReLU activation that projects inputs to an inner dimension f. This operation is parameterized by U ∈ Rd×f and V ∈ Rf×d:
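In symbols, with the ReLU written as a max:

$$
\mathrm{FF}(X) = \max(0, XU)\,V
$$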

Each block precedes core functionality with layer normalization and follows it with a residual connection. Together, they yield Xi+1:
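Written out as pre-norm residual blocks (the standard GPT-2/CTRL-style formulation):

$$
\bar{X}_i = \mathrm{MultiHead}(\mathrm{LayerNorm}(X_i)) + X_i
$$

$$
X_{i+1} = \mathrm{FF}(\mathrm{LayerNorm}(\bar{X}_i)) + \bar{X}_i
$$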

Scores are then computed from the output of the last layer:
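That is, a final layer normalization followed by a projection to vocabulary logits (a projection matrix $W_{\mathrm{vocab}}$ is assumed):

$$
\mathrm{Scores}(X_0) = \mathrm{LayerNorm}(X_l)\,W_{\mathrm{vocab}}
$$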

During training, these scores are the inputs of a cross-entropy loss function. During generation, the scores corresponding to the final token are normalized with a softmax, yielding a distribution for sampling a new token.
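A minimal sketch of that generation step, assuming a Huggingface-style model whose output carries `.logits` of shape (batch, seq_len, vocab_size):

```python
import torch

@torch.no_grad()
def sample_next_token(model, input_ids):
    """One decoding step: softmax over the final position's scores, then sample."""
    logits = model(input_ids).logits     # (batch, seq_len, vocab_size)
    final_scores = logits[:, -1, :]      # scores of the last token only
    probs = torch.softmax(final_scores, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # next token ids
```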

3) Training details

- Tokenization: pretrained BPE codes
- Pre-trained model: DistilGPT2
- Hyperparameters: default hyperparameters for GPT-2 and DistilGPT2 in Huggingface Transformers (see the sketch below)
- Length: sequences longer than 1024 tokens are truncated.
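A sketch of that setup with Huggingface Transformers; the toy sequence and its delimiter tokens are illustrative, and the fine-tuning loop is reduced to a single step:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# DistilGPT2 with its pretrained BPE tokenizer and default hyperparameters.
tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

# Illustrative flattened turn; not taken from the actual dataset.
sequence = (
    "<|context|> <user> i need a cheap hotel <|endofcontext|> "
    "<|belief|> hotel pricerange cheap <|endofbelief|> "
    "<|db|> hotel 3 matches <|endofdb|> "
    "<|action|> hotel inform choice <|endofaction|> "
    "<|response|> there are 3 cheap hotels . <|endofresponse|>"
)

# Sequences longer than 1024 tokens are truncated.
batch = tokenizer(sequence, truncation=True, max_length=1024, return_tensors="pt")

# Passing labels makes the model compute the causal LM cross-entropy itself.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```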

4) Training data: MultiWOZ

- Multi-domain Wizard-of-Oz (MultiWOZ): human-to-human dialogues
- Scale: 10,438 multi-turn dialogues with 13.68 turns on average
- Domains: 7 in total (restaurant, train, attraction, hotel, taxi, hospital, police)
- Note: the police and hospital domains are excluded from evaluation, since they do not have valid/test splits. This leaves 30 domain-slot pairs for the remaining five domains, with 4,500 possible values.

5) SimpleTOD evaluation and results

- Sub-tasks: dialogue state (belief state) tracking, dialogue management (action/decision prediction), and response generation. The MultiWOZ guidance is followed for all individual metrics, following Mehri et al.
- Joint goal accuracy: measures dialogue state tracking (i.e., belief state tracking) performance by comparing the generated belief states against the oracle belief states; credit is given only when the model output exactly matches the oracle values.
- Inform rate: how often the entities provided by the system are correct.
- Success rate: how often the system is able to answer all the attributes requested by the user.
- BLEU score: the fluency of the generated responses.
- Combined score: for action and response generation, computed as BLEU + 0.5 * (Inform + Success) (a quick numeric check follows below).
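A quick arithmetic check of the combined-score formula, with hypothetical metric values rather than the paper's reported numbers:

```python
# Combined score = BLEU + 0.5 * (Inform + Success); values are hypothetical.
bleu, inform, success = 15.0, 85.0, 70.0
combined = bleu + 0.5 * (inform + success)
print(combined)  # 92.5
```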

Are there any related articles or issues worth reading alongside this paper?

SimpleTOD GitHub: https://github.com/salesforce/simpletod

SimpleTOD blog: https://blog.einstein.ai/simpletod/

MultiWOZ GitHub: https://github.com/budzianowski/multiwoz

An overview of neural-network-based dialogue systems [95 pages, haha]: J. Gao, M. Galley, and L. Li. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval, 13(2-3):127–298, 2019. https://arxiv.org/pdf/1809.08267.pdf

Discourse and pragmatics research in semantics: Austin (1962) and Searle (1969), speech act theory (locutionary, illocutionary, and perlocutionary acts); Paul Grice (1975), the Cooperative Principle and the conversational maxims (quantity, quality, relation, manner); Levinson (1983), Pragmatics.

Materials on Korean dialogue systems (slides by Prof. Jungyun Seo, Sogang University): http://sigai.or.kr/workshop/AI-for-everyone/2017/slides/대화-인터페이스-구현에-관련된-자연어-처리와-인공지능-기술-이야기.pdf

Please share the reference URLs! 🔗

Chatbot architecture diagram: https://hijigoo.github.io/nlp/2020/05/16/dialog-system-01/

Open-domain chit-chat vs. TOD diagram: https://arxiv.org/pdf/1709.10217.pdf