sunnweiwei / user-satisfaction-simulation

"Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems" in SIGIR'21
33 stars 4 forks source link
dialogues user-satisfaction user-simulation

Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems

We annotated a dialogue dataset, User Satisfaction Simulation (USS), that includes 6,800 dialogues. All user utterances in those dialogues, as well as the dialogues themselves, are labeled on a 5-level satisfaction scale. See dataset.

These resources were developed as part of the following paper:

Weiwei Sun, Shuo Zhang, Krisztian Balog, Zhaochun Ren, Pengjie Ren, Zhumin Chen, Maarten de Rijke. "Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems". In SIGIR. Paper link

Data

The dataset (see dataset) is provided in TXT format. Fields within each line are separated by "\t", and sessions are separated by blank lines. A minimal loading sketch is shown below.
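
The following Python sketch assumes only what the text above states (tab-separated fields, blank-line session boundaries); the exact column order and semantics are not specified here, so each turn is kept as a raw list of fields. The file name in the usage comment is hypothetical.

```python
def load_sessions(path):
    """Read a USS data file into a list of sessions.

    Each session is a list of turns; each turn is the list of
    tab-separated fields from one line. Sessions are separated
    by blank lines.
    """
    sessions, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():  # a blank line closes the current session
                if current:
                    sessions.append(current)
                    current = []
            else:
                current.append(line.split("\t"))
    if current:  # the file may not end with a blank line
        sessions.append(current)
    return sessions

# Hypothetical usage:
# sessions = load_sessions("dataset/MultiWOZ.txt")
# print(len(sessions), "sessions;", len(sessions[0]), "turns in the first")
```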

Since the original ReDial dataset does not provide actions, we use the action annotations provided by IARD and include them in ReDial-action.txt.

The JDDC dataset provides an action for each user utterance, covering 234 categories. We compress these into 12 categories using a manually defined mapping (see JDDC-ActionList.txt), which can be applied as sketched below.
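
The sketch below shows one way to apply such a mapping in Python. The layout assumed for JDDC-ActionList.txt (one coarse category per line, followed by its fine-grained actions, tab-separated) is a guess; adjust the parsing to the released file.

```python
def load_action_map(path):
    """Return {fine_grained_action: coarse_category}.

    Assumed file layout (unverified): each line holds a coarse
    category followed by its fine-grained actions, tab-separated.
    """
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                coarse, fine_actions = fields[0], fields[1:]
                for fine in fine_actions:
                    mapping[fine] = coarse
    return mapping

# Hypothetical usage:
# action_map = load_action_map("JDDC-ActionList.txt")
# coarse = action_map.get(fine_action)  # None if an action is unmapped
```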

Data Statistics

The USS dataset is based on five benchmark task-oriented dialogue datasets: JDDC, Schema Guided Dialogue (SGD), MultiWOZ 2.1, Recommendation Dialogues (ReDial), and Coached Conversational Preference Elicitation (CCPE).

Domain        JDDC     SGD      MultiWOZ  ReDial   CCPE
Language      Chinese  English  English   English  English
#Dialogues    3,300    1,000    1,000     1,000    500
Avg# Turns    32.3     26.7     23.1      22.5     24.9
#Utterances   54,517   13,833   12,553    11,806   6,860
Rating 1      120      5        12        20       10
Rating 2      4,820    769      725       720      1,472
Rating 3      45,005   11,515   11,141    9,623    5,315
Rating 4      4,151    1,494    669       1,490    59
Rating 5      421      50       6         34       4

Baselines

The code for reproducing the baselines can be found in /baselines.

Performance for user satisfaction prediction: boldface indicates the best result in terms of the corresponding metric; underline indicates results comparable to the best. (The full results table appears in the paper.)

Performance for user action prediction: boldface indicates the best result in terms of the corresponding metric; underline indicates results comparable to the best. (The full results table appears in the paper.)
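
For evaluating predictions from the baselines, the sketch below scores predicted labels against gold labels with scikit-learn. The metrics shown (accuracy and macro-averaged F1) are illustrative stand-ins, not necessarily the exact metrics reported in the paper.

```python
from sklearn.metrics import accuracy_score, f1_score

def score(y_true, y_pred):
    """Score predicted labels against gold labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

# Hypothetical usage with 5-level satisfaction labels (1-5):
# print(score([3, 3, 4, 2], [3, 4, 4, 2]))
```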

Cite

@inproceedings{Sun:2021:SUS,
  author =    {Sun, Weiwei and Zhang, Shuo and Balog, Krisztian and Ren, Zhaochun and Ren, Pengjie and Chen, Zhumin and de Rijke, Maarten},
  title =     {Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems},
  booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series =    {SIGIR '21},
  year =      {2021},
  publisher = {ACM}
}

Contact

If you have any questions, please contact sunnweiwei@gmail.com.