nouhadziri / DialogEntailment

The implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment"
https://arxiv.org/abs/1904.03371
MIT License

This repository hosts the implementation of the paper "Evaluating Coherence in Dialogue Systems using Entailment", published in NAACL'19.

DialogEntailment

DialogEntailment is a microframework to automatically evaluate coherence in dialogue systems. Our implementation includes the following metrics:

Note that the paper reports semantic similarity as a distance, so the code names that metric SemanticDistance (i.e., the lower the better). We also provide SemanticSimilarity, which computes the actual similarity.
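The relation between the two metric names can be illustrated with a small sketch. It assumes cosine similarity over sentence embeddings and a distance defined as one minus that similarity. Both the function names and this definition are assumptions for illustration, and the repository's SemanticDistance and SemanticSimilarity classes may compute the values differently.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (higher is better)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed distance: 1 - cosine similarity (lower is better)."""
    return 1.0 - cosine_similarity(u, v)
```

Under this assumption, a response whose embedding points in the same direction as the history embedding gets a distance near 0, and an orthogonal one gets a distance near 1.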

Installation

DialogEntailment is shipped as a Python package and can be installed using pip:

git clone git@github.com:nouhadziri/DialogEntailment.git
cd DialogEntailment
pip install -e .
python -m spacy link en_core_web_lg en

Dependencies

Dataset

We built a synthesized entailment corpus, namely InferConvAI, from the ConvAI dialogue data [Zhang et al., 2018], as described in detail in the paper. The dataset is formatted in both tsv (similar to MultiNLI) and jsonl (following SNLI). To download InferConvAI, please use the following links:

Check out convai_to_nli.py to see how the synthesized inference data is generated from the utterances.
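For readers who want to inspect the jsonl variant, here is a minimal reading sketch. It assumes the files use the SNLI field names sentence1, sentence2, and gold_label; the actual InferConvAI keys may differ, and read_nli_jsonl is a hypothetical helper, not part of the package.

```python
import json
from typing import Iterable, Iterator, Tuple


def read_nli_jsonl(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (premise, hypothesis, label) triples from SNLI-style jsonl lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Field names follow SNLI; this is an assumption about InferConvAI.
        yield record["sentence1"], record["sentence2"], record["gold_label"]
```

The function accepts any iterable of strings, so it can be given an open file handle directly.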

Train an Entailment model

We adopt two prominent models that have shown promising results in commonsense reasoning:

Visualization

You may run the dialogentail module to replicate the plots provided in the paper:

python -m dialogentail --bert_dir <BERT_DIR> --esim_model <ESIM_MODEL> [--plots_dir <DIR>]

For the ESIM model, you need to pass the model.tar.gz file, which allennlp generates in the model directory once training is finished.

Note that loading the BERT model and the ESIM model in the same process requires a large amount of memory, so we recommend running the above command separately for each model.
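One way to follow this recommendation is to launch one process per model. The sketch below assumes that --bert_dir and --esim_model can each be passed on their own, which the recommendation implies but this README does not state outright. build_commands and run_separately are hypothetical helpers.

```python
import subprocess
import sys
from typing import List, Optional


def build_commands(bert_dir: Optional[str],
                   esim_model: Optional[str],
                   plots_dir: Optional[str] = None) -> List[List[str]]:
    """Build one dialogentail command per model so each runs in its own process."""
    commands = []
    for flag, value in (("--bert_dir", bert_dir), ("--esim_model", esim_model)):
        if value is None:
            continue  # skip models that were not provided
        cmd = [sys.executable, "-m", "dialogentail", flag, value]
        if plots_dir is not None:
            cmd += ["--plots_dir", plots_dir]
        commands.append(cmd)
    return commands


def run_separately(commands: List[List[str]]) -> None:
    """Run the commands one after another; memory is released between runs."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling run_separately(build_commands("<BERT_DIR>", "<ESIM_MODEL>", "<DIR>")) with real paths would then evaluate the two models in sequence.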

Custom Test Data

The default test data consists of 150 dialogues drawn from Reddit (used in THRED for human evaluation). We also provide a 150-dialogue test set from OpenSubtitles. You can change the test data with the --response_file argument. To use our OpenSubtitles data, simply pass --response_file opensubtitles. For your own test data, each test sample should use the following format (see our Reddit data for more information):

Line N: TAB-separated utterances in the conversation history
Line N+1: the ground-truth response
Line N+2: Response generated by Method_1
Line N+3: Response generated by Method_2
...
Line N+m+1: Response generated by Method_m  
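The layout above can be checked with a small parser before running the evaluation. This is a sketch that assumes samples follow each other with no blank lines in between; Sample and parse_response_lines are hypothetical names and are not part of the package.

```python
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    history: List[str]         # utterances in the conversation history
    ground_truth: str          # the ground-truth response
    responses: Dict[str, str]  # method name -> generated response


def parse_response_lines(lines: List[str],
                         generator_types: List[str]) -> List[Sample]:
    """Group lines into samples of history, ground truth, and m responses."""
    block = len(generator_types) + 2  # history + ground truth + m responses
    rows = [line.rstrip("\n") for line in lines]
    if len(rows) % block != 0:
        raise ValueError(
            f"expected a multiple of {block} lines, got {len(rows)}")
    samples = []
    for start in range(0, len(rows), block):
        history = rows[start].split("\t")
        ground_truth = rows[start + 1]
        generated = rows[start + 2:start + block]
        samples.append(
            Sample(history, ground_truth, dict(zip(generator_types, generated))))
    return samples
```

A file that raises ValueError here has a line count that does not match the number of generator names, which usually means a missing or extra response line.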

Run the program with the following arguments:

    --response_file     Path to your test file
    --generator_types   The names of 'm' generative models

By default, the program evaluates the following m=4 models:

Correlation with Human Judgment

To measure the correlation with human judgment, you need to provide a pickle file containing the mean evaluation ratings of your human judges. More precisely, the pickle file consists of a Python list of triples ('Method_i', sample_index, mean_rate). If you have m generative models and N test samples, the list has N * m entries:

[('Method_1', 1, 2.1), ('Method_2', 1, 3.4), ..., ('Method_m', 1, 2.6), ('Method_1', 2, 0.2), ...]
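A list in this shape could be produced from raw per-judge ratings roughly as follows. The sketch assumes sample indices start at 1 and entries are ordered by sample and then by method, as in the example above; build_judgments and save_judgments are hypothetical helpers, and the input layout is chosen for illustration only.

```python
import pickle
from statistics import mean
from typing import Dict, List, Tuple


def build_judgments(
        ratings: Dict[str, List[List[float]]]) -> List[Tuple[str, int, float]]:
    """Turn {method: [judge ratings per sample]} into (method, index, mean) triples."""
    methods = list(ratings)
    num_samples = len(ratings[methods[0]])
    if any(len(ratings[m]) != num_samples for m in methods):
        raise ValueError("every method needs ratings for the same samples")
    triples = []
    for index in range(num_samples):
        for method in methods:
            # Sample indices are 1-based here, matching the example list.
            triples.append((method, index + 1, mean(ratings[method][index])))
    return triples


def save_judgments(triples: List[Tuple[str, int, float]], path: str) -> None:
    """Write the list of triples to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(triples, handle)
```

With m methods and N samples, build_judgments returns N * m triples.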

To pass your own human judgment file, use --human_judgment <PATH_TO_PICKLE_FILE>. For the OpenSubtitles test data, you may simply set the argument to opensubtitles to use the provided human judgment.

Citation

Please cite the following paper if you used our work in your research:

@inproceedings{dziri-etal-2019-evaluating,
    title = "Evaluating Coherence in Dialogue Systems using Entailment",
    author = "Dziri, Nouha  and
      Kamalloo, Ehsan  and
      Mathewson, Kory  and
      Zaiane, Osmar",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N19-1381",
    doi = "10.18653/v1/N19-1381",
    pages = "3806--3812",
}