shrimai / Focused-Attention-Improves-Document-Grounded-Generation


About this work incorrectly using the WoW dataset #5

Closed · alexhsg closed this issue 2 years ago

alexhsg commented 2 years ago

Hi, I have followed research on document-grounded conversation for several years, and I am very interested in your work and the proposed focused attention. After studying your code and issue #2 and running related experiments myself, I have confirmed that your work produced inflated results due to incorrect use of the WoW data, which is how it achieves a unigram F1 score above 31.

Referring to issue #2, you mistakenly used the passages retrieved for the current turn (7 passages per sample) as the grounding document (d_i). Previous works such as the WoW dataset paper [1], SLKS [2], KnowledGPT [3], DiffKS [4], KIC [6], DukeNet [5], and DIALKI [7] all use the passages retrieved for the last two turns (14 passages) as their grounding document, because the current turn's passages are retrieved using the ground-truth response text. You cannot use inputs that contain golden-response information to generate your model's response; this is unfair to previous work.

Take this dialog as an example:

{
  "chosen_topic": "Science fiction",
  "persona": "i enjoy movies about aliens invading the earth.",
  "wizard_eval": 5,
  "dialog": [
    {
      "speaker": "0_Wizard",
      "text": "I think science fiction is an amazing genre for anything. Future science, technology, time travel, FTL travel, they're all such interesting concepts.",
      "checked_sentence": {
        "chosen_Science_fiction_0": "Science fiction (often shortened to SF or sci-fi) is a genre of speculative fiction, typically dealing with imaginative concepts such as futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life."
      },
      "checked_passage": {
        "chosen_topic_0_Science_fiction": "Science fiction"
      },
      "retrieved_passages": [...],
      "retrieved_topics": [
        "Hyperspace (science fiction)",
        "Science fiction",
        "History of science fiction",
        "Science fiction film",
        "Time travel",
        "List of starships in Stargate",
        "History of US science fiction and fantasy magazines to 1950"
      ]
    },
    {
      "speaker": "1_Apprentice",
      "text": "I'm a huge fan of science fiction myself! ",
      "retrieved_passages": [...],
      "retrieved_topics": [
        "Science fiction",
        "History of science fiction",
        "Isaac Asimov",
        "U.S. television science fiction",
        "History of US science fiction and fantasy magazines to 1950",
        "Starstruck (comics)",
        "LGBT themes in speculative fiction"
      ]
    },
    {
      "speaker": "0_Wizard",
      "text": "Awesome! I really love how sci-fi storytellers focus on political/social/philosophical issues that would still be around even in the future. Makes them relatable.",
      "checked_sentence": {
        "self_Science_fiction_film_1": "Science fiction films have often been used to focus on political or social issues, and to explore philosophical issues like the human condition."
      },
      "checked_passage": {
        "self_3_Science_fiction_film": "Science fiction film"
      },
      "retrieved_passages": [...],
      "retrieved_topics": [
        "Oddworld Inhabitants",
        "Legalism (Chinese philosophy)",
        "Sci-Fi on the Rock",
        "Starstruck (comics)",
        "The Spirit of the Age",
        "Science fiction film",
        "Music of the Marvel Cinematic Universe"
      ]
    }
  ]
}

To generate the current-turn response "Awesome! I really love...", we should use both the passages retrieved for the last Apprentice turn (7 passages) and the passages retrieved for the Wizard turn before it (7 passages), instead of the current-turn Wizard retrieved passages (7 passages), as the grounding document. You can study the dataset preprocessing code of KnowledGPT (lines 118-135) for details.
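To make this concrete, here is a minimal sketch of the convention I am describing, assuming the dialog JSON structure shown in the example above (the helper name is mine, not from your repo):

```python
def build_grounding_document(dialog, turn_idx):
    """Collect the 14 grounding passages for the response at `turn_idx`:
    the 7 passages retrieved for the previous turn plus the 7 passages
    retrieved for the turn before that. The current turn's own
    retrieved_passages are deliberately excluded, because they were
    retrieved using the ground-truth response text."""
    passages = []
    for prev in (turn_idx - 1, turn_idx - 2):
        if prev >= 0:
            passages.extend(dialog[prev].get("retrieved_passages", []))
    return passages
```

For the example dialog above, the response at index 2 ("Awesome! I really love...") would be grounded only in the passages of dialog[1] and dialog[0].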

In fact, I can easily reach an F1 of 31.6 on the WoW Test Seen split simply by using BART to generate responses with the current turn's retrieved passages (7 passages) and the dialogue context as inputs, which is almost the best result in your paper (DoHA, F1 31.8). However, when the input is changed to the 7 passages of the last turn, the F1 of the BART baseline drops to 21.5, which matches the results of the previous works (SLKS [2], KnowledGPT [3], DiffKS [4], KIC [6], DukeNet [5], DIALKI [7]).
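For clarity, the unigram F1 I am quoting is the standard token-overlap F1 between the generated and gold responses, roughly as below (the exact tokenization and normalization vary slightly across papers):

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """Harmonic mean of unigram precision and recall, computed over
    the multiset intersection of predicted and reference tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```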

@shrimai, please re-verify the WoW data-processing code against previous works (SLKS, KnowledGPT, DIALKI) and revise the experimental results of the paper for a fair comparison.

[1] Dinan E, Roller S, Shuster K, et al. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241, 2018.
[2] Kim B, Ahn J, Kim G. Sequential latent knowledge selection for knowledge-grounded dialogue. arXiv preprint arXiv:2002.07510, 2020.
[3] Zhao X, Wu W, Xu C, et al. Knowledge-grounded dialogue generation with pre-trained language models. arXiv preprint arXiv:2010.08824, 2020.
[4] Zheng C, Cao Y, Jiang D, et al. Difference-aware knowledge selection for knowledge-grounded conversation generation. arXiv preprint arXiv:2009.09378, 2020.
[5] Meng C, Ren P, Chen Z, et al. DukeNet: A dual knowledge interaction network for knowledge-grounded conversation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020: 1151-1160.
[6] Lin X, Jian W, He J, et al. Generating informative conversational response using recurrent knowledge-interaction and knowledge-copy. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 41-52.
[7] Wu Z, Lu B R, Hajishirzi H, et al. DIALKI: Knowledge identification in conversational systems through dialogue-document contextualization. arXiv preprint arXiv:2109.04673, 2021.

shrimai commented 2 years ago

Thank you for your interest in my work. I believe there is some confusion between the setup of WoW and the setup of this work; they are two different setups.

In our work, we don't focus on retrieving the knowledge; we focus on generating a response that is grounded in ground-truth knowledge. WoW has a two-stage approach: (1) Knowledge Retrieval: retrieve knowledge from a corpus of paragraphs. Yes, the last two turns of the dialogue are used to retrieve knowledge from the corpus. The retrieved knowledge is supposed to help in generating the response, i.e., it should contain the information on which the response can be based. Hence, the ideal scenario for this stage is that, given the last two turns, the retriever returns the golden knowledge associated with the response. (2) Utterance Prediction: given the retrieved knowledge, predict the utterance.

Notice that the retriever's errors propagate to stage 2, utterance prediction. We wanted to study grounding in generation independently, i.e., only stage 2 of the WoW setup. Hence, in our setup we assume we have a perfect retriever, an oracle that tells us which knowledge to use in response generation (hence we use the knowledge associated with the context and the current turn), and we only study whether this knowledge can be transferred into the responses. We want our documents to contain the knowledge that is to be generated, and we propose techniques to incorporate that knowledge in generation.
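Schematically, the two setups differ as follows (illustrative pseudocode only, not code from this repo; `retriever` and `generator` are placeholders):

```python
def wow_two_stage(context, corpus, retriever, generator):
    # Stage 1: retrieve knowledge using only the last two turns.
    # Any retrieval error propagates into stage 2.
    knowledge = retriever(context[-2:], corpus)
    # Stage 2: predict the utterance given the retrieved knowledge.
    return generator(context, knowledge)

def oracle_setup(context, oracle_knowledge, generator):
    # Stage 1 is replaced by an oracle: the knowledge associated with
    # the current turn is given, and we study only stage 2, i.e.,
    # whether that knowledge is transferred into the response.
    return generator(context, oracle_knowledge)
```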

Yes, with BART alone the paper reports an F1 of 31.1, so your score of 31.6 seems close enough. You should get a better F1 by using CoDR and DoHA in the same setup.

We have not compared our work with techniques that have a knowledge selection component and retrieve knowledge. We only compare with techniques that follow our setup of using the knowledge associated with the context for response generation, and that study response generation alone. Hence, the comparisons in the paper are fair.

alexhsg commented 2 years ago

Hi shrimai,

Thank you for your patient reply! I fully agree with the reasons behind your experimental setup and understand how your work differs from other methods that include a knowledge selector. Specifically, what I'm referring to is that your approach does not make a fair comparison with methods such as Low-Res [1]. In Table 1 of your paper, the F1 of Low-Res is 18.0 while your method achieves 31.8 on Test Seen. However, Low-Res has no access to the passages retrieved via the ground-truth response, whereas your method does access these golden retrieved passages and obtains an F1 of 31.8.

Low-Res focuses on the generation task without a knowledge selector, and its F1 of 18.0 is the result in the full-data setting. Low-Res compares against TMN (F1 15.9) [2] and ITDD (F1 16.2) [3]; TMN is the baseline released with the WoW dataset. These works, whether equipped with a knowledge selector or not, invariably used the passages retrieved for the last two turns as the grounding document, and their generation results show a steady continuum of improvement (TMN 15.9, ITDD 16.2, Low-Res 18.0, SLKS 19.3, DukeNet 19.3, KnowledGPT 22.0). Only this work uses the golden passages retrieved via the ground-truth response, and it achieves the best F1 (31.8).

To summarize, my point is that the huge improvement from Low-Res (18.0) to DoHA (31.8) is not primarily due to the use of BART as a decoder, nor to the proposed module, but rather to the different data-processing approaches taken by these two works. They should not have been placed directly in the same table for comparison without an accompanying note explaining this difference.

I'm sorry for taking up your time with this reply. In fact, the improvement from the BART baseline (31.1) to DoHA (31.8) is real and reliable. Again, sorry for my late reply. My question has not been resolved yet, so I think the issue should be reopened.

[1] Zhao X, Wu W, Tao C, et al. Low-resource knowledge-grounded dialogue generation. arXiv preprint arXiv:2002.10348, 2020.
[2] Dinan E, Roller S, Shuster K, et al. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241, 2018.
[3] Li Z, Niu C, Meng F, et al. Incremental transformer with deliberation decoder for document grounded conversations. arXiv preprint arXiv:1907.08854, 2019.