About the summary_pairs.json result

FYYFU commented 3 years ago

I downloaded the dae_w_syn_hallu model and used it to run the command you provided. The result is:

291.0
373.0
0.7801608579088471

That is a little different from the number reported in the original paper (Table 4, DAE: 83.9). My Stanford CoreNLP version is 3.9.1.

I also tried version 4.1.0 and got:

294.0
373.0
0.7882037533512064
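The three printed values look like (pairs judged correct, total pairs, ratio). That reading is an assumption, but it matches the arithmetic: 291 / 373 ≈ 0.7802. A minimal sketch of the gap to the paper's 83.9, with `accuracy` and `pairs_needed` as hypothetical helper names:

```python
import math

REPORTED_DAE = 83.9  # Table 4, DAE row, as quoted above (percent)

def accuracy(correct: int, total: int) -> float:
    """Percentage of pairs judged correct (assumed meaning of the printed numbers)."""
    return 100.0 * correct / total

def pairs_needed(target_pct: float, total: int) -> int:
    """Smallest number of correct pairs whose accuracy reaches target_pct."""
    return math.ceil(target_pct / 100.0 * total)

runs = {
    "CoreNLP 3.9.1": (291, 373),
    "CoreNLP 4.1.0": (294, 373),
}

if __name__ == "__main__":
    for name, (correct, total) in runs.items():
        acc = accuracy(correct, total)
        print(f"{name}: {acc:.1f}% (gap to paper: {REPORTED_DAE - acc:.1f} points)")
    print("pairs needed for 83.9%:", pairs_needed(REPORTED_DAE, 373))
```

Under this reading, the CoreNLP version accounts for only 3 pairs, while matching the paper would need roughly 20 more correct judgments, which points to a difference in the model rather than in preprocessing.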

Is there something I should be aware of?

tagoyal commented 3 years ago

Hi, the results in Table 4 (and all results from Section 6 onwards) are reported using the dae_w_syn model. Using that should help you reproduce the numbers from the paper.

FYYFU commented 3 years ago

Thanks for your reply! Using the dae_w_syn model, I reproduced the result in Table 4. In Table 3, the model trained with hallucinations (AD + S + H) outperforms AD + S on every test set except the AD test set. Could you explain why you chose AD + S?

Also, if I want to use this tool to compare the hallucination and factuality of two summaries, should I use the AD + S + H model?

tagoyal commented 3 years ago

Sorry for the delayed response. The AD + S model was chosen based on our expectation of the data being evaluated. The full AD + S + H model is useful if you expect that the (input, generation) pairs will be unrelated in terms of the events and actors being discussed.

For paraphrasing or (one-sentence input, summary) tasks (like the ones in the paper), I would suggest using the AD + S model.

For (document, summary) pairs, I would suggest using the AD + S + H model. Note that the maximum input length allowed by this model is 128, so you may have to segment the document into sentences to be able to use this. Alternatively, you can use models from our follow-up work that operate at the document level: https://arxiv.org/pdf/2104.04302.pdf
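A minimal sketch of the model choice and the sentence segmentation described above. Several parts are assumptions: that dae_w_syn is the AD + S model and dae_w_syn_hallu the AD + S + H model (inferred from the names used in this thread), the naive regex sentence splitter, and the whitespace token count standing in for the model's own subword tokenizer. `choose_model`, `split_sentences`, and `pack_sentences` are hypothetical helpers, not part of the repository. `pack_sentences` is a variant that groups consecutive sentences into chunks under a token budget instead of scoring one sentence at a time.

```python
import re
from typing import Callable, List

def choose_model(num_source_sentences: int) -> str:
    # Inferred mapping: dae_w_syn = AD + S, dae_w_syn_hallu = AD + S + H.
    return "dae_w_syn" if num_source_sentences <= 1 else "dae_w_syn_hallu"

def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ".", "!" or "?" followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def whitespace_len(text: str) -> int:
    # Stand-in token count; a subword tokenizer usually yields more tokens.
    return len(text.split())

def pack_sentences(sentences: List[str], budget: int,
                   count_tokens: Callable[[str], int] = whitespace_len) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most `budget` tokens.

    A single sentence longer than `budget` becomes its own chunk and
    would still need truncation downstream.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sent)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

if __name__ == "__main__":
    MAX_LEN = 128  # maximum input length mentioned above
    summary = "The company reported record profits."
    document = "Profits rose sharply. The CEO was pleased! Will it last? Analysts doubt it."
    sentences = split_sentences(document)
    # Hypothetical reserve: the summary plus a few special/separator tokens.
    budget = MAX_LEN - whitespace_len(summary) - 3
    print("model:", choose_model(len(sentences)))
    for chunk in pack_sentences(sentences, budget):
        print((chunk, summary))
```

How to combine the per-chunk scores into one document-level judgment is not specified here, so that step is left open.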

FYYFU commented 3 years ago

Thanks for your reply! I will read this paper carefully.

tagoyal/dae-factuality, issue #3