Reference order changes output scoring

danieldeutsch commented 4 years ago

Hi,

I think there is a bug in which the order of the references changes the final score of the peer summary. Here is a concrete example of input documents:

Raw/peers/0: Dan bought scones today.

Raw/model/A: Dan went to buy scones at the store.

Raw/model/B: Dan bought scones today. He also went to the bakery.

The output from this is [{'0': 2}, {'0': 1}, {'0': 0}, {'0': 0.5}]. However, if I change Raw/model/A to be called Raw/model/C, the output is [{'0': 2}, {'0': 1.0}, {'0': 1.0}, {'0': 1.0}]. My expectation is that the order of the reference summaries shouldn't matter.

Thanks!

serenayj commented 4 years ago

Hi Daniel,

Thanks for raising this issue. The order of reference summaries in the Raw/model folder does not affect the overall quality of the resulting pyramid (in terms of ranking the target summaries or systems), but it does affect what the pyramid looks like and which summary content units it contains. PyrEval reads the reference summaries in a default order, then builds and traverses a content graph, so propositions can be allocated to summary content units differently when the order changes, depending on which propositions it sees first. The process is emergent by nature: when building a pyramid, human annotators read a reference, find its propositions, decide which summary content unit each one belongs to, and then move on to the next reference. Annotators might read and pick out propositions in a different order, so the resulting pyramid and scores will look different, but the ranking will be the same, as many previous empirical studies have shown. For PyrEval, this is shown by the high correlation of its rankings with those from manual pyramids.

From what you mention, I think we could improve PyrEval by adding an option to shuffle the order of the reference summaries, build a different pyramid for each order, and report the average score across these pyramids.
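
A minimal sketch of that shuffle-and-average idea, for illustration only. It assumes a hypothetical `score_fn` that takes the reference names in a given order and returns a `{peer_id: score}` dict; this is not PyrEval's actual API, and `average_over_orderings`, `max_orderings`, and `toy_score_fn` are names made up here. With few references it tries every ordering, otherwise it samples a fixed number of random ones.

```python
import itertools
import math
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: reference names in some order -> {peer_id: score}.
# This is an assumption for the sketch, NOT PyrEval's real interface.
ScoreFn = Callable[[Sequence[str]], Dict[str, float]]


def average_over_orderings(
    references: Sequence[str],
    score_fn: ScoreFn,
    max_orderings: int = 24,
    seed: int = 0,
) -> Dict[str, float]:
    """Average each peer's score over several reference orderings."""
    # Sort first so the result does not depend on how the caller listed them.
    refs = sorted(references)
    if math.factorial(len(refs)) <= max_orderings:
        # Few references: enumerate every possible ordering.
        orderings: List[List[str]] = [list(p) for p in itertools.permutations(refs)]
    else:
        # Many references: draw a fixed number of random orderings.
        rng = random.Random(seed)
        orderings = [rng.sample(refs, len(refs)) for _ in range(max_orderings)]

    per_peer: Dict[str, List[float]] = {}
    for order in orderings:
        for peer, score in score_fn(order).items():
            per_peer.setdefault(peer, []).append(score)
    return {peer: mean(scores) for peer, scores in per_peer.items()}


def toy_score_fn(order: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in that is deliberately order-sensitive."""
    return {"0": 1.0 if order[0] == "A" else 0.5}


if __name__ == "__main__":
    print(average_over_orderings(["A", "B"], toy_score_fn))  # {'0': 0.75}
```

With the toy scorer, listing the references as `["B", "A"]` gives the same averaged result, which is the property being asked for. The cost is one full pyramid build per ordering.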

rpassonneau commented 4 years ago

Technically, the order of input should not matter. Human annotators are not instructed to go by the propositions they see first; they are instructed to go by the groupings that make the most sense, and to iterate and change their minds as they construct the pyramid. Humans are potentially biased by the order in which they read the summaries, but the annotation goal is for annotators to overcome this bias.

The possible order bias is a separate issue from individual differences between annotators, and from the fact that different human-constructed pyramids tend to produce the same groupings of words, as shown by interannotator agreement, and the same correlations of scores.

Yanjun, is there a way to make order not matter in PyrEval?

Becky
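
One partial way to address the renaming case from the original report, sketched under assumptions: order the reference files by a hash of their content rather than by file name, so that renaming Raw/model/A to Raw/model/C cannot change the order in which references are read. The helper `canonical_reference_order` is hypothetical and not part of PyrEval, and this only fixes which order is used; it does not remove the order sensitivity of the pyramid construction itself.

```python
import hashlib
from pathlib import Path
from typing import List


def canonical_reference_order(model_dir: str) -> List[Path]:
    """Return the reference files sorted by content hash, not by name.

    Hypothetical helper for illustration; PyrEval does not provide it.
    """
    files = [p for p in Path(model_dir).iterdir() if p.is_file()]
    # The file name is only a tie-breaker for byte-identical references.
    return sorted(
        files,
        key=lambda p: (hashlib.sha256(p.read_bytes()).hexdigest(), p.name),
    )
```

The returned list could then be passed to whatever step reads the references, assuming that step accepts an explicit order.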

serenayj / PyrEval

Reference order changes output scoring #7