trecrts / trecrts.github.io

TREC RTS homepage
http://trecrts.github.io/

TREC2016 RTS Evaluation Metric: Treatment of "silent days" #7

Closed LuchenTan closed 8 years ago

LuchenTan commented 8 years ago

Hello all,

I'm Luchen Tan, one of the co-organizers from UWaterloo.

We are discussing the evaluation metrics for this year in this issue. Our group at UWaterloo explored various evaluation metrics for the TREC 2015 Microblog Track; see:

https://cs.uwaterloo.ca/~jimmylin/publications/Tan_etal_SIGIR2016a.pdf

In Section 3 of the paper, for a particular topic, we define “silent days” (days with no relevant tweets for the topic in the pool) and “eventful days” (days with relevant tweets for the topic in the pool). Silent days give rise to a corner case: if a system remains silent on a silent day, should we reward that behaviour? Last year, we rewarded it and gave a perfect score (one) for this case. In the paper above, we also tested assigning a score of zero instead. To distinguish the two variants, we call them ELG-1 and nCG-1 versus ELG-0 and nCG-0. Note that ELG-1 and nCG-1 are the official metrics defined in TREC 2015, while under ELG-0 and nCG-0 systems get no credit for recognizing silent days. Systems do well under ELG-1 and nCG-1 by learning when to "shut up", whereas a system with high ELG-0 and nCG-0 scores must focus on retrieving relevant tweets. ELG-1 and ELG-0 (and likewise the nCG variants) show no correlation (see Figure 1 in the paper).
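To make the distinction concrete, here is a minimal sketch (not the official evaluation script) of how the per-topic, per-day score differs between the "-1" and "-0" variants on a silent day; the actual gain computation on eventful days (relevance grades, latency discount, redundancy) is left abstract and supplied by the caller.

```python
def day_score(num_pushed, gain_on_eventful_day, is_silent_day, silent_day_reward):
    """Score one (topic, day) pair.

    silent_day_reward = 1.0 gives the ELG-1 / nCG-1 behaviour,
    silent_day_reward = 0.0 gives the ELG-0 / nCG-0 behaviour.
    """
    if is_silent_day:
        # No relevant tweets exist in the pool for this topic on this day.
        if num_pushed == 0:
            return silent_day_reward   # reward (or not) for staying quiet
        return 0.0                     # pushing anything on a silent day earns nothing
    # Eventful day: the score comes from the usual gain computation
    # (precision-like for ELG, recall-like for nCG).
    return gain_on_eventful_day

# A system that stays quiet on a silent day:
print(day_score(0, 0.0, True, 1.0))   # ELG-1 / nCG-1 -> 1.0
print(day_score(0, 0.0, True, 0.0))   # ELG-0 / nCG-0 -> 0.0
```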

We also tested a pain-gain measure, T11U, as used in the TREC 2012 Microblog Track. It can be understood as setting the gain of a relevant notification (highest relevance grade, no temporal penalty, not redundant) equal to the pain of returning two non-relevant updates.
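Only as an illustration of this pain-gain idea (the track's actual weights and normalization may differ), the scoring could look like the sketch below, where each pushed update carries a gain in [0, 1] and each non-relevant update costs a fixed pain of 0.5, so that one perfect notification offsets exactly two non-relevant ones.

```python
def pain_gain_score(gains, pain=0.5):
    """Sum gains of pushed updates; charge a fixed pain for each non-relevant one."""
    total = 0.0
    for g in gains:
        if g > 0.0:
            total += g        # reward for a (partially) relevant notification
        else:
            total -= pain     # penalty for bothering the user with a non-relevant one
    return total

print(pain_gain_score([1.0, 0.0, 0.0]))  # 0.0: one perfect push offsets two bad ones
```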

This year, we propose to report ELG-1 (the major ranking metric), ELG-0, nCG-1, nCG-0, and T11U. That is quite a lot of measurements.

Some comments also suggested that, since ELG is precision-like and nCG is recall-like, we could introduce an F1-like metric that combines the two.
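For instance, such a combination could simply be the harmonic mean of the two scores; this is only a sketch of the idea, not something the track has defined:

```python
def f1_like(elg, ncg):
    """Harmonic mean of the precision-like ELG and the recall-like nCG."""
    if elg + ncg == 0.0:
        return 0.0
    return 2.0 * elg * ncg / (elg + ncg)

print(f1_like(0.4, 0.6))  # 0.48
```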

Any comments, thoughts on this year’s evaluation metrics?

telsayed commented 8 years ago

One more thing about topic selection is that we should try to select topics for which we expect relevant tweets within the evaluation period, to avoid the problem of silent days.

LuchenTan commented 8 years ago

@telsayed Thanks for your comments. Yes, we agree that knowing when to keep silent should be rewarded. We suggest using EG-1 as our major metric, and bringing in some silent topics is a way to test whether systems are able to recognize when they should be silent.

lintool commented 8 years ago

> One more thing about topic selection is that we should try to select topics for which we expect relevant tweets within the evaluation period, to avoid the problem of silent days.

Of course we will strive to do this, but it's difficult to predict ahead of time...

salman1993 commented 8 years ago

Since the task is meant for push notifications, I think it is important to reward systems for staying quiet on a silent day, because notifications impose a burden on the user. T11U does not distinguish between eventful and silent days, so I would also suggest ELG-1 as our major metric.

LuchenTan commented 8 years ago

Thank you for your opinion, @salman1993. We agree with you and will choose EG-1 as our major metric. One change is that we won't impose an arbitrary maximum acceptable delay (e.g., 100 minutes) this year; instead, we will report latency separately. More details can be found at: http://trecrts.github.io/TREC2016-RTS-guidelines.html

KaranSabhnani commented 8 years ago

I believe that systems should be rewarded for knowing when to stay quiet. Although a user has no direct way to reward a system for staying quiet, since no relevant information is shared, the reward is justified in the sense that the system is not unnecessarily bothering the user. The user's trust in the system rests on it being comprehensive while not disturbing them with false positives. I therefore agree with EG-1 being the major metric. I also like the idea of reporting multiple measurements covering different user perspectives.

LuchenTan commented 8 years ago

As our experimental results in the paper show, the system rankings depend to some extent on which metric is used, which we may not want to see. But as you said, the metrics represent different user models, so we are suggesting to report as many as we can.

KaranSabhnani commented 8 years ago

@LuchenTan I agree that the answer to which system is better should not depend on the evaluation metric.

lintool commented 8 years ago

Guidelines finalized. Closing issue.