orionw / FollowIR

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
https://arxiv.org/abs/2403.15246

p-MRR Formula #7

Open jq-zhou opened 3 days ago

jq-zhou commented 3 days ago

Hi, thank you for your valuable work! I have a question regarding the p-MRR formula presented in your paper.

In the paper, you mention that the normalized range for p-MRR is from the worst possible change (i.e., -1) to the best possible change (i.e., 1). However, during actual calculations, I noticed some inconsistencies: [image: p-mrr]

These results seem to contradict the normalized range you mentioned. Could you clarify whether there might be an issue with the formula, or if there's something I'm missing in my understanding of the calculation?

orionw commented 3 days ago

Hi @jq-zhou and thanks for the interest!

Definitely happy to look into this - do you mind clarifying a bit? I think my confusion is that both 3/5 and -3/5 are within the -1 to 1 range.

jq-zhou commented 3 days ago

Thank you for your response.

Based on my understanding, p-MRR should be positive for items whose rank improves after instructions are applied, and negative for items whose rank decreases. Specifically:

In this way, if the overall score approaches 1, it would indicate strong instruction-following ability, while a score approaching -1 would suggest poor instruction-following ability.

orionw commented 2 days ago

Ah, is your question about the sign of the score? For the first case, if you start at $R_{og} = 5$ and $R_{new} = 2$ (where the document is now ranked as more relevant), you would expect a negative score.

This is because the documents that were changed in FollowIR w.r.t. the new instruction are no longer relevant, so they should be ranked lower in the new ranking. If the rank goes up, the model is doing the opposite of what the instruction asks.
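To make the sign convention concrete, here is a minimal sketch of the per-document score. It assumes the piecewise form $MRR_{og}/MRR_{new} - 1$ when $R_{og} > R_{new}$ and $1 - MRR_{new}/MRR_{og}$ otherwise, with $MRR = 1/R$. This form reproduces the 3/5 and -3/5 values mentioned in this thread, but the function name `p_mrr_single` is hypothetical and not taken from the repository.

```python
from fractions import Fraction


def p_mrr_single(rank_og: int, rank_new: int) -> Fraction:
    """Per-document p-MRR under the assumed piecewise definition.

    rank_og:  1-based rank of the document with the original instruction.
    rank_new: 1-based rank of the same document with the new instruction.
    """
    if rank_og < 1 or rank_new < 1:
        raise ValueError("ranks must be positive 1-based integers")

    mrr_og = Fraction(1, rank_og)
    mrr_new = Fraction(1, rank_new)

    if rank_og > rank_new:
        # The newly non-relevant document moved UP: opposite of the
        # instruction, so the score is negative.
        return mrr_og / mrr_new - 1
    # The document moved DOWN (or stayed put): score is >= 0.
    return 1 - mrr_new / mrr_og


print(p_mrr_single(5, 2))  # -3/5
print(p_mrr_single(2, 5))  # 3/5
```

Exact fractions are used only so the values match the ones discussed above; floats would work equally well.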

If I misunderstood your question, please let me know!

jq-zhou commented 2 days ago

It seems I misunderstood your formula. I initially thought p-MRR was calculated using documents related to the instruction.

Thank you for clarifying this, and I really appreciate your help!

orionw commented 2 days ago

> It seems I misunderstood your formula. I initially thought p-MRR was calculated using documents related to the instruction.

You are right though, it is calculated using those - sorry if I was not clear. So say you have five relevant documents and two have been changed to be non-relevant (newly non-relevant) in the new instruction setting. You would loop over the newly non-relevant documents (the two) and calculate p-MRR for each one, then average over all of them for that query score (and then average over all queries for the final score).
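The averaging described above could be sketched as follows. This is an illustration rather than the repository's evaluation code: the function names and the input layout (a dict mapping each query ID to `(rank_og, rank_new)` pairs for its newly non-relevant documents) are assumptions, as is the piecewise per-document form, which was chosen because it matches the 3/5 and -3/5 values in this thread.

```python
from statistics import mean


def p_mrr_single(rank_og: int, rank_new: int) -> float:
    """Per-document score (assumed piecewise form, MRR = 1 / rank)."""
    mrr_og, mrr_new = 1.0 / rank_og, 1.0 / rank_new
    if rank_og > rank_new:
        return mrr_og / mrr_new - 1.0
    return 1.0 - mrr_new / mrr_og


def p_mrr_query(rank_pairs: list[tuple[int, int]]) -> float:
    """Average over the newly non-relevant documents of one query.

    Assumes the list is non-empty; mean() raises on an empty list.
    """
    return mean(p_mrr_single(og, new) for og, new in rank_pairs)


def p_mrr_overall(queries: dict[str, list[tuple[int, int]]]) -> float:
    """Average the per-query scores over all queries."""
    return mean(p_mrr_query(pairs) for pairs in queries.values())


# Hypothetical ranks: q1 has two newly non-relevant documents that both
# moved down (good); q2 has one that moved up (bad).
example = {
    "q1": [(2, 5), (1, 4)],
    "q2": [(5, 2)],
}
print(p_mrr_query(example["q1"]))
print(p_mrr_overall(example))
```

Only the newly non-relevant documents enter the loop; documents that stay relevant under the new instruction are not scored in this sketch.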

Definitely feel free to ask any other clarifying questions, this is also great feedback for me to update the paper to make it more clear :)