Why do you use partial match max metric for QA

nvtransfer / RULER

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?

Apache License 2.0


Why do you use partial match max metric for QA #15

Closed: vkaul11 closed this issue 4 months ago

vkaul11 commented 5 months ago

Just wanted to know why we have https://github.com/hsiehjackson/RULER/blob/main/scripts/eval/synthetic/constants.py#L25. Why is this different from string_match_all for QA specifically? Basically, if the prediction matches any one of the references, it is OK? I didn't quite understand this.

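def string_match_part(preds, refs):
    # A prediction scores 1.0 if it contains at least one of its references (case-insensitive);
    # the final score is the mean over all predictions, as a percentage.
    score = sum([max([1.0 if r.lower() in pred.lower() else 0.0 for r in ref]) for pred, ref in zip(preds, refs)]) / len(preds) * 100
    return round(score, 2)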

hsiehjackson commented 5 months ago

string_match_part gives a 100% score when the prediction matches any one of the references; string_match_all requires the prediction to match all of the references to get a 100% score. We use string_match_part in QA tasks because most of the references are paraphrases of the same answer, so matching one of them is enough.
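To make the difference concrete, here is a small sketch comparing the two metrics on one QA-style example. string_match_part is the function quoted in the question; string_match_all below is an assumption, written from the description in the reply above (the fraction of references found in each prediction) rather than copied from the repository, and the sample prediction and references are made up for illustration.

```python
def string_match_part(preds, refs):
    # 1.0 per prediction if ANY reference appears in it (case-insensitive).
    score = sum(
        max(1.0 if r.lower() in pred.lower() else 0.0 for r in ref)
        for pred, ref in zip(preds, refs)
    ) / len(preds) * 100
    return round(score, 2)


def string_match_all(preds, refs):
    # Assumed implementation: fraction of references that appear in the
    # prediction, so 100% requires every reference to match.
    score = sum(
        sum(1.0 if r.lower() in pred.lower() else 0.0 for r in ref) / len(ref)
        for pred, ref in zip(preds, refs)
    ) / len(preds) * 100
    return round(score, 2)


# Hypothetical QA example: two paraphrased references for one question.
preds = ["The capital is Paris."]
refs = [["Paris", "the city of Paris"]]

print(string_match_part(preds, refs))  # 100.0: one paraphrase matched
print(string_match_all(preds, refs))   # 50.0: only one of two paraphrases matched
```

With paraphrased references, a correct answer rarely contains every paraphrase verbatim, so an all-references metric would penalize it even though it is right.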