lm-pub-quiz / lm-pub-quiz

Evaluate language models using multiple choice items
https://lm-pub-quiz.github.io
MIT License

Additional result features #4

Closed plonerma closed 2 months ago

plonerma commented 3 months ago

This PR improves interaction with dataset and result objects. In particular, it adds support for accumulating relation scores by predetermined relation types (e.g. domains). The annotation feature allows annotations to be distributed independently of the dataset itself.

Instance Table for the Complete Dataset

This feature allows the user to easily produce a single DataFrame with the predictions for the entire dataset (instead of one table per relation):

from lm_pub_quiz import DatasetResults

results = DatasetResults.from_path("tests/test_data/new_style_results")
print(results.joined_instance_table())
                   sub_id        sub_label  answer_idx                                        pll_scores obj_id        obj_label
relation  instance                                                                                                              
example_1 0           xyz     the traveler           0  [-39.4665284157, -41.3839921951, -40.8367753029]    zyx     the souvenir
          1           abc  the sports team           1  [-43.2029886246, -38.8553466797, -52.4907488823]    cba     the football
          2           pou      the surgeon           2  [-37.6923413277, -42.2200908661, -31.9606842995]    uop      the scalpel
example_2 0           biq             meat           0  [-12.1812758446, -17.5053930283, -15.8877744675]    dtz         the lion
          1           myn           a bone           0  [-23.4390444756, -27.9115614891, -27.5443906784]    dtz         the lion
          2           ejy            candy           1  [-19.3051533699, -15.7057299614, -24.0674858093]    irw          the kid
          3           sgq    a green apple           1  [-19.5387759209, -19.5001821518, -23.1424221992]    irw          the kid
          4           jpl           a leaf           2  [-25.6837468147, -27.8830237389, -19.3651065826]    zte  the caterpillar
          5           pmz          a plant           2   [-21.2618765831, -26.721777916, -17.3190736771]    zte  the caterpillar
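
For readers unfamiliar with this kind of table, here is a minimal pandas sketch of how per-relation instance tables could be stacked into one frame with a (relation, instance) index. It is not the lm-pub-quiz implementation; the `tables` dict is a hypothetical stand-in that reuses a few values from the output above.

```python
import pandas as pd

# Hypothetical per-relation instance tables (a subset of the columns shown above).
tables = {
    "example_1": pd.DataFrame(
        {"sub_label": ["the traveler", "the sports team"], "answer_idx": [0, 1]}
    ),
    "example_2": pd.DataFrame(
        {"sub_label": ["meat", "a bone"], "answer_idx": [0, 0]}
    ),
}

# Passing a dict to pd.concat uses its keys as the outer index level;
# `names` labels the two resulting index levels.
joined = pd.concat(tables, names=["relation", "instance"])
print(joined)
```

Individual rows can then be addressed by relation and instance number, e.g. `joined.loc[("example_2", 1)]`.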

Accumulation based on Annotations

This feature (1) enables the annotation of relations with tags (e.g. domains) and (2) implements the accumulation of metrics based on these tags:

results = DatasetResults.from_path("tests/test_data/new_style_results_with_mistakes")

results.update_relation_info({"example_1": {"domain": ("a", "b", "c")}, "example_2": {"domain": ("b", "d")}})

results.get_metrics(["accuracy"], accumulate="domain", explode=True)
        accuracy  support
domain                   
a       1.000000      3.0
b       0.666667      9.0
c       1.000000      3.0
d       0.500000      6.0
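
The numbers above are consistent with a support-weighted (micro) average over all relations that carry a given tag, where a relation with several tags counts once per tag. The pandas sketch below illustrates that reading. It is an assumption about the semantics, not the library's code, and the per-relation accuracies (3/3 and 3/6) are inferred from the printed output rather than taken from the test data.

```python
import pandas as pd

# Per-relation metrics; accuracies are inferred from the output above (assumption).
per_relation = pd.DataFrame(
    {
        "accuracy": [1.0, 0.5],
        "support": [3, 6],
        "domain": [["a", "b", "c"], ["b", "d"]],
    },
    index=pd.Index(["example_1", "example_2"], name="relation"),
)

# "Explode": one row per (relation, domain) pair.
exploded = per_relation.explode("domain")

# Recover the number of correct instances so the average is weighted by support.
exploded["correct"] = exploded["accuracy"] * exploded["support"]

grouped = exploded.groupby("domain")[["correct", "support"]].sum()
grouped["accuracy"] = grouped["correct"] / grouped["support"]
accumulated = grouped[["accuracy", "support"]]
print(accumulated)
```

Under these assumptions, the sketch yields the same accuracies and supports per domain as the table above. For example, domain `b` pools 3 + 6 = 9 instances, of which 3 + 3 = 6 are correct.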

Membership Test

This feature allows the user to test whether a relation (i.e. a relation code) is part of the dataset:

from lm_pub_quiz import Dataset

bear = Dataset.from_name("bear")

print("P6" in bear)  # True
print("P0" in bear)  # False
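
In Python, the `in` operator on a custom object is usually supported by defining `__contains__`. The toy class below sketches that mechanism. `ToyDataset` and its `relation_codes` attribute are hypothetical and do not reflect how `Dataset` is implemented in lm-pub-quiz.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that holds a set of relation codes."""

    def __init__(self, relation_codes):
        # Store the codes in a set for constant-time membership checks.
        self.relation_codes = set(relation_codes)

    def __contains__(self, relation_code):
        # Called by Python when evaluating `relation_code in dataset`.
        return relation_code in self.relation_codes


toy = ToyDataset(["P6", "P19", "P20"])
print("P6" in toy)
print("P0" in toy)
```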