snowblink14 / smatch

Smatch tool: evaluation of AMR semantic structures
MIT License
63 stars 27 forks source link

Smatch is non-deterministic and does not yield score=1 for the same input/output graph #43

Open BramVanroy opened 1 year ago

BramVanroy commented 1 year ago

I found two quirks in the following example.

  1. even though the smatch score is calculated between the same graph, the f1 score is not 1 (far from it)
  2. the scores are non-deterministic. Sometimes the score is 0.9, then 0.92, then 0.87, etc.

Perhaps something is wrong with my calculate_smatch function, but I do not think so. (It is modified from score_amr_pairs.)

from typing import List

import smatch

def calculate_smatch(refs_penman: List[str], preds_penman: List[str]):
    total_match_num = total_test_num = total_gold_num = 0
    n_invalid = 0

    for sentid, (ref_penman, pred_penman) in enumerate(zip(refs_penman, preds_penman), 1):
        best_match_num, test_triple_num, gold_triple_num = smatch.get_amr_match(
            ref_penman, pred_penman, sent_num=sentid
        )

        total_match_num += best_match_num
        total_test_num += test_triple_num
        total_gold_num += gold_triple_num
        # clear the matching triple dictionary for the next AMR pair
        smatch.match_triple_dict.clear()

    score = smatch.compute_f(total_match_num, total_test_num, total_gold_num)

    return {
        "smatch_precision": score[0],
        "smatch_recall": score[1],
        "smatch_fscore": score[2],
        "ratio_invalid_amrs": n_invalid / len(preds_penman) * 100,
    }

s = """(r / result-01
   :ARG1 (c / compete-01
            :ARG0 (w / woman)
            :mod (p / preliminary)
            :time (t / today)
            :mod (p2 / polo
                     :mod (w2 / water)))
   :ARG2 (a / and
            :op1 (d / defeat-01
                    :ARG0 (t2 / team
                              :mod (c2 / country
                                       :wiki +
                                       :name (n / name
                                                :op1 "Hungary")))
                    :ARG1 (t3 / team
                              :mod (c3 / country
                                       :wiki +
                                       :name (n2 / name
                                                 :op1 "Canada")))
                    :quant (s / score-entity
                              :op1 13
                              :op2 7))
            :op2 (d2 / defeat-01
                     :ARG0 (t4 / team
                               :mod (c4 / country
                                        :wiki +
                                        :name (n3 / name
                                                  :op1 "France")))
                     :ARG1 (t5 / team
                               :mod (c5 / country
                                        :wiki +
                                        :name (n4 / name
                                                  :op1 "Brazil")))
                     :quant (s2 / score-entity
                                :op1 10
                                :op2 9))
            :op3 (d3 / defeat-01
                     :ARG0 (t6 / team
                               :mod (c6 / country
                                        :wiki +
                                        :name (n5 / name
                                                  :op1 "Australia")))
                     :ARG1 (t7 / team
                               :mod (c7 / country
                                        :wiki +
                                        :name (n6 / name
                                                  :op1 "Germany")))
                     :quant (s3 / score-entity
                                :op1 10
                                :op2 8))
            :op4 (d4 / defeat-01
                     :ARG0 (t8 / team
                               :mod (c8 / country
                                        :wiki +
                                        :name (n7 / name
                                                  :op1 "Russia")))
                     :ARG1 (t9 / team
                               :mod (c9 / country
                                        :wiki +
                                        :name (n8 / name
                                                  :op1 "Netherlands")))
                     :quant (s4 / score-entity
                                :op1 7
                                :op2 6))
            :op5 (d5 / defeat-01
                     :ARG0 (t10 / team
                                :mod (c10 / country
                                          :wiki +
                                          :name (n9 / name
                                                    :op1 "United"
                                                    :op2 "States")))
                     :ARG1 (t11 / team
                                :mod (c11 / country
                                          :wiki +
                                          :name (n10 / name
                                                     :op1 "Kazakhstan")))
                     :quant (s5 / score-entity
                                :op1 10
                                :op2 5))
            :op6 (d6 / defeat-01
                     :ARG0 (t12 / team
                                :mod (c12 / country
                                          :wiki +
                                          :name (n11 / name
                                                     :op1 "Italy")))
                     :ARG1 (t13 / team
                                :mod (c13 / country
                                          :wiki +
                                          :name (n12 / name
                                                     :op1 "New"
                                                     :op2 "Zealand")))
                     :quant (s6 / score-entity
                                :op1 12
                                :op2 2))))
"""

if __name__ == "__main__":
    for _ in range(5):
        smatch_score = calculate_smatch([s], [s])
        print(smatch_score)

Output

{'smatch_precision': 0.8866666666666667, 'smatch_recall': 0.8866666666666667, 'smatch_fscore': 0.8866666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.88, 'smatch_recall': 0.88, 'smatch_fscore': 0.88, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8666666666666667, 'smatch_recall': 0.8666666666666667, 'smatch_fscore': 0.8666666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.9266666666666666, 'smatch_recall': 0.9266666666666666, 'smatch_fscore': 0.9266666666666666, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8533333333333334, 'smatch_recall': 0.8533333333333334, 'smatch_fscore': 0.8533333333333335, 'ratio_invalid_amrs': 0.0}

The non-determinism is very worrying to me. If an evaluation metric is not deterministic, how then can we compare systems to each other in a fair way? A difference of 0.92 vs 0.87 is massive for the same input/output.

flipz357 commented 1 year ago

Hi,

I implemented SMATCH++, that also contains an ILP solver. It should be very simple to run.

Please do:

pip install mip==1.13.0
pip install smatchpp

Then simply run:

python -m smatchpp -a largeamr.txt -b largeamr.txt -solver ilp

The output is

F1: 100.0 Precision: 100.0 Recall: 100.0

snowblink14 commented 1 year ago

@BramVanroy the reason your example does not work well is that it contains many similar components, which the hill-climbing implemented in the current smatch package is more likely to have different node matching (like matching the first "team" in the first AMR to the 2nd/3rd.. "team" in the second AMR), thus the different scores.

This is due to the NP-completeness of this semantic graph matching problem. Computing the exact solution is hard and time-consuming, so we employed the hill-climbing method to make this solving faster, but it had the weakness of depending on the initialization.

Unfortunately I no longer have the time to actively work and improve this. Please feel free to try smatch++ mentioned above to see if the speed and accuracy works for your case.

BramVanroy commented 1 year ago

Thanks for the reply @snowblink14. I'm mostly worried about the random differences. Wouldn't it make sense to fix the randomness so that everyone using it experiences the same behavior? Now it becomes quite hard to reproduce results (the example I gave is taken from the frequently used AMR 3.0 corpus).

flipz357 commented 1 year ago

@BramVanroy

Fixing random seed is a hack that doesn't really help with anything. The main problem with the hill-climber is that is has no useful upper-bound (you never know how far off you are from the best solution and thereby not know if you got the best solution).

ILP gives optimal alignment, and yes, it is NP complete, but it works for standard evaluation setup. NP complete also doesn't mean that you will not have a useful score in a hypothetical case where it doesn't find the optimal solution. Even intermediate solutions can be better than hill-climbing (hill-climber deteriorates even more for large graphs), and with ilp you will always have a meaningful upper bound (telling you the quality of the current solution).

You can read more in my paper