snowblink14 / smatch

Smatch tool: evaluation of AMR semantic structures
MIT License

1.11 F1 score #15

Open · miguelballesteros opened this issue 6 years ago

miguelballesteros commented 6 years ago

I tried evaluating a single sentence against itself and I got a Smatch score greater than one (!!), any idea why? Thank you.

Details below:

python smatchnew/smatch/smatch.py -f q3.txt q3.txt
F-score: 1.11

cat q3.txt
# ::snt How many white settlers were living in Kenya in the 1950's ?
(l / live-01
      :ARG0 (p / person
            :ARG1-of (s / settle-03
                  :ARG1 p
                  :ARG4 c)
            :ARG1-of (w / white-02)
            :quant (a / amr-unknown))
      :location (c / country :name "Kenya")
      :time (d / date-entity :decade 1950))
snowblink14 commented 6 years ago

@miguelballesteros I think it's because smatch assumes that the same triple can only occur once. In your example, you have :ARG0 (p / person :ARG1-of (s / settle-03 :ARG1 p, which results in two identical triples <ARG1, settle, person>.

I am not sure whether this duplication is a mistake or intended behavior, but we could add a fix for cases where the same triple occurs more than once.

If you remove :ARG1 p from your example, the score will be 1.0.
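The over-count can be illustrated with a few lines of arithmetic (a sketch, not smatch's actual code; the triple count of 9 is illustrative rather than an exact count for q3.txt):

```python
def f1(match, n_test, n_gold):
    # Plain F1 from raw match counts.
    p = match / n_test
    r = match / n_gold
    return 2 * p * r / (p + r)

# If the matcher credits a duplicated triple twice in the numerator while
# each side's total counts triples once, the score exceeds 1.0. For example,
# 9 distinct triples with one of them credited twice gives 10/9:
print(round(f1(10, 9, 9), 2))  # 1.11
```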

miguelballesteros commented 6 years ago

I see, that makes sense. I understand that this needs to occur in both the gold graph and the predicted graph; if it only happens in the predicted graph it won't have that effect, is that right?

snowblink14 commented 6 years ago

Currently smatch treats gold graph and predicted graph equally, so if the duplication happens in the predicted graph it will also cause some overcounting. Before a fix is applied, a workaround is to check if there are duplicate triples in your graphs.

goodmami commented 4 years ago

The duplicate issue in #28 highlights that this is still causing headaches. I think the problem here is that the duplicated triple is counted in the numerator but not the denominator. There are four arrangements we could consider, including the current one:

  1. Count duplicated edges in the numerator but not the denominator (leading to a score > 1.0 if everything else is correct) (current situation)
  2. Ignore duplicated edges entirely (leading to a score of 1.0 if everything else is correct)
  3. Count duplicated edges in the denominator but not the numerator (leading to a score < 1.0)
  4. Count duplicated edges in both (leading to a score < 1.0, but approaching 1.0 the more duplicated edges there are)

I think 3 leaves open the door for gaming the metric; users can pad their AMRs with edges they are confident about. Both 1 and 2 are ok, but I have a preference for 2 as these duplicated edges are bad AMRs (see https://github.com/amrisi/amr-guidelines/issues/93 and https://github.com/amrisi/amr-guidelines/issues/121) and the badness should be reflected in the score.
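For a concrete comparison, here is a toy calculation of the first three options, assuming 5 gold triples and a predicted graph that gets all of them right but duplicates one (option 4's score depends on how the extra copy is matched, so it is omitted; this is an illustration, not smatch's code):

```python
def f1(match, n_test, n_gold):
    # Plain F1 from raw match counts.
    p = match / n_test
    r = match / n_gold
    return 2 * p * r / (p + r)

# Gold: 5 distinct triples. Predicted: the same 5, one of them duplicated
# (6 triple tokens in total).
opt1 = f1(6, 5, 5)  # duplicate counted in numerator only -> 1.2 (> 1.0)
opt2 = f1(5, 5, 5)  # duplicates ignored entirely         -> 1.0
opt3 = f1(5, 6, 5)  # duplicate in denominator only       -> ~0.91 (< 1.0)
```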

ramon-astudillo commented 4 years ago

I agree that 2 makes the most sense.

oepen commented 4 years ago

SMATCH computes F1 scores, so there should be no uncertainty about what the correct definition is here.

there could be ‘duplicate’ tuples in either the gold or the system graph, or both. all should be counted equally, i.e. some will be correct (in both graphs), some maybe only in one or the other. i believe the right solution will be more in the spirit of your option 3 rather than 2. the potential for ‘gaming’ scores, to me, seems to presuppose that one can change the gold-standard target graph?

assuming a fixed ‘gold’ graph, say it contains two ‘duplicate’ triples (h : mode interrogative). a parser output with two such triples will yield perfect precision and recall; missing for example one of them will reduce recall, whereas padding with extra :mode triples would penalize precision.
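This multiset view of F1 can be sketched with a Counter, using the (h :mode interrogative) example above (the have-03 instance triple is made up to pad the graph; none of this is smatch's actual code):

```python
from collections import Counter

def multiset_prf(gold, test):
    # Token-level (multiset) matching: every duplicate counts, and the
    # overlap for each triple is the minimum of its two counts.
    g, t = Counter(gold), Counter(test)
    match = sum((g & t).values())
    p = match / sum(t.values())
    r = match / sum(g.values())
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = [("h", "mode", "interrogative")] * 2 + [("h", "instance", "have-03")]
exact = list(gold)                                     # both copies present
missing = gold[1:]                                     # one copy missing
padded = gold + [("h", "mode", "interrogative")] * 2   # two extra copies
# exact   -> P = R = 1.0
# missing -> P = 1.0, recall drops to 2/3
# padded  -> precision drops to 3/5, R = 1.0
```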

oepen commented 4 years ago

one more comment on the legitimacy of ‘duplicate’ triples. the original issue was about multiple edges with the same label (roles), and the discussions among AMR developers that you dug up, @goodmami, seem to lean towards outlawing those.

but i think it is a new observation by @ramon-astudillo (in #28) that the same over-counting problem also applies to constant-valued node properties (attributes). there are some legitimate instances of multiple occurrences of the same attribute in the latest AMR release, pointed out to me by @timjogorman last year (look for :li):

"d)Finaly, which shop/website do you recommend and some buying advise would be realy good plz!!"
(a3 / and :polite + :li "d" :li "-1"
    :op1 (r / recommend-01
          :ARG0 (y / you)
          :ARG4 (a / amr-unknown
                :domain (s / slash
                      :op1 (s2 / shop)
                      :op2 (w / website))))
    :op2 (g / good-02
          :ARG1 (t / thing
                :ARG2-of (a2 / advise-01)
                :purpose (b / buy-01)
                :quant (s3 / some))
          :ARG1-of (r2 / real-04)))
tahira commented 4 years ago

@oepen those :li attributes seem to have different values ... not the same value duplicated

ramon-astudillo commented 4 years ago

In any case, after reading @oepen's comment, I agree that it may be best to count repetitions in the gold AMR as such and penalize having either a higher or lower count in the predicted AMR (closer to option 3).

Even if such repetitions won't happen in AMR right now, this is closer to a pure F1. It would also support repetitions if needed (but only when they are present in the gold AMR).

tahira commented 4 years ago

2 seems simpler ... and harmless if multiple triples are not allowed by the AMR guidelines ... but @oepen is suggesting a sophisticated version of 3 that counts only as many duplicated triples in the numerator as are present in the gold, but not the rest ... that would mean more change to the code but would not rely on gold graphs always sticking to 'no-duplicate-triples'

oepen commented 4 years ago

yes, so only an example of (arguably) motivated repetition of attributes. i struggle to suggest a linguistically plausible graph where the same attribute would repeatedly have the same value. but once multiple occurrences of an attribute are legitimate, and their values are arbitrary constants ... there is no way to prevent a parser (or possibly annotator) from creating wholly ‘duplicate’ triples. my general view is that these are not technically duplicates, just multiple tokens of the same tuple type.

goodmami commented 4 years ago

@oepen you make a fair point about keeping the metric a correct implementation of F1. It seems you may have misinterpreted what is meant by a duplicate triple. In this case we're not talking merely about the source node and role, but the full triple, so in the original issue the triple ARG1(s, p) appears twice. It also doesn't matter whether they are attribute triples or node-to-node triples. Here are examples of both:

$ cat i15.gold  # e.g., "Ethiopian coffee is very good."
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very))
$ cat i15.test-a  # duplicated degree(g, v)
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia")))
   :degree (v / very)
   :degree v)
$ cat i15.test-b  # duplicated op1(n, "Ethiopia")
(g / good
   :ARG1 (c / coffee
      :source (c2 / country
         :name (n / name :op1 "Ethiopia" :op1 "Ethiopia")))
   :degree (v / very))
$ python3 smatch.py -f i15.test-a i15.gold 
F-score: 1.04
$ python3 smatch.py -f i15.test-b i15.gold 
F-score: 1.04

I think @ramon-astudillo's and @tahira123's suggestions are good. We need more sophisticated counting/matching of the triples, so that matching triples are paired off and removed from consideration for further pairings, or, alternatively, so that we look for the same counts of matching triples.

I've come around to liking this solution better than (2) above, since it leaves open the possibility for legitimate duplicates in the gold graph. Thanks for the discussion, everyone.
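The pair-off idea can be sketched as follows (one interpretation of the suggestion above, not smatch's implementation):

```python
def paired_matches(gold, test):
    # Each matched gold triple token is consumed, so a triple that appears
    # k times in the predicted graph can earn at most min(k, count in gold)
    # credits.
    pool = list(gold)
    matched = 0
    for t in test:
        if t in pool:
            pool.remove(t)
            matched += 1
    return matched

t = ("s", "ARG1", "p")
# gold has one copy, prediction has two: only one credit
print(paired_matches([t], [t, t]))  # 1
```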

snowblink14 commented 4 years ago

Thanks for the nice discussion. About the duplicate triples: I agree that although legitimate repetition of triples is arguable at this moment, it would be nice to support repetitions instead of ruling them out.

I propose the following code changes:

  1. Check for and output warning messages if either the predicted or the gold AMR has duplicate triples.
  2. Add an option to customize how duplicate triples are treated. By default, duplicate triples in the gold AMR are treated as legitimate, and the predicted graph can only get a 1.0 score if it matches the gold AMR entirely (including the number of duplicate triples), but users can specify options to ignore the duplicates, etc.

How does this sound?
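Proposal (1) could be sketched roughly like this (the function name and message format are illustrative, not a proposed smatch API):

```python
import sys
from collections import Counter

def check_duplicates(triples, label="graph"):
    # Warn when a graph contains the same triple more than once, and
    # return the duplicated triples with their counts.
    dups = {t: c for t, c in Counter(triples).items() if c > 1}
    if dups:
        print(f"Warning: duplicate triples in {label}: {dups}",
              file=sys.stderr)
    return dups
```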

BramVanroy commented 1 year ago

@snowblink14 Sounds great. Especially 2 seems good for many use-cases. Any progress in this regard?

flipz357 commented 1 year ago

Hi all,

just adding a comment on this issue; maybe it helps someone. Imo this issue is quite a problem, since in combination with standard micro scoring it becomes a real vulnerability; see my blog article.

However, the good news is that there are straightforward solutions to this problem, as implemented in Smatch++:

  1. default: the AMR graph is standardized and duplicate edges are removed (since they don't really add information). I see that this solution was also proposed somewhere in the thread above.

  2. optional: keep duplicate edges but use proper scoring. Full credit goes to my colleague Julius Steen, who found that the solution to this issue is rather simple: use a counter / count dict when creating the weight dict, and when two edges match, save the minimum of the duplicate edge's counts across the two graphs (since that's the proper maximum matching count) before alignment. This way duplicates are allowed but the final score is proper.

If you want this in the Smatch library as well, I think the implementation should be rather simple. For now the feature is available in Smatch++.