snowblink14 / smatch

Smatch tool: evaluation of AMR semantic structures

interpretation of TOP property #25

Closed: oepen closed this issue 4 years ago

oepen commented 4 years ago

when comparing SMATCH results with the scorer from the 2019 CoNLL Shared Task on Cross-Framework Meaning Representation Parsing (MRP), we discovered that SMATCH only considers the TOP property correct if the node labels also match. this appears to double-penalize label mismatches and is maybe not the intended behavior? for more technical detail and a minimal test case, please see the MRP mtool issue.

goodmami commented 4 years ago

Good catch. Here is an even-more-minimal test case:

$ cat i25a
(a / alpha
   :ARG0 (b / beta))
$ cat i25b  # only top concept different
(a / alternative
   :ARG0 (b / beta))
$ cat i25c  # only non-top concept different
(a / alpha
   :ARG0 (b / b-side))
$ python smatch.py --pr -f i25a i25b
Precision: 0.50
Recall: 0.50
F-score: 0.50
$ python smatch.py --pr -f i25a i25c
Precision: 0.75
Recall: 0.75
F-score: 0.75
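
The asymmetry falls out of the triple counts. Here is a minimal sketch (hand-built triple sets under the obvious variable mapping, not smatch's own matching code) that reproduces these numbers; the TOP triples match what the -v output below reports:

# hand-built triple sets for i25a/i25b/i25c, mapping a -> a0, b -> a1
i25a = {('instance', 'a0', 'alpha'), ('instance', 'a1', 'beta'),
        ('TOP', 'a0', 'alpha'), ('ARG0', 'a0', 'a1')}
i25b = {('instance', 'a0', 'alternative'), ('instance', 'a1', 'beta'),
        ('TOP', 'a0', 'alternative'), ('ARG0', 'a0', 'a1')}
i25c = {('instance', 'a0', 'alpha'), ('instance', 'a1', 'b-side'),
        ('TOP', 'a0', 'alpha'), ('ARG0', 'a0', 'a1')}
# changing the top concept breaks its instance triple AND the TOP triple,
# while changing a non-top concept breaks only its instance triple
print(len(i25a & i25b) / 4)  # 0.5
print(len(i25a & i25c) / 4)  # 0.75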

When smatch is run with -v, it shows that it considers the top triple an "attribute":

[...]
Instance triples of AMR 1: 2
[('instance', 'a0', 'alpha'), ('instance', 'a1', 'beta')]
Attribute triples of AMR 1: 1
[('TOP', 'a0', 'alpha')]
[...]

Note that the concept alpha is part of that TOP triple, when all it should probably do is indicate the top node. I tried replacing the attribute triple with a relation triple that forms a simple cycle on the top node (i.e., so the triple looks like ('TOP', 'a0', 'a0')), and that seems to resolve the scoring discrepancy:

--- a/amr.py
+++ b/amr.py
@@ -418,7 +418,7 @@ class AMR(object):
             relation_list.append(node_rel_list)
             attribute_list.append(node_attr_list)
         # add TOP as an attribute. The attribute value is the top node value
-        attribute_list[0].append(["TOP", node_value_list[0]])
+        relation_list[0].append(["TOP", node_name_list[0]])
         result_amr = AMR(node_name_list, node_value_list, relation_list, attribute_list)
         return result_amr

Results in:

$ python smatch.py --pr -f i25a i25b
Precision: 0.75
Recall: 0.75
F-score: 0.75

Now it gets a score of 0.75 both when the top label is different (i25b) and when a non-top label is different (i25c). In addition, it gets 0.75 when i25a is compared to the following variant, which has the same triples (the :ARG0-of edge normalizes to :ARG0) but a different top, so only the TOP self-loop fails to match:

(b / beta
   :ARG0-of (a / alpha))
oepen commented 4 years ago

i believe it has no consequences for the overall F1 score, but in my view the TOP property would better be considered a node attribute than a relation between a node and itself. treating it as an attribute avoids (vacuously) stipulating a self-loop in the graph, and it preserves a one-to-one correspondence between relation triples and actual graph edges.
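
concretely, the two encodings under discussion for i25a would be (illustrative triples only, not code from either patch):

('TOP', 'a0', 'a0')   # relation: a vacuous self-loop edge on the top node
('TOP', 'a0', 'top')  # attribute: a label-independent property of the top node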

goodmami commented 4 years ago

If that matters to you, then here's a diff that creates an attribute triple instead. It is probably equally unlikely to be confused with an actual graph triple (since there is no role called TOP), and it leads to the same scores reported above for the TOP relation triple, at least for the four test files defined above.

--- a/amr.py
+++ b/amr.py
@@ -421,7 +421,7 @@ class AMR(object):
             relation_list.append(node_rel_list)
             attribute_list.append(node_attr_list)
         # add TOP as an attribute. The attribute value is the top node value
-        attribute_list[0].append(["TOP", node_value_list[0]])
+        attribute_list[0].append(["TOP", 'top'])
         result_amr = AMR(node_name_list, node_value_list, relation_list, attribute_list)
         return result_amr
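
As a sanity check, the same hand-built counting as before (a sketch, not smatch internals) shows why the scores come out identical: the TOP triple is now ('TOP', 'a0', 'top') in both graphs, so a changed top concept breaks only its instance triple:

i25a_top = {('instance', 'a0', 'alpha'), ('instance', 'a1', 'beta'),
            ('TOP', 'a0', 'top'), ('ARG0', 'a0', 'a1')}
i25b_top = {('instance', 'a0', 'alternative'), ('instance', 'a1', 'beta'),
            ('TOP', 'a0', 'top'), ('ARG0', 'a0', 'a1')}
print(len(i25a_top & i25b_top) / 4)  # 0.75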
oepen commented 4 years ago

hi again, @goodmami: the above looks just right to me (treating the TOP property as an attribute with a vacuous value). would you be set up to submit that as a pull request?

i am wondering whether we could work through some more of the known issues and arrive at a re-release of an improved SMATCH sometime over the next few weeks?

goodmami commented 4 years ago

Yes, I can put together a PR for that, but for a re-release, what do people think of replacing amr.py with Penman? Doing so would solve several existing issues here. It's on PyPI, and I've recently released a 1.0 version to go with my ACL 2020 demo of the library. The 1.0 version isn't so much a big change from the previous version as it is a commitment to support its current API.
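
For a taste, here's a minimal sketch of the corresponding Penman 1.0 calls (assuming the penman package from PyPI; output shown in comments):

import penman

g = penman.decode('(a / alpha :ARG0 (b / beta))')
print(g.top)      # 'a' -- the top is graph metadata, not a triple
print(g.triples)  # [('a', ':instance', 'alpha'), ('a', ':ARG0', 'b'),
                  #  ('b', ':instance', 'beta')]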