oepen closed this issue 4 years ago
Good catch. Here is an even-more-minimal test case:
$ cat i25a
(a / alpha
:ARG0 (b / beta))
$ cat i25b # only top concept different
(a / alternative
:ARG0 (b / beta))
$ cat i25c # only non-top concept different
(a / alpha
:ARG0 (b / b-side))
$ python smatch.py --pr -f i25a i25b
Precision: 0.50
Recall: 0.50
F-score: 0.50
$ python smatch.py --pr -f i25a i25c
Precision: 0.75
Recall: 0.75
F-score: 0.75
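The gap between the two scores can be reproduced by counting matched triples by hand. The sketch below uses the triple sets that smatch -v reports (under the identity variable mapping a→a0, b→a1, which is optimal here; smatch itself searches over mappings), not smatch's internal API:

```python
# Hand-count of matched triples for the test cases above. The triple
# sets mirror smatch -v output; this is a sketch, not smatch's API.

def prf(gold, test):
    """Precision/recall/F-score over two sets of triples."""
    matched = len(gold & test)
    p = matched / len(test)
    r = matched / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

i25a = {("instance", "a0", "alpha"), ("instance", "a1", "beta"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "alpha")}
# top concept differs: both its instance triple AND the TOP triple break
i25b = {("instance", "a0", "alternative"), ("instance", "a1", "beta"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "alternative")}
# non-top concept differs: only its instance triple breaks
i25c = {("instance", "a0", "alpha"), ("instance", "a1", "b-side"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "alpha")}

print(prf(i25a, i25b))  # (0.5, 0.5, 0.5): only 2 of 4 triples match
print(prf(i25a, i25c))  # (0.75, 0.75, 0.75): 3 of 4 triples match
```

Because the TOP triple repeats the top node's concept, a single changed top label breaks two triples at once, which is where the double penalty comes from.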
When smatch is run with -v, it shows how it considers the top triple an "attribute":
[...]
Instance triples of AMR 1: 2
[('instance', 'a0', 'alpha'), ('instance', 'a1', 'beta')]
Attribute triples of AMR 1: 1
[('TOP', 'a0', 'alpha')]
[...]
Note that the concept alpha is part of that TOP triple, when all it should probably do is indicate the top node. I tried making the attribute triple into a simple cycle on the top node (i.e., so the triple looks like ('TOP', 'a0', 'a0')), and this seems to resolve the score difference:
--- a/amr.py
+++ b/amr.py
@@ -418,7 +418,7 @@ class AMR(object):
relation_list.append(node_rel_list)
attribute_list.append(node_attr_list)
- # add TOP as an attribute. The attribute value is the top node value
- attribute_list[0].append(["TOP", node_value_list[0]])
+ # add TOP as a relation: a self-loop on the top node
+ relation_list[0].append(["TOP", node_name_list[0]])
result_amr = AMR(node_name_list, node_value_list, relation_list, attribute_list)
return result_amr
Results in:
$ python smatch.py --pr -f i25a i25b
Precision: 0.75
Recall: 0.75
F-score: 0.75
Now it gets a score of 0.75 both when the top label is different (i25b) and when a non-top label is different (i25c). In addition, it gets 0.75 when i25a is compared to the following variant, which has the same triples but a different top:
(b / beta
:ARG0-of (a / alpha))
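The counts under the patched self-loop scheme can be checked by hand as well. This is a sketch, not smatch's API, and it makes two assumptions: variables are mapped identically on both sides, and :ARG0-of inversion is normalized so the reversed variant shares the edge ('ARG0', alpha-node, beta-node):

```python
# Hand-count with the patched TOP triple ('TOP', var, var).

def prf(gold, test):
    matched = len(gold & test)
    p, r = matched / len(test), matched / len(gold)
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

i25a = {("instance", "a0", "alpha"), ("instance", "a1", "beta"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "a0")}
i25b = {("instance", "a0", "alternative"), ("instance", "a1", "beta"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "a0")}
i25c = {("instance", "a0", "alpha"), ("instance", "a1", "b-side"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "a0")}
# reversed variant: same concepts and edge, but TOP on the beta node
# (assumes :ARG0-of is normalized to ARG0 with swapped endpoints)
rev = {("instance", "a0", "alpha"), ("instance", "a1", "beta"),
       ("ARG0", "a0", "a1"), ("TOP", "a1", "a1")}

for other in (i25b, i25c, rev):
    print(prf(i25a, other))  # each: (0.75, 0.75, 0.75), one triple lost
```

In every case exactly one of the four triples mismatches, so all three comparisons come out at 0.75.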
I believe this has no consequences for the overall F1 score, but in my view the TOP property would be better treated as a node attribute than as a relation between a node and itself. Treating it as an attribute avoids (vacuously) stipulating a self-loop in the graph, and it preserves a one-to-one correspondence between relation triples and actual graph edges.
If that matters to you, then here's a diff that creates an attribute triple instead. It is probably equally unlikely to be confused with an actual graph triple (since there is no role called TOP). It leads to the same scores reported above for the TOP relation triple, at least for the four test files defined above.
--- a/amr.py
+++ b/amr.py
@@ -421,7 +421,7 @@ class AMR(object):
relation_list.append(node_rel_list)
attribute_list.append(node_attr_list)
- # add TOP as an attribute. The attribute value is the top node value
- attribute_list[0].append(["TOP", node_value_list[0]])
+ # add TOP as an attribute with a fixed, vacuous value
+ attribute_list[0].append(["TOP", 'top'])
result_amr = AMR(node_name_list, node_value_list, relation_list, attribute_list)
return result_amr
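A quick hand-count for this attribute scheme, where the TOP value is the constant 'top' (again a sketch under the identity variable mapping, mirroring what smatch -v would print after this patch, not smatch's API):

```python
def prf(gold, test):
    matched = len(gold & test)
    p, r = matched / len(test), matched / len(gold)
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

# TOP is now ('TOP', var, 'top'): it matches whenever the two top
# variables are mapped to each other, independent of their concepts.
i25a = {("instance", "a0", "alpha"), ("instance", "a1", "beta"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "top")}
i25b = {("instance", "a0", "alternative"), ("instance", "a1", "beta"),
        ("ARG0", "a0", "a1"), ("TOP", "a0", "top")}

print(prf(i25a, i25b))  # (0.75, 0.75, 0.75), same as the self-loop patch
```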
Hi again, @goodmami: the above looks just right to me (treating the TOP property as an attribute with a vacuous value). Would you be set up to submit that as a pull request?
I am wondering whether we could try to work through some more of the known issues, in the hope of arriving at a re-release of an improved SMATCH sometime over the next few weeks?
Yes, I can put together a PR for that, but for a re-release, what do people think of replacing amr.py with Penman? Doing so would solve several existing issues here. It's on PyPI, and I've recently released a 1.0 version to go with my ACL 2020 demo of the library. The 1.0 version isn't so much a big change from the previous version as it is a commitment to support its current API.
When comparing SMATCH results with the scorer from the 2019 CoNLL Shared Task on Cross-Framework Meaning Representation Parsing (MRP), we discovered that SMATCH will only consider the TOP property correct if the node labels also match. This appears to double-penalize label mismatches and is maybe not the intended behavior. For more technical detail and a minimal test case, please see the MRP mtool issue.