goodbai-nlp / AMRBART

Code for our paper "Graph Pre-training for AMR Parsing and Generation" in ACL2022
MIT License

0.3 lower Smatch score using amr-evaluation-enhanced #14

Closed hankcs closed 1 year ago

hankcs commented 1 year ago

Dear authors,

Thank you for sharing your work, it's amazing. I just want to share a finding regarding the parsing evaluation. As far as I know, many existing works (like Cai & Lam, ACL 2020) use amr-evaluation-enhanced to compute the Smatch score. Running that script on your parsing output returns 84.0, which is slightly lower than the 84.3 you reported. I ran it multiple times and the result stayed the same:

$ bash evaluation.sh data/model/amr3/bartamr/AMR3.0-test-pred-wiki.amr data/amr/amr_3.0/test.txt

Smatch -> P: 0.844, R: 0.836, F: 0.840
Unlabeled -> P: 0.867, R: 0.858, F: 0.862
No WSD -> P: 0.849, R: 0.841, F: 0.845
Non_sense_frames -> P: 0.918, R: 0.916, F: 0.917
Wikification -> P: 0.836, R: 0.817, F: 0.826
Named Ent. -> P: 0.893, R: 0.874, F: 0.884
Negations -> P: 0.716, R: 0.722, F: 0.719
IgnoreVars -> P: 0.746, R: 0.742, F: 0.744
Concepts -> P: 0.907, R: 0.900, F: 0.903
Frames -> P: 0.888, R: 0.885, F: 0.887
Reentrancies -> P: 0.721, R: 0.729, F: 0.725
SRL -> P: 0.801, R: 0.807, F: 0.804

I understand Smatch is using a stochastic matching algorithm and 0.3 is not significant at all. I just want to share this little finding with the community. Maybe we should migrate to the amrlib.evaluate.smatch_enhanced package you used for better comparison.

goodbai-nlp commented 1 year ago

Hi hankcs,

Thanks for your comments. We have also noticed that the Smatch scripts in amrlib and amr-evaluation-enhanced give different scores, although amrlib claims here that its code comes from amr-evaluation-enhanced. We will report both scores in the README to ensure a fair comparison with previous work.

hankcs commented 1 year ago

Thank you, cheers!

bjascob commented 7 months ago

I looked into this a bit, and it looks like amrlib is consistent with the latest snowblink14/smatch code (both the 1.0.4 release and the current master). The ChunchuanLv/amr-evaluation-tool-enhanced consistently gives 0.3 lower scores (as noted in the title) on the 2 models I tested.

All of these libs use snowblink14/smatch under the hood for the algorithm, but ChunchuanLv/amr-evaluation-tool-enhanced uses an older Python 2 version that is heavily hacked up. I'm guessing that later code from snowblink14 (which is now Python 3) fixed something. However, I didn't dig through their issues list or regression-test the code to find out what changed.

In short, I suspect the higher numbers are the "right" scores.

BTW, I tried running the test multiple times, and you only see variation in the fourth digit, i.e.:

Precision: 0.83471, Recall: 0.81098, F-score: 0.82267
Precision: 0.83476, Recall: 0.81103, F-score: 0.82272
Precision: 0.83483, Recall: 0.81109, F-score: 0.82279

So it's unlikely you'd see variation when only printing three digits, unless you happen to be right on the line.
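For reference, the precision/recall/F numbers in these listings come from simple arithmetic over matched-triple counts. Here is a minimal sketch of that arithmetic (the function name and the example counts are my own, chosen only to land near the numbers above; this is not code from any of the libraries discussed):

```python
def smatch_f(match_num: int, test_num: int, gold_num: int):
    """Compute precision, recall, and F-score from triple counts:
    matched triples, triples in the test graph, triples in the gold graph."""
    precision = match_num / test_num if test_num else 0.0
    recall = match_num / gold_num if gold_num else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Illustrative counts in the ballpark of the scores quoted above.
p, r, f = smatch_f(match_num=8110, test_num=9716, gold_num=10000)
print(f"Precision: {p:.5f}, Recall: {r:.5f}, F-score: {f:.5f}")
```

Since only `match_num` depends on the variable alignment the matcher finds, run-to-run variation in the fourth digit corresponds to the heuristic finding a handful more or fewer matched triples per run.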

I'll also pass on a comment from nschneid: there's also Smatch++, which uses an ILP solver instead of hill-climbing, so it is supposed to be more accurate. https://github.com/flipz357/smatchpp by @flipz357

flipz357 commented 7 months ago

Hi,

I think I can explain some of it.

The lower score from the old(er) Smatch script is mainly due to a penalty applied when the root concept is wrong. It turns out that penalty is actually desired, since the root concept is the focus of an AMR. For more explanation and context, see this recent issue.

So, no, the higher numbers are not the right scores.

To sum up simply:

Higher numbers due to optimal solving (SMATCH++): higher numbers are right

Higher numbers due to ignoring AMR root concept: higher numbers are wrong


A bit more on the variation

The differences:

Precision: 0.83471, Recall: 0.81098, F-score: 0.82267
Precision: 0.83476, Recall: 0.81103, F-score: 0.82272
Precision: 0.83483, Recall: 0.81109, F-score: 0.82279

seem tiny, and one might think they don't matter (I used to think so too). But the problem is that the true Smatch score may not be close to these numbers at all. These numbers are the result of a heuristic: on one run it may solve some graph pairs better, on another run other graph pairs, so on average you see only small variation. But if you solved every graph pair optimally, the true Smatch score could be different.
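This point can be made concrete with a toy. Below, a hypothetical one-pass greedy matcher (my own stand-in for a local-search heuristic; not the real smatch code, which uses hill-climbing with random restarts) gets stuck on a suboptimal variable mapping, while exhaustive search over all mappings finds the optimum. All names are mine:

```python
from itertools import permutations

def score(mapping, test_triples, gold_triples):
    """Count test triples that appear in gold after renaming variables."""
    gold = set(gold_triples)
    return sum((rel, mapping.get(s, s), mapping.get(t, t)) in gold
               for rel, s, t in test_triples)

def optimal_match(test_vars, gold_vars, test_triples, gold_triples):
    """Exhaustive search over all injective variable mappings.
    Only feasible for tiny graphs; real AMRs are far too big."""
    return max(score(dict(zip(test_vars, perm)), test_triples, gold_triples)
               for perm in permutations(gold_vars, len(test_vars)))

def greedy_match(test_vars, gold_vars, test_triples, gold_triples):
    """One-pass greedy: map each variable to whichever unused gold
    variable most improves the running score. Can get stuck."""
    mapping, used = {}, set()
    for v in test_vars:
        best_g, best_s = None, -1
        for g in gold_vars:
            if g in used:
                continue
            s = score({**mapping, v: g}, test_triples, gold_triples)
            if s > best_s:
                best_g, best_s = g, s
        mapping[v] = best_g
        used.add(best_g)
    return score(mapping, test_triples, gold_triples)

# Two 'person' nodes related by :like, with the arguments swapped in the
# prediction, so the greedy pass picks the wrong pairing on instance ties.
gold = [("instance", "a", "person"), ("instance", "b", "person"), ("like", "a", "b")]
test = [("instance", "p", "person"), ("instance", "q", "person"), ("like", "q", "p")]

print("greedy :", greedy_match(["p", "q"], ["a", "b"], test, gold))   # 2 (stuck)
print("optimal:", optimal_match(["p", "q"], ["a", "b"], test, gold))  # 3
```

Averaged over a whole test set, runs of such a heuristic can look stable to three digits while still sitting below the true (optimal) score, which is the motivation for ILP-based solving in Smatch++.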

bjascob commented 7 months ago

You're making the argument that your smatchpp code is better.

This posting is about the difference between some older Python 2 snowblink14/smatch scoring code and the latest release of the snowblink14/smatch code.

flipz357 commented 7 months ago

Hi @bjascob

If you click the link in my comment and read the comments on that issue, it explains exactly the difference between the older Smatch code and the newer one. (And I also described it in the post above...)

bjascob commented 7 months ago

Just to be clear: you're saying the older Python 2 versions of the snowblink14/smatch library are more accurate than the current Python 3 version of that lib.

flipz357 commented 7 months ago

Yes.

The older Smatch version

  1. considers the AMR root in a better way, and

  2. deletes duplicate triples, which the "python3 version" handles wrongly. With the wrong handling, duplicate triples can actually (and amusingly) inflate the evaluation score by up to 100 Smatch points.

If needed/interested, I can show how to reproduce both problems.

PS @bjascob: Just so we're on the same page, by "older Smatch" version I mean amr-evaluation-enhanced, and by the newer Smatch version I mean the "original" smatch, since it received the most recent update of the two. Sorry for using the terms "old"/"new", I see this may have caused some confusion.

flipz357 commented 5 months ago

@hankcs seeing this now, just a very minor correction (but I think it's important):

You said "Smatch is using a stochastic matching algorithm".

It's not really right to say this.

Rather, Smatch is a computational problem of graph matching/alignment.

The Smatch library that has been used most of the time just happens to employ a stochastic heuristic algorithm to solve that problem (which is, in fact, a weakness).