AngledLuffa opened this issue 1 year ago
My PI points out that COLLINS.prm is part of evalb and is where the punctuation deletion is configured. But the version of evalb labeled "the latest" has an issue where, if a model mistags a quote as -LRB-, for example, the quote does get deleted from the gold tree but not from the predicted tree. It seems this bug in evalb might somehow have regressed?
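To make the failure mode concrete, here is a toy illustration (plain Python with nltk, not evalb's actual code, and with the delete list abbreviated from what I remember of COLLINS.prm):

```python
from nltk import Tree

# Punctuation POS tags that evalb deletes before scoring, per the
# DELETE_LABEL entries I remember from COLLINS.prm (abbreviated).
DELETE_TAGS = {",", ":", "``", "''", "."}

def kept_leaves(tree):
    """Leaves that survive punctuation deletion."""
    return [word for word, tag in tree.pos() if tag not in DELETE_TAGS]

# Same sentence; the predicted tree mistags the closing quote as -LRB-.
gold = Tree.fromstring("(S (NP (NN dog)) (VP (VBZ barks)) ('' ''))")
pred = Tree.fromstring("(S (NP (NN dog)) (VP (VBZ barks)) (-LRB- ''))")

print(kept_leaves(gold))  # ['dog', 'barks']        -- quote deleted
print(kept_leaves(pred))  # ['dog', 'barks', "''"]  -- quote kept
# The two yields now have different lengths, which is how a single mistag
# turns into a length mismatch / asymmetric deletion at scoring time.
```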
David Ellis (Brown University): fixes a bug in which sentences were incorrectly categorized as "length mismatch" when the parse output had certain mislabeled parts-of-speech.
How did you work around that, if at all? Feed gold tags into the parser model?
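For concreteness, if gold tags are the answer, I believe benepar's pre-tokenized interface accepts a tags field, along the lines of the sketch below (reconstructed from memory of the README, so the field names may be off, and I don't know whether the model actually conditions on the tags or just copies them into the output):

```python
import benepar

# benepar.download("benepar_en3_large")  # one-time model download
parser = benepar.Parser("benepar_en3_large")

# Pre-tokenized input with externally supplied POS tags (gold tags here).
# Whether these tags influence the predicted brackets is exactly what
# I'm unsure about.
sentence = benepar.InputSentence(
    words=['"', 'Fly', 'safely', '.', '"'],
    space_after=[False, True, False, False, False],
    tags=['``', 'VB', 'RB', '.', "''"],
)
print(parser.parse(sentence))
```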
Stanza maintainer here, from across the bay. We have a constituency parser which is doing pretty well, but I am not sure what evaluation to use to match the leaderboard scores. E.g., what numbers do people actually report? In the case of benepar, is there any secret sauce to getting the F1 scores reported in the chart?
In the "available models" chart,
benepar_en3_large
has an F1 of 96.29. I ran it on each of the sentences in the revised PTB as follows:This is using spaCy predicted tags, I believe, for what that's worth. Although I note that switching to
en_core_web_md
has no effect on the POS tags, so maybe it's not using spaCy tags after all. If that's not the tags you used for training, would you let me know which ones so I can better match the performance?This outputs a file which looks the same as the input file. I ran that through evalb, using "the latest version" from here: https://nlp.cs.nyu.edu/evalb/ The result is 95.66. As part of Stanford CoreNLP, we have an evalb implementation in Java which drops punctuation nodes when counting the brackets and collapses PRT into ADVP... personally I wouldn't think those are still necessary in this day and age, but when I feed it the benepar results, I get 96.12. Much closer, possibly if the POS tags aren't exact that's the entire difference.
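In outline, the script is something like the following (a sketch with invented file names; the real script differs in the details):

```python
import benepar
import spacy
from nltk import Tree

# benepar.download("benepar_en3_large")  # one-time model download
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3_large"})

# Assumes the gold trees are already one per line, the format evalb wants.
with open("ptb-revised-test.mrg") as fin, open("benepar-pred.mrg", "w") as fout:
    for line in fin:
        gold = Tree.fromstring(line)
        # Feed the gold tokens back through spaCy.  Note that spaCy may
        # re-tokenize or re-split sentences differently from the PTB,
        # which could itself account for part of any score gap.
        doc = nlp(" ".join(gold.leaves()))
        sents = list(doc.sents)
        # Expecting exactly one sentence per gold tree; anything else
        # would need special handling.
        fout.write(sents[0]._.parse_string + "\n")

# Then scored with evalb from https://nlp.cs.nyu.edu/evalb/ , roughly:
#   evalb -p COLLINS.prm ptb-revised-test.mrg benepar-pred.mrg
```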
Is there something different in this process which would get back to the reported score of 96.29? Do you know how the other leaderboard scores (such as on nlpprogress.com) were produced? The top several papers each just mention evalb, whereas your paper says "All values are F1 scores calculated using the version of evalb distributed with the shared task," and I wonder if there's some technical difference in the program or if the standard leaderboard papers also remove punctuation, for example.

Thanks in advance. I would hate to report a score in any way which was not produced the same way as other reported scores.