nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.
https://parser.kitaev.io/
MIT License

How are F1 scores calculated? #100

Open AngledLuffa opened 1 year ago

AngledLuffa commented 1 year ago

Stanza maintainer from across the bay here. We have a constituency parser which is doing pretty well, but I am not sure which evaluation to use to match the leaderboard scores. E.g., what numbers do people actually report? In the case of benepar, is there any secret sauce to getting the F1 scores reported in the chart?

In the "available models" chart, benepar_en3_large has an F1 of 96.29. I ran it on each of the sentences in the revised PTB as follows:

import benepar, spacy
from spacy.tokens import Doc
from stanza.models.constituency import tree_reader
from spacy.language import Language
from tqdm import tqdm

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # treat the whole pre-tokenized Doc as a single sentence
    for token in doc[:-1]:
        doc[token.i + 1].is_sent_start = False
    return doc

# I'm sure there's a tree reader in spacy or benepar somewhere, but I just used what I had available
test_set = tree_reader.read_treebank("data/constituency/en_ptb3-revised_test.mrg")

nlp = spacy.load('en_core_web_trf')
# turns off sentence splitting, which otherwise breaks up sentence 88 and others
nlp.add_pipe("set_custom_boundaries", before="parser")
nlp.add_pipe("benepar", config={"model": "benepar_en3_large"})

with open("benepar_out.mrg", "w") as fout:
    for tree in tqdm(test_set):
        sentence = tree.leaf_labels()
        doc = Doc(nlp.vocab, sentence, sent_starts=[1] + [0] * (len(sentence) - 1))
        doc = nlp(doc)
        sents = list(doc.sents)
        assert len(sents) == 1
        sent = sents[0]
        # convert -LCB-/-RCB- leaf tokens back to literal braces
        parse = sent._.parse_string.replace("-LCB-)", "{)").replace("-RCB-)", "})")
        fout.write("(ROOT {})\n".format(parse))

This is using spaCy's predicted tags, I believe, for what that's worth. Although I note that switching to en_core_web_md has no effect on the POS tags, so maybe it's not using spaCy tags after all. If those aren't the tags you used for training, would you let me know which ones so I can better match the performance?
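
For what it's worth, a quick sanity check along these lines would be to compare the leaf tags in benepar's parse_string against spaCy's token.tag_ for the same sentence. The snippet below is just a sketch (nltk.Tree is used purely as a convenient reader of the bracketed string, and the sentence is arbitrary):

import benepar, spacy
from nltk import Tree

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("benepar", config={"model": "benepar_en3_large"})

doc = nlp("The stocks fell sharply at the opening bell.")
sent = list(doc.sents)[0]

# POS tags at the leaves of benepar's parse vs. spaCy's tagger
parse_tags = [tag for _, tag in Tree.fromstring(sent._.parse_string).pos()]
spacy_tags = [token.tag_ for token in sent]
for token, ptag, stag in zip(sent, parse_tags, spacy_tags):
    print(token.text, ptag, stag, "" if ptag == stag else "<-- differs")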

This outputs a file which looks the same as the input file. I ran that through evalb, using "the latest version" from here: https://nlp.cs.nyu.edu/evalb/. The result is 95.66. As part of Stanford CoreNLP, we have an evalb implementation in Java which drops punctuation nodes when counting the brackets and collapses PRT into ADVP; personally I wouldn't think those are still necessary in this day and age, but when I feed it the benepar results, I get 96.12, which is much closer. Possibly, if the POS tags aren't exact, that's the entire difference.
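
To be explicit about what I mean by the metric: I'm assuming the usual labeled bracketing F1, roughly the sketch below. This is not evalb itself; the deletion set, the function-tag stripping, and the ADVP/PRT equivalence are my approximation of what COLLINS.prm and the usual preprocessing do, and nltk.Tree is only used as a reader.

from collections import Counter
from nltk import Tree

# Approximation of the COLLINS.prm settings: labels whose brackets/tokens
# are removed before scoring, plus the ADVP/PRT label equivalence.
DELETE_LABELS = {"TOP", "ROOT", "-NONE-", ",", ":", "``", "''", "."}
EQ_LABELS = {"PRT": "ADVP"}

def brackets(tree):
    """Multiset of (label, start, end) spans, indexed over non-deleted tokens."""
    spans = Counter()
    def walk(node, start):
        if isinstance(node[0], str):  # preterminal: (POS word)
            return start if node.label() in DELETE_LABELS else start + 1
        end = start
        for child in node:
            end = walk(child, end)
        label = node.label().split("-")[0].split("=")[0]  # strip function tags
        label = EQ_LABELS.get(label, label)
        if end > start and label and label not in DELETE_LABELS:
            spans[(label, start, end)] += 1
        return end
    walk(tree, 0)
    return spans

def bracket_f1(gold_strings, pred_strings):
    matched = gold_total = pred_total = 0
    for gold, pred in zip(gold_strings, pred_strings):
        gb, pb = brackets(Tree.fromstring(gold)), brackets(Tree.fromstring(pred))
        matched += sum((gb & pb).values())
        gold_total += sum(gb.values())
        pred_total += sum(pb.values())
    precision, recall = matched / pred_total, matched / gold_total
    return 2 * precision * recall / (precision + recall)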

Is there something different in this process which would get back the reported score of 96.29? Do you know what the other leaderboard scores (such as on nlpprogress.com) did to get their results? The top several papers each just mention evalb, whereas your paper says "All values are F1 scores calculated using the version of evalb distributed with the shared task," and I wonder if there's some technical difference in the program, or if the standard leaderboard paper also removes punctuation, for example.

Thanks in advance. I would hate to report a score that was not produced the same way as the other reported scores.

AngledLuffa commented 1 year ago

My PI points out that COLLINS.prm is part of the evalb distribution and already specifies the punctuation deletion and the ADVP/PRT equivalence described above.

But the version of evalb labeled "the latest" has an issue where, if the model mistags a quotation mark as -LRB- for example, the token gets deleted from the gold tree but not from the predicted tree, so the sentence is skipped as a length mismatch. It seems this bug in evalb might have regressed, given the changelog entry:

David Ellis (Brown University): fixes a bug in which sentences were incorrectly categorized as "length mismatch" when the parse output had certain mislabeled parts-of-speech.
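
To make the failure mode concrete, here is a toy sketch (not evalb itself, and the deletion set is my assumption based on COLLINS.prm): a quote tagged `` in the gold tree gets deleted, but the same token mistagged as -LRB- in the predicted tree does not, so the token counts no longer line up.

# Toy illustration: deleting punctuation by POS label when a quote has been
# mistagged as -LRB- leaves the two trees with different token counts.
DELETE_TAGS = {",", ":", "``", "''", ".", "-NONE-"}  # assumed deletion set; -LRB- is not in it

gold_tags = ["``", "DT", "NN", "VBD", "''", "."]
pred_tags = ["-LRB-", "DT", "NN", "VBD", "-RRB-", "."]  # quotes mistagged as brackets

gold_len = sum(tag not in DELETE_TAGS for tag in gold_tags)
pred_len = sum(tag not in DELETE_TAGS for tag in pred_tags)
print(gold_len, pred_len)  # 3 vs. 5 -> evalb reports a "length mismatch" and skips the sentence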

How did you work around that, if at all? Feed gold tags into the parser model?