nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.
https://parser.kitaev.io/
MIT License

How are F1 scores calculated? #100

Open AngledLuffa opened 1 year ago

AngledLuffa commented 1 year ago

Stanza maintainer from across the bay here. We have a constituency parser which is doing pretty well, but I am not sure which evaluation to use to match the leaderboard scores. E.g., what numbers do people actually report? In the case of benepar, is there any secret sauce to getting the F1 scores reported in the chart?

In the "available models" chart, benepar_en3_large has an F1 of 96.29. I ran it on each of the sentences in the revised PTB as follows:

import benepar, spacy
from spacy.tokens import Doc
from stanza.models.constituency import tree_reader
from spacy.language import Language
from tqdm import tqdm

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # treat the whole pre-tokenized Doc as a single sentence
    for token in doc[:-1]:
        doc[token.i + 1].is_sent_start = False
    return doc

# I'm sure there's a tree reader in spacy or benepar somewhere, but I just used what I had available
test_set = tree_reader.read_treebank("data/constituency/en_ptb3-revised_test.mrg")

nlp = spacy.load('en_core_web_trf')
# turns off sentence splitting, which otherwise breaks up sentence 88 and others
nlp.add_pipe("set_custom_boundaries", before="parser")
nlp.add_pipe("benepar", config={"model": "benepar_en3_large"})

with open("benepar_out.mrg", "w") as fout:
    for tree in tqdm(test_set):
        sentence = tree.leaf_labels()
        doc = Doc(nlp.vocab, sentence, sent_starts=[1] + [0] * (len(sentence) - 1))
        doc = nlp(doc)
        sents = list(doc.sents)
        assert len(sents) == 1
        sent = sents[0]
        # convert -LCB-/-RCB- leaf tokens back to literal braces
        parse = sent._.parse_string.replace("-LCB-)", "{)").replace("-RCB-)", "})")
        fout.write("(ROOT {})\n".format(parse))

This is using spaCy's predicted tags, I believe, for what that's worth. Although I note that switching to en_core_web_md has no effect on the POS tags, so maybe it's not using spaCy tags after all. If those aren't the tags you used for training, would you let me know which ones so I can better match the performance?
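
For what it's worth, a quick sanity check along these lines would be to compare the leaf tags in benepar's parse_string against spaCy's token.tag_ for the same sentence. The snippet below is just a sketch (nltk.Tree is used purely as a convenient reader of the bracketed string, and the sentence is arbitrary):

import benepar, spacy
from nltk import Tree

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe("benepar", config={"model": "benepar_en3_large"})

doc = nlp("The stocks fell sharply at the opening bell.")
sent = list(doc.sents)[0]

# POS tags at the leaves of benepar's parse vs. spaCy's tagger
parse_tags = [tag for _, tag in Tree.fromstring(sent._.parse_string).pos()]
spacy_tags = [token.tag_ for token in sent]
for token, ptag, stag in zip(sent, parse_tags, spacy_tags):
    print(token.text, ptag, stag, "" if ptag == stag else "<-- differs")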

This outputs a file which looks the same as the input file. I ran that through evalb, using "the latest version" from here: https://nlp.cs.nyu.edu/evalb/. The result is 95.66. As part of Stanford CoreNLP, we have an evalb implementation in Java which drops punctuation nodes when counting the brackets and collapses PRT into ADVP; personally I wouldn't think those are still necessary in this day and age, but when I feed it the benepar results, I get 96.12, which is much closer. Possibly, if the POS tags aren't exact, that's the entire difference.
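
To be explicit about what I mean by the metric: I'm assuming the usual labeled bracketing F1, roughly the sketch below. This is not evalb itself; the deletion set, the function-tag stripping, and the ADVP/PRT equivalence are my approximation of what COLLINS.prm and the usual preprocessing do, and nltk.Tree is only used as a reader.

from collections import Counter
from nltk import Tree

# Approximation of the COLLINS.prm settings: labels whose brackets/tokens
# are removed before scoring, plus the ADVP/PRT label equivalence.
DELETE_LABELS = {"TOP", "ROOT", "-NONE-", ",", ":", "``", "''", "."}
EQ_LABELS = {"PRT": "ADVP"}

def brackets(tree):
    """Multiset of (label, start, end) spans, indexed over non-deleted tokens."""
    spans = Counter()
    def walk(node, start):
        if isinstance(node[0], str):  # preterminal: (POS word)
            return start if node.label() in DELETE_LABELS else start + 1
        end = start
        for child in node:
            end = walk(child, end)
        label = node.label().split("-")[0].split("=")[0]  # strip function tags
        label = EQ_LABELS.get(label, label)
        if end > start and label and label not in DELETE_LABELS:
            spans[(label, start, end)] += 1
        return end
    walk(tree, 0)
    return spans

def bracket_f1(gold_strings, pred_strings):
    matched = gold_total = pred_total = 0
    for gold, pred in zip(gold_strings, pred_strings):
        gb, pb = brackets(Tree.fromstring(gold)), brackets(Tree.fromstring(pred))
        matched += sum((gb & pb).values())
        gold_total += sum(gb.values())
        pred_total += sum(pb.values())
    precision, recall = matched / pred_total, matched / gold_total
    return 2 * precision * recall / (precision + recall)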

Is there something different in this process which would get back the reported score of 96.29? Do you know what the other leaderboard scores (such as on nlpprogress.com) did to get their results? The top several papers each just mention evalb, whereas your paper says "All values are F1 scores calculated using the version of evalb distributed with the shared task," and I wonder if there's some technical difference in the program, or if the standard leaderboard paper also removes punctuation, for example.

Thanks in advance. I would hate to report a score that was not produced the same way as the other reported scores.

AngledLuffa commented 1 year ago

My PI points out that COLLINS.prm is part of the evalb distribution and already specifies the punctuation deletion and the ADVP/PRT equivalence described above.

But the version of evalb labeled "the latest" has an issue where, if the model mistags a quotation mark as -LRB- for example, the token gets deleted from the gold tree but not from the predicted tree, so the sentence is skipped as a length mismatch. It seems this bug in evalb might have regressed, given the changelog entry:

David Ellis (Brown University): fixes a bug in which sentences were incorrectly categorized as "length mismatch" when the parse output had certain mislabeled parts-of-speech.
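
To make the failure mode concrete, here is a toy sketch (not evalb itself, and the deletion set is my assumption based on COLLINS.prm): a quote tagged `` in the gold tree gets deleted, but the same token mistagged as -LRB- in the predicted tree does not, so the token counts no longer line up.

# Toy illustration: deleting punctuation by POS label when a quote has been
# mistagged as -LRB- leaves the two trees with different token counts.
DELETE_TAGS = {",", ":", "``", "''", ".", "-NONE-"}  # assumed deletion set; -LRB- is not in it

gold_tags = ["``", "DT", "NN", "VBD", "''", "."]
pred_tags = ["-LRB-", "DT", "NN", "VBD", "-RRB-", "."]  # quotes mistagged as brackets

gold_len = sum(tag not in DELETE_TAGS for tag in gold_tags)
pred_len = sum(tag not in DELETE_TAGS for tag in pred_tags)
print(gold_len, pred_len)  # 3 vs. 5 -> evalb reports a "length mismatch" and skips the sentence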

How did you work around that, if at all? Feed gold tags into the parser model?