PyTorch implementation of the end-to-end coreference resolution model with different higher-order inference methods.
Training on SpanBERT large gives F1 more than 1 point lower than presented in the paper #16

I trained the SpanBERT large model with the default parameters in the config file and got an Avg. F1 of 78.27, lower than the Avg. F1 of 79.9 reported in the paper. The config is as follows:

num_docs = 2802
bert_learning_rate = 1e-05
task_learning_rate = 0.0003
max_segment_len = 512
ffnn_size = 3000
cluster_ffnn_size = 3000
max_training_sentences = 3
bert_tokenizer_name = bert-base-cased

max_top_antecedents = 50
max_training_sentences = 5
top_span_ratio = 0.4
max_num_extracted_spans = 3900
max_num_speakers = 20
max_segment_len = 256


bert_learning_rate = 1e-5
task_learning_rate = 2e-4
loss_type = marginalized # {marginalized, hinge}
mention_loss_coef = 0
false_new_delta = 1.5 # For loss_type = hinge
adam_eps = 1e-6
adam_weight_decay = 1e-2
warmup_ratio = 0.1
max_grad_norm = 1 # Set 0 to disable clipping
gradient_accumulation_steps = 1

Model hyperparameters.

coref_depth = 1 # when 1: no higher order (except for cluster_merging)
higher_order = attended_antecedent # {attended_antecedent, max_antecedent, entity_equalization, span_clustering, cluster_merging}
coarse_to_fine = true
fine_grained = true
dropout_rate = 0.3
ffnn_size = 1000
ffnn_depth = 1
cluster_ffnn_size = 1000 # For cluster_merging
cluster_reduce = mean # For cluster_merging
easy_cluster_first = false # For cluster_merging
cluster_dloss = false # cluster_merging
num_epochs = 24
feature_emb_size = 20
max_span_width = 30
use_metadata = true
use_features = true
use_segment_distance = true
model_heads = true
use_width_prior = true # For mention score
use_distance_prior = true # For mention-ranking score


conll_eval_path = dev.english.v4_gold_conll # gold_conll file for dev
conll_test_path = test.english.v4_gold_conll # gold_conll file for test
genres = ["bc", "bn", "mz", "nw", "pt", "tc", "wb"]
eval_frequency = 1000
report_frequency = 100
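The settings pasted above assign several keys twice with different values (for example max_segment_len is 512 and then 256, and max_training_sentences is 3 and then 5), so it is worth confirming which value the run actually used. The snippet below is a minimal sketch for listing such keys from a flattened dump. The helper name `conflicting_keys` and the sample string are hypothetical, and it assumes trailing `#` comments have already been removed; it does not say which value takes effect.

```python
import re
from collections import defaultdict

# Matches "key = value" pairs in a flattened dump. A value is assumed to be a
# single whitespace-free token (e.g. 512, 1e-05, bert-base-cased).
PAIR = re.compile(r"(\w+)\s*=\s*(\S+)")


def conflicting_keys(dump: str) -> dict:
    """Return {key: [values...]} for keys assigned more than one distinct value."""
    seen = defaultdict(list)
    for key, value in PAIR.findall(dump):
        if value not in seen[key]:
            seen[key].append(value)
    # Keep only keys that were given at least two different values.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical excerpt of the settings above, with comments stripped.
dump = (
    "max_segment_len = 512 ffnn_size = 3000 max_training_sentences = 3 "
    "max_training_sentences = 5 max_segment_len = 256 ffnn_size = 1000"
)

for key, values in conflicting_keys(dump).items():
    print(f"{key}: {values}")
```

Values are compared as strings, so 1e-05 and 1e-5 would be reported as different even though they are numerically equal.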

Hi @yangjingyi. I am getting ~76 average F1 with spanbert_large (bert_pretrained_name_or_path = SpanBERT/spanbert-large-cased). I get ~76 F1 both when training a new spanbert_large model and when evaluating the model provided in this repo (https://cs.emory.edu/~lxu85/train_spanbert_large_ml0_d2.tar) with the "train_spanbert_large_ml0_d2" config, i.e. coref_depth=2 (+AA), which should be equivalent to Joshi et al. 2020. Were you able to reproduce ~79 avg F1 by just evaluating the model provided in this repo?

@lxucs do you know of any possible reasons for not being able to reproduce the SpanBERT (+AA) scores (using the model provided here https://cs.emory.edu/~lxu85/train_spanbert_large_ml0_d2.tar)?

It turned out that the data needs to be tokenized with "bert-base-cased" even though the model is "SpanBERT/spanbert-large-cased". I am able to reproduce ~79 now. Earlier I was also using "SpanBERT/spanbert-large-cased" for tokenization, which gave ~76.
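For anyone hitting the same gap, a small sanity check along these lines could catch the mismatch before preprocessing. This is only a sketch: the key names bert_tokenizer_name and bert_pretrained_name_or_path are taken from this thread, while the function `check_tokenizer` and the rule it encodes (SpanBERT weights paired with the bert-base-cased tokenizer) are assumptions based on the finding above, not part of the repo.

```python
def check_tokenizer(config: dict) -> list:
    """Return warnings if a SpanBERT model is not paired with bert-base-cased.

    Assumption from this thread: SpanBERT checkpoints should be used with
    data tokenized by "bert-base-cased".
    """
    warnings = []
    model = config.get("bert_pretrained_name_or_path", "")
    tokenizer = config.get("bert_tokenizer_name", "")
    if "spanbert" in model.lower() and tokenizer != "bert-base-cased":
        warnings.append(
            f"model {model!r} is paired with tokenizer {tokenizer!r}; "
            "this thread reports ~76 F1 for that setup and ~79 F1 "
            "with 'bert-base-cased'"
        )
    return warnings


# The setup that reproduced ~79 avg F1 in this thread.
good = {
    "bert_pretrained_name_or_path": "SpanBERT/spanbert-large-cased",
    "bert_tokenizer_name": "bert-base-cased",
}
# The setup that gave ~76 avg F1.
bad = {
    "bert_pretrained_name_or_path": "SpanBERT/spanbert-large-cased",
    "bert_tokenizer_name": "SpanBERT/spanbert-large-cased",
}

for name, cfg in (("good", good), ("bad", bad)):
    print(name, check_tokenizer(cfg))
```

Note that the data has to be re-tokenized after changing bert_tokenizer_name; changing the config alone does not affect data that was already preprocessed.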