On Generalization in Coreference Resolution

Code for the CRAC 2021 paper On Generalization in Coreference Resolution (best short paper award). This paper extends our work from the EMNLP 2020 paper Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks.

June 2022: Performance has improved!!

Our current model gets an 80.9 F-score on OntoNotes (80.6 reported in the paper), an 80.2 F-score on LitBank (79.3 reported in the paper), and an 88.3 F-score on PreCo (up from 87.8 reported in the paper).

Why are we getting these gains?
Well, pretty much all of this gain can be attributed to this issue. I, like many others, had carried forward the Kenton Lee codebase, where spans (due to the choice of ELMo as an encoder) were always restricted to word boundaries. Interestingly, while porting the code to the new era of subword-tokenization-based encoders, we didn't constrain the mention detector to respect word boundaries. By the end of training, the model rarely makes these word-boundary mistakes (it still can, which is why the above-referenced issue was raised), but it does have to deal with a lot of noisy mentions in the mention proposal stage. Simply constraining candidate spans to word boundaries significantly reduces the number of candidate mentions, which ultimately leads to higher overall performance.
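
To make the fix concrete, here is a minimal sketch of the word-boundary constraint on candidate spans (not the repo's actual code; it assumes we already know, for each subword token, whether it starts or ends a word):

def filter_candidate_spans(candidate_spans, starts_word, ends_word):
    """Keep only candidate mention spans that respect word boundaries.

    candidate_spans: list of (start, end) subword indices, inclusive.
    starts_word[i] / ends_word[i]: True if subword i begins / ends a word.
    Illustrative sketch; the names and exact interface are assumptions.
    """
    return [(start, end) for (start, end) in candidate_spans
            if starts_word[start] and ends_word[end]]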

Changelog

Resources

Environment Setup

Install Requirements

The codebase has been tested with:

python==3.8.8
torch==1.10.0
transformers==4.11.3
scipy==1.6.3
omegaconf==2.1.1
hydra-core==1.1.1
wandb==0.12.6

These are the core requirements, which can be installed separately, or you can just run:

pip install -r requirements.txt

Clone a few GitHub repos (including this one!)

# Clone this repo
git clone https://github.com/shtoshni/fast-coref

# Create a coref resources directory which contains the official 
# scorers and the data
mkdir coref_resources; cd coref_resources/
git clone https://github.com/conll/reference-coreference-scorers

# Create data subdirectory in the resources directory
mkdir data

Data Preparation

cd fast-coref/src
export PYTHONPATH=.

# Demonstrating the data preparation step for QuizBowl.
# Here we point to the CoNLL directory extracted from the original data
# Output directory is created in the parent directory i.e. 
# ../../coref_resources/data/quizbowl/longformer
python data_processing/process_quizbowl.py ../../coref_resources/data/quizbowl/conll

Configurations

The config files are located in src/conf.
All the experiment configs are located in the src/conf/experiment subdirectory.

Path strings are limited to the experiment configs and the main src/conf/config.yaml file. These paths can be manually edited or overridden via the command line.
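
For example, the composed config (including any overrides) can be inspected with Hydra's compose API before launching a run. This is a minimal sketch meant to be run from fast-coref/src; the experiment=litbank override is one of the experiments used later in this README:

from hydra import initialize, compose
from omegaconf import OmegaConf

# Compose the main config with an experiment override and print the result,
# including the resolved path strings.
with initialize(config_path="conf"):
    cfg = compose(config_name="config", overrides=["experiment=litbank"])
    print(OmegaConf.to_yaml(cfg))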

Note
The default configs correspond to the configs used for training the models reported in the CRAC paper. All models are trained for a maximum of 100K steps.

The only exception is PreCo (Wandb log), where we experimented with more training steps (150K instead of 100K). But even there, the best validation performance is obtained at 60K steps, and training stops at 110K steps (after 10 evals without improvement).
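
For reference, the stopping rule described above is plain patience-based early stopping. A schematic version (not the repo's actual training loop; the helper callables are assumptions) looks like:

def train_with_patience(train_until_next_eval, evaluate, max_evals, patience=10):
    """Schematic patience-based stopping: halt after `patience` consecutive
    evals without improvement in validation F-score (illustrative only)."""
    best_f1, evals_since_best = 0.0, 0
    for _ in range(max_evals):
        train_until_next_eval()   # e.g., run a fixed number of gradient steps
        dev_f1 = evaluate()       # validation F-score
        if dev_f1 > best_f1:
            best_f1, evals_since_best = dev_f1, 0
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                break
    return best_f1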

Training and Inference

cd fast-coref/src
export PYTHONPATH=.

Training

Here are a few training commands I've used.

Joint training with wandb logging

python main.py experiment=joint use_wandb=True

LitBank training without label smoothing

python main.py experiment=litbank trainer.label_smoothing_wt=0.0

OntoNotes training with pseudo-singletons and longformer-base

python main.py experiment=ontonotes_pseudo model/doc_encoder/transformer=longformer_base

LitBank training with bounded memory model (memory size 20)

python main.py experiment=litbank model/memory/mem_type=learned model.memory.mem_type.max_ents=20

Note
The model is saved in two parts: the document encoder and all the remaining parameters are stored separately. The document encoder can then be easily uploaded to Huggingface.
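
For illustration, a split save along these lines might look as follows (a hypothetical sketch; the attribute and file names are assumptions, not the repo's exact code):

import torch

def save_model_in_two_parts(model, model_dir):
    # Huggingface-style save of the document encoder, so it can later be
    # pushed to the Hub and reloaded with from_pretrained().
    model.doc_encoder.save_pretrained(f"{model_dir}/doc_encoder")
    # Everything except the encoder goes into a regular PyTorch checkpoint.
    rest = {name: param for name, param in model.state_dict().items()
            if not name.startswith("doc_encoder.")}
    torch.save(rest, f"{model_dir}/model.pth")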

Inference

Inference on OntoNotes with model trained on OntoNotes

python main.py experiment=ontonotes_pseudo train=False

This is the most common use case, where the training and inference domains are the same. Inference is automatically carried out whenever training ends; the above command is useful if inference needs to be done at some intermediate step.

Inference on the OntoNotes evaluation set with the jointly trained model

python main.py experiment=ontonotes_pseudo paths.model_dir=../models/joint_best/ train=False

Evaluate on all the datasets + use the document encoder uploaded to Huggingface

python main.py experiment=eval_all paths.model_dir=../models/check_ontonotes/ model/doc_encoder/transformer=longformer_ontonotes override_encoder=True

The longformer_ontonotes model corresponds to the shtoshni/longformer_coreference_ontonotes encoder uploaded by us.
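
For downstream use, the released encoder can be loaded directly with the transformers library (a generic sketch, not the repo's own loading code):

from transformers import AutoModel, AutoTokenizer

# Load the OntoNotes-finetuned Longformer encoder released with this work.
tokenizer = AutoTokenizer.from_pretrained("shtoshni/longformer_coreference_ontonotes")
encoder = AutoModel.from_pretrained("shtoshni/longformer_coreference_ontonotes")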

Miscellaneous

Why this repository name?

Marketing and self-boasting aside, there are three reasons for this:

There are a lot of engineering hacks such as lower precision models, better implementation choices, etc., which can further improve the model's speed.
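
As one illustration of the lower-precision direction (a generic PyTorch sketch, not code shipped in this repo), inference can be wrapped in autocast so that lower-precision kernels are used where it is safe to do so:

import torch

@torch.no_grad()
def predict_mixed_precision(model, batch):
    # Mixed-precision inference: autocast selects lower-precision kernels
    # where safe, which typically speeds up encoder-heavy forward passes.
    with torch.cuda.amp.autocast():
        return model(batch)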

Why Hydra for configs?

It took me a few days to get the hang of Hydra, but I highly recommend it for maintaining configs. A few selling points of Hydra that made me persist:

Citation

@inproceedings{toshniwal2021generalization,
    title = {{On Generalization in Coreference Resolution}},
    author = "Shubham Toshniwal and Patrick Xia and Sam Wiseman and Karen Livescu and Kevin Gimpel",
    booktitle = "CRAC (EMNLP)",
    year = "2021",
}

@inproceedings{toshniwal2020bounded,
    title = {{Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks}},
    author = "Shubham Toshniwal and Sam Wiseman and Allyson Ettinger and Karen Livescu and Kevin Gimpel",
    booktitle = "EMNLP",
    year = "2020",
}