This repository contains the source code of CorPipe, which is available under the MPL-2.0 license. The architecture of CorPipe is described in the following paper:
Milan Straka and Jana Straková
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské nám. 25, Prague, Czech Republic
Abstract: ÚFAL CorPipe is a winning submission to the CRAC 2022 Shared Task
on Multilingual Coreference Resolution. Our system first solves mention
detection and then coreference linking on the retrieved spans with an
antecedent-maximization approach, and both tasks are fine-tuned jointly with
shared Transformer weights. We report results of fine-tuning a wide range of
pretrained models. The core of this contribution is the fine-tuned multilingual
models. We found a large multilingual model with a sufficiently large encoder to
increase performance on all datasets across the board, with the benefit not
limited to the underrepresented languages or groups of typologically
related languages.
- The directory `data` is for the CorefUD 1.0 data and its preprocessed and tokenized version needed for training.
  - The script `data/get.sh` downloads and extracts the CorefUD 1.0 training and development data, plus the unannotated test data of the CRAC 2022 shared task.
- The `corpipe.py` is the complete CorPipe source file.
- The `corefud-score.sh` is an evaluation script used by `corpipe.py`, which first runs the validator (the `validator` submodule) on the output data and then the official scorer (the `corefud-scorer` submodule), both without and with singletons (a usage sketch follows this list).
- The `res.py` is our script for visualizing the performance of running and finished experiments, and for comparing two experiments. It was developed for our needs and we provide it as-is, without documentation.
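As a quick illustration, a minimal sketch of invoking the evaluation script is shown below; the argument order (gold file first, system output second) and the file names are assumptions on our side, since the script's interface is not documented in this section:

```sh
# Sketch only: the argument order and file names are placeholders/assumptions,
# not a documented interface of corefud-score.sh.
bash corefud-score.sh gold.conllu predictions.conllu
```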
To train a single multilingual model on all the data, you should
1. run the `data/get.sh` script to download the CorefUD data,
2. install the Python packages listed in `requirements.txt`,
3. train the model itself using the `corpipe.py` script.
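A minimal sketch of the first two steps, assuming a Unix shell, a clone that already contains the `validator` and `corefud-scorer` submodules, and a plain `pip` environment (none of these are prescribed by the repository):

```sh
# Download and preprocess the CorefUD 1.0 data. We assume this is run from the
# repository root; adjust if the script expects a different working directory.
bash data/get.sh

# Install the Python dependencies listed in requirements.txt.
pip install -r requirements.txt
```

The third step is the `corpipe.py` invocation; the exact training commands we used follow.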
For training the large variant using RemBERT, we used:
tb="ca_ancora cs_pcedt cs_pdt de_parcorfull de_potsdamcc en_gum en_parcorfull es_ancora fr_democrat hu_szegedkoref lt_lcc pl_pcc ru_rucor"
corpipe.py $(for c in $tb; do echo data/$c/$c; done) --resample=6000 4 5 5 1 2 3 1 4 4 3 2 4 3 --epochs=20 --lazy_adam --learning_rate_decay --crf --batch_size=8 --bert=google/rembert --learning_rate=1e-5 --segment=512 --right=50 --exp=large-rembert
To train the base variant using XLM-R base, we used:
tb="ca_ancora cs_pcedt cs_pdt de_parcorfull de_potsdamcc en_gum en_parcorfull es_ancora fr_democrat hu_szegedkoref lt_lcc pl_pcc ru_rucor"
corpipe.py $(for c in $tb; do echo data/$c/$c; done) --resample 6000 4 5 5 1 2 3 1 4 4 3 2 4 3 --epochs=30 --lazy_adam --learning_rate_decay --crf --batch_size=8 --bert=jplu/tf-xlm-roberta-base --learning_rate=2e-5 --segment=512 --right=50 --exp=base-xlmr
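Before launching a full multilingual run, a reduced invocation on a single treebank can be derived from the pattern above. The following is only an illustrative sketch, not a configuration used in the paper; in particular, it assumes that `--resample` accepts a single ratio when a single treebank is given, and all other values are arbitrary small choices for a quick test:

```sh
# Illustrative quick test on one treebank (en_gum); the epoch count, batch
# size, resampling values and experiment name are arbitrary choices, and the
# single resample ratio is an assumption about the --resample interface.
corpipe.py data/en_gum/en_gum --resample=1000 1 --epochs=1 --lazy_adam --learning_rate_decay --crf --batch_size=4 --bert=jplu/tf-xlm-roberta-base --learning_rate=2e-5 --segment=512 --right=50 --exp=test-xlmr
```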
Currently no model is saved; instead, the script performs and saves predictions on both the development and test sets after every epoch.