lyutyuh / ASP

PyTorch implementation and pre-trained models for ASP - Autoregressive Structured Prediction with Language Models, EMNLP 22. https://arxiv.org/pdf/2210.14698.pdf
MIT License

Autoregressive Structured Prediction with Language Models

This repository contains the PyTorch implementation and pre-trained models for ASP, described in Autoregressive Structured Prediction with Language Models.

Links: ETH-NLPED lab, Rycolab

Setup

1. Clone this repo:

git clone https://github.com/lyutyuh/ASP.git
cd ASP
export ASP=$PWD # setting environment variable

2. Prepare the environment

2.1 Create a virtual environment with:

pip

```bash
python -m venv /asp    # create a new environment (asp)
source /asp/bin/activate
pip install -r requirements.txt
```

or

conda

```bash
conda env create -f environment.yml    # create a new environment (asp)
```

Download and preprocess the datasets

Named entity recognition

### CoNLL-03

```bash
wget https://polybox.ethz.ch/index.php/s/bFf8vJBonIT7sr8/download -O ./data/conll03_ner.zip
unzip ./data/conll03_ner.zip -d ./data
rm ./data/conll03_ner.zip
python ./data/conll03_ner/conll03_to_json.py
python ./data/t5minimize_ner.py ./data/conll03_ner ./data/conll03_ner
```

### OntoNotes V5

Coming soon!

End-to-end relation extraction

### CoNLL-04

```bash
wget https://polybox.ethz.ch/index.php/s/Lk44AwhOeDSeZTh/download -O ./data/conll04_ere.zip
unzip ./data/conll04_ere.zip -d ./data
rm ./data/conll04_ere.zip
python ./data/t5minimize_ere.py ./data/conll04_ere/ ./data/conll04_ere
```

### ACE-05

ACE-05 is not a publicly available dataset. Please follow https://github.com/luanyi/DyGIE/tree/master/preprocessing to obtain the dataset JSON files `{train,dev,test}.json` and copy them to `./data/ace05_ere/`. Then:

```bash
python ./data/ace05_ere/ace05_to_json.py
python ./data/t5minimize_ere.py ./data/ace05_ere ./data/ace05_ere
```

Coreference resolution

### CoNLL-12 (OntoNotes)

OntoNotes is not a publicly available dataset. Please follow http://conll.cemantix.org/2012/data.html and https://catalog.ldc.upenn.edu/LDC2013T19 to obtain the files `{train,dev,test}.english.v4_gold_conll` and copy them to `./data/ontonotes_coref/`. Then:

```bash
python ./data/t5minimize_coref.py ./data/ontonotes_coref/ ./data/ontonotes_coref/
```

Tasks

For task in {ner,ere,coref}:

  python run_{task}.py <config_name> 0 

The available <config_name> values are listed in the {ner,ere,coref}.conf files under configs.
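
The command pattern above can be sketched as a small Python helper that only assembles the argument list and does not launch training. `build_run_command` is a hypothetical name, not part of the repository, and the sketch assumes the trailing `0` is a GPU index (by analogy with `<gpu_id>` in the evaluation command).

```python
TASKS = ("ner", "ere", "coref")


def build_run_command(task, config_name, gpu_id=0):
    """Return the argv list for `python run_{task}.py <config_name> <gpu_id>`.

    Assumption: the last positional argument is a GPU index.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return ["python", f"run_{task}.py", config_name, str(gpu_id)]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(build_run_command("ner", "flant5_base")))
```

The resulting list can be passed to `subprocess.run` from the repository root once the environment and data are prepared.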

Running on New Datasets

1. Prepare the data

{
    "entities": {
        "Loc": {"short": "Loc", "verbose": "Location"}, 
        "Org": {"short": "Org", "verbose": "Organization"}, 
        "Peop": {"short": "Peop", "verbose":"People"}, 
        "Other": {"short": "Other", "verbose": "Other"}
    }, 
    "relations": { # Not necessary for NER
        "Work_For": {"short": "Work", "verbose": "Work for", "symmetric": false}, 
        "Kill": {"short": "Kill", "verbose": "Kill", "symmetric": false}, 
        "OrgBased_In": {"short": "OrgBI", "verbose": "Organization based in", "symmetric": false}, 
        "Live_In": {"short": "Live", "verbose": "Live in", "symmetric": false}, 
        "Located_In": {"short": "LocIn", "verbose": "Located in", "symmetric": false}
    }
}

and run

  python ./data/t5minimize_ere.py ./data/<newdataset>/ ./data/<newdataset>/
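
Before running the preprocessing script, it may help to sanity-check the type definitions. The sketch below is a hypothetical helper (`validate_types` is not part of the repository) and assumes the structure shown above is complete: every type has string fields `short` and `verbose`, and every relation also has a boolean `symmetric`. Standard JSON parsers such as Python's `json` module reject `#` comments, so the sample omits the inline annotation.

```python
import json


def validate_types(types):
    """Return a list of problems found in a type-definition dict (empty if none)."""
    problems = []
    entities = types.get("entities")
    if not isinstance(entities, dict) or not entities:
        problems.append('missing or empty "entities" block')
        entities = {}
    relations = types.get("relations", {})  # optional: not needed for NER

    for kind, block in (("entity", entities), ("relation", relations)):
        for name, spec in block.items():
            if not isinstance(spec, dict):
                problems.append(f"{kind} {name}: definition is not an object")
                continue
            for key in ("short", "verbose"):
                if not isinstance(spec.get(key), str):
                    problems.append(f"{kind} {name}: missing string field {key!r}")
            if kind == "relation" and not isinstance(spec.get("symmetric"), bool):
                problems.append(f"{kind} {name}: missing boolean field 'symmetric'")
    return problems


SAMPLE = """
{
  "entities": {"Loc": {"short": "Loc", "verbose": "Location"}},
  "relations": {"Live_In": {"short": "Live", "verbose": "Live in", "symmetric": false}}
}
"""

if __name__ == "__main__":
    print(validate_types(json.loads(SAMPLE)))
```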

2. Prepare the configuration

Add a new entry to the corresponding .conf file under configs and set its data directory to the new dataset:

  data_dir = ${ASP}/data/<newdataset>/
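
As a rough illustration of what `${ASP}/data/<newdataset>/` points to, the sketch below expands the path from the `ASP` environment variable exported during setup. `resolve_data_dir` and the dataset name `mydataset` are hypothetical; how the repository itself resolves the value is not shown here.

```python
import os


def resolve_data_dir(dataset_name, env=None):
    """Expand ${ASP}/data/<dataset_name>/ using the ASP environment variable."""
    env = os.environ if env is None else env
    root = env.get("ASP")
    if not root:
        raise RuntimeError("ASP is not set; run `export ASP=$PWD` in the repository root")
    # Trailing separator mirrors the form used in the config entry.
    return os.path.join(root, "data", dataset_name) + os.sep


if __name__ == "__main__":
    # "mydataset" is a placeholder for <newdataset>.
    print(resolve_data_dir("mydataset", {"ASP": "/path/to/ASP"}))
```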

Pre-trained models

Use the following command to load a pre-trained model and evaluate it on the corresponding task. <config_name> is the experiment name given in the .conf file under configs.

python evaluate_<task>.py <config_name> <checkpoint_name> <gpu_id>

1. Coreference resolution

| config_name | checkpoint_name | dataset | link | params |
| --- | --- | --- | --- | --- |
| flant5_base | tliu/asp-coref-flan-t5-base | CoNLL-2012 (OntoNotes) | link | 220 M |
| flant5_large | tliu/asp-coref-flan-t5-large | CoNLL-2012 (OntoNotes) | link | 770 M |
| flant5_xl | tliu/asp-coref-flan-t5-xl | CoNLL-2012 (OntoNotes) | link | 3 B |
| t0_3b | tliu/asp-coref-t0-3b | CoNLL-2012 (OntoNotes) | link | 3 B |

2. Named entity recognition (NER)

| config_name | checkpoint_name | dataset | link | params |
| --- | --- | --- | --- | --- |
| flant5_base | tliu/asp-ner-flan-t5-base | CoNLL-03 NER | link | 220 M |
| flant5_large | tliu/asp-ner-flan-t5-large | CoNLL-03 NER | link | 770 M |

3. End-to-end relation extraction (ERE)

| config_name | checkpoint_name | dataset | link | params |
| --- | --- | --- | --- | --- |
| flant5_base_conll04 | tliu/asp-re-flan-t5-base | CoNLL-04 RE | link | 220 M |
| flant5_large_conll04 | tliu/asp-re-flan-t5-large | CoNLL-04 RE | link | 770 M |
| flant5_xl_conll04 | tliu/asp-re-flan-t5-xl | CoNLL-04 RE | link | 3 B |
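
The tables above pair each config name with a checkpoint name. A minimal sketch of turning that pairing into the evaluation command is given below; `CHECKPOINTS` and `build_eval_command` are hypothetical names, and the sketch assumes `<task>` in `evaluate_<task>.py` takes the same values (`ner`, `ere`, `coref`) as in `run_{task}.py`.

```python
# (task, config_name) -> checkpoint_name, copied from the tables above.
CHECKPOINTS = {
    ("coref", "flant5_base"): "tliu/asp-coref-flan-t5-base",
    ("coref", "flant5_large"): "tliu/asp-coref-flan-t5-large",
    ("coref", "flant5_xl"): "tliu/asp-coref-flan-t5-xl",
    ("coref", "t0_3b"): "tliu/asp-coref-t0-3b",
    ("ner", "flant5_base"): "tliu/asp-ner-flan-t5-base",
    ("ner", "flant5_large"): "tliu/asp-ner-flan-t5-large",
    ("ere", "flant5_base_conll04"): "tliu/asp-re-flan-t5-base",
    ("ere", "flant5_large_conll04"): "tliu/asp-re-flan-t5-large",
    ("ere", "flant5_xl_conll04"): "tliu/asp-re-flan-t5-xl",
}


def build_eval_command(task, config_name, gpu_id=0):
    """Return the argv list for evaluate_<task>.py with the matching checkpoint."""
    try:
        checkpoint = CHECKPOINTS[(task, config_name)]
    except KeyError:
        raise ValueError(f"no checkpoint listed for {task}/{config_name}") from None
    return ["python", f"evaluate_{task}.py", config_name, checkpoint, str(gpu_id)]


if __name__ == "__main__":
    # Print the command instead of running the evaluation.
    print(" ".join(build_eval_command("coref", "flant5_base")))
```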

Citation

@inproceedings{liu-etal-2022-autoregressive,
    title={Autoregressive Structured Prediction with Language Models},
    author={Tianyu Liu and Yuchen Jiang and Nicholas Monath and Ryan Cotterell and Mrinmaya Sachan},
    year={2022},
    url={https://arxiv.org/abs/2210.14698},
    eprint={2210.14698},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}