Unified Structure Generation for Universal Information Extraction

UIE

Update

Requirements

General

Python Packages

CUDA 10.2

conda create -n uie python=3.8
conda activate uie
conda install -y pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt

CUDA 11.1

conda create -n uie python=3.8
conda activate uie
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Quick Start

Datasets of Extraction Tasks

For preprocessing details, see Data preprocessing.

After preprocessing, link the converted dataset to data:

ln -s dataset_processing/converted_data/ data

Data Format

Each data folder contains seven files:

data/text2spotasoc/absa/14lap
├── entity.schema       # Entity Types for converting SEL to Record
├── relation.schema     # Relation Types for converting SEL to Record
├── event.schema        # Event Types for converting SEL to Record
├── record.schema       # Spot/Asoc Type for constructing SSI
├── test.json
├── train.json
└── val.json

train.json, val.json, and test.json are the data files; each line is one JSON instance. Every instance contains a text field and a record field: text is the plain input text, and record is the SEL representation of the extraction structure. For detailed definitions, see DATASETS.md.
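As a rough illustration of the format described above, the following sketch reads a data file line by line and checks that each instance carries the two documented fields. The helper name load_instances and the sample instance are hypothetical; in particular, the record value here is only a placeholder, not real SEL.

```python
import json
import tempfile
from pathlib import Path


def load_instances(path):
    """Read a JSON-lines data file; each non-empty line is one instance."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            instance = json.loads(line)
            missing = {"text", "record"} - instance.keys()
            if missing:
                raise ValueError(f"{path}:{line_no} lacks {sorted(missing)}")
            instances.append(instance)
    return instances


# Demo on a throwaway file; the "record" value is a placeholder, not real SEL.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train.json"
    sample = {"text": "The battery life is great.", "record": "<SEL placeholder>"}
    demo_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    demo = load_instances(demo_path)

print(len(demo), demo[0]["text"])
```

With the real data linked, the same call would be load_instances("data/text2spotasoc/absa/14lap/train.json").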

Note:

Token Role
Start of Label Name
End of Label Name
Start of Input Text
Start of Text Span
NULL span for Rejection

Pretrained Models

Pre-trained models are available from the CAS Cloud Box and Google Drive links below, or can be downloaded with the gdown command (pip install gdown).

uie-en-base [CAS Cloud Box] [Google Drive] [Huggingface]

uie-en-large [CAS Cloud Box] [Google Drive] [Huggingface]

uie-char-small (Chinese) [CAS Cloud Box]

# Example of Google Drive
gdown 12Dkh6KLDPvXrkQ1I-1xLqODQSYjkwnvs && unzip uie-base-en.zip
gdown 15OFkWw8kJA1k2g_zehZ0pxcjTABY2iF1 && unzip uie-large-en.zip

Place all models under hf_models/, which is where the default run scripts look for them.

Model Fine-tuning

First, create the output directory.

The training command is as follows (see the bash scripts and Python files for the corresponding command-line arguments):

. config/data_conf/base_model_conf_absa.ini  && model_name=uie-base-en dataset_name=absa/14lap bash scripts_exp/run_exp.bash

Trained models are saved in the output_dir specified by run_uie_finetune.bash.

Simple Training Command

bash run_uie_finetune.bash -v -d 0 \
  -b 16 \
  -k 3 \
  --lr 1e-4 \
  --warmup_ratio 0.06 \
  -i absa/14lap \
  --epoch 50 \
  --spot_noise 0.1 \
  --asoc_noise 0.1 \
  -f spotasoc \
  --map_config config/offset_map/closest_offset_en.yaml \
  -m hf_models/uie-base-en \
  --random_prompt

Progress logs

...
***** Running training *****
  Num examples = 906
  Num Epochs = 50
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2850
  Num examples = 219
  Batch size = 64
...
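The step count in this log can be reproduced by hand. The sketch below assumes Hugging Face Trainer-style accounting (the final partial batch of each epoch is kept, and one optimizer update happens per gradient-accumulation group); the function name is illustrative and not part of the repository.

```python
import math


def total_optimization_steps(num_examples, batch_size, epochs, grad_accum=1):
    # Batches per epoch, keeping the final partial batch (assumption).
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update per `grad_accum` batches, at least one per epoch.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


steps = total_optimization_steps(num_examples=906, batch_size=16, epochs=50)
print(steps)  # 2850: 57 batches per epoch x 50 epochs
```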

Final result (exact scores may differ across machines and environments)

...
test offset-rel-strict-P 67.01461377870564
test offset-rel-strict-R 59.11602209944752
test offset-rel-strict-F1 62.81800391389433
...

Metric                   Definition
ent-(P/R/F1)             Micro-F1 of Entity (Entity Type, Entity Span)
rel-strict-(P/R/F1)      Micro-F1 of Relation Strict (Relation Type, Arg1 Span, Arg1 Type, Arg2 Span, Arg2 Type)
rel-boundary-(P/R/F1)    Micro-F1 of Relation Boundary (Relation Type, Arg1 Span, Arg2 Span)
evt-trigger-(P/R/F1)     Micro-F1 of Event Trigger (Event Type, Trigger Span)
evt-role-(P/R/F1)        Micro-F1 of Event Argument (Event Type, Arg Role, Arg Span)
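To make the micro-averaged scores concrete, the sketch below pools matches over all instances before computing precision, recall, and F1. It is an illustration under stated assumptions, not the logic of scripts/eval_extraction.py: records are plain tuples such as (Entity Type, Entity Span), duplicates are counted as a multiset, and micro_prf and the toy data are made up for the example.

```python
from collections import Counter


def micro_prf(gold_records, pred_records):
    """Micro-averaged P/R/F1 (in percent) over per-instance lists of tuples."""
    tp = n_gold = n_pred = 0
    for gold_items, pred_items in zip(gold_records, pred_records):
        gold_c, pred_c = Counter(gold_items), Counter(pred_items)
        tp += sum((gold_c & pred_c).values())  # exact tuple matches
        n_gold += sum(gold_c.values())
        n_pred += sum(pred_c.values())
    p = 100.0 * tp / n_pred if n_pred else 0.0
    r = 100.0 * tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy data (hypothetical): two instances, entity tuples of (type, span).
gold = [[("person", "Steve"), ("organization", "Apple")], [("location", "Dublin")]]
pred = [[("person", "Steve")], [("location", "Dublin"), ("person", "Dublin")]]

p, r, f1 = micro_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=66.67 R=66.67 F1=66.67

# Sanity check on the scores logged above: F1 is the harmonic mean of P and R.
p_log, r_log = 67.01461377870564, 59.11602209944752
print(round(2 * p_log * r_log / (p_log + r_log), 4))  # 62.818
```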

Model Pre-training

[TODO] Add detailed description.

Data Collator

We construct different sequence-to-sequence tasks using different data collators.

HybirdDataCollator

We unify different types of (text, structure) pairs for pre-training with HybirdDataCollator. It contains multiple data collators for different instances:

DataCollatorForMetaSeq2Seq

Sampling Strategy and Rejection Mechanism can be adopted in the training process.

Related parameters in class DataTrainingArguments are briefly introduced here:

Scripts for Model Evaluation

Verifying the performance of UIE requires converting the generated SEL expressions into Records and then evaluating them.

1. Convert structured expressions to record structures (sel2record.py)

After training, pred_folder will contain 'eval_preds_seq2seq.txt' or 'test_preds_seq2seq.txt'

$ python scripts/sel2record.py -h
usage: sel2record.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-c MAP_CONFIG] [-d DECODING] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -g GOLD_FOLDER        folder of golden answer
  -p PRED_FOLDER [PRED_FOLDER ...]
                        multiple different prediction folders
  -c MAP_CONFIG, --config MAP_CONFIG
                        offset matching strategy configuration file, more configuration files are placed in config/offset_map
  -d DECODING           specify structure parser, default is SpotAsoc structure
  -v, --verbose         print more detailed log information

2. Validate model performance (eval_extraction.py)

After converting, pred_folder will contain 'eval_preds_record.txt' or 'test_preds_record.txt'

$ python scripts/eval_extraction.py -h
usage: eval_extraction.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-v] [-w] [-m] [-case]

optional arguments:
  -h, --help            show this help message and exit
  -g GOLD_FOLDER        Golden Dataset folder
  -p PRED_FOLDER [PRED_FOLDER ...]
                        Predicted model folder
  -v                    Show more information during running
  -w                    Write evaluation results to predicted folder
  -m                    Refers to the matching policy
  -case                 Show case study
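The two steps can also be chained from Python. The sketch below only assembles the command lines, using flags listed in the help output above; the wrapper function is hypothetical, the prediction folder is a placeholder, and actually running the commands (for example with subprocess.run(cmd, check=True)) requires a prediction folder produced by training.

```python
import shlex


def build_eval_commands(gold_folder, pred_folder, map_config):
    """Return two command lines: SEL -> Record conversion, then scoring."""
    convert = ["python", "scripts/sel2record.py",
               "-g", gold_folder, "-p", pred_folder, "-c", map_config]
    # -w writes the evaluation results into the prediction folder.
    evaluate = ["python", "scripts/eval_extraction.py",
                "-g", gold_folder, "-p", pred_folder, "-w"]
    return convert, evaluate


convert_cmd, evaluate_cmd = build_eval_commands(
    gold_folder="data/text2spotasoc/absa/14lap",
    pred_folder="output/my_run",  # placeholder: wherever predictions were written
    map_config="config/offset_map/closest_offset_en.yaml",
)
for cmd in (convert_cmd, evaluate_cmd):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell-quoting problems if a folder name contains spaces.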

3. Verify the performance of the mapping label (check_offset_map_gold_as_pred.bash)

To verify the effect of the structure parser, we take the gold SEL as the prediction result and evaluate its performance.

bash scripts/check_offset_map_gold_as_pred.bash <data-folder> <map-config>

Citation

If this repository helps you, please cite this paper:

Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, Hua Wu. 2022. Unified Structure Generation for Universal Information Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5755–5772, Dublin, Ireland. Association for Computational Linguistics.

@inproceedings{lu-etal-2022-unified,
    title = "Unified Structure Generation for Universal Information Extraction",
    author = "Lu, Yaojie  and
      Liu, Qing  and
      Dai, Dai  and
      Xiao, Xinyan  and
      Lin, Hongyu  and
      Han, Xianpei  and
      Sun, Le  and
      Wu, Hua",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.395",
    pages = "5755--5772",
}

License

The code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for noncommercial use only. Any commercial use requires formal permission in advance.