NERTasks

License: GNU Lesser General Public License v3.0

Contents

What's It?

1. Requirements

2. Install Dependencies

3. How To Prepare Datasets

4. Experiments

  4.1 Hyper Parameters

  4.2 Model Parameters

  4.3 Results

    4.3.1 Full Data Results

    4.3.2 Few Shot Results

5. Acknowledgement And Citations

  5.1 People And Organizations

  5.2 Third-Party Libraries

What's It?

A simple NER framework.

It implements:

| Item | Source/Reference |
| --- | --- |
| **Models** | |
| BiLSTM-Linear | Long Short-Term Memory |
| BiLSTM-Linear-CRF | Neural Architectures for Named Entity Recognition |
| BERT-Linear | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
| BERT-Linear-CRF | |
| BERT-BiLSTM-Linear | |
| BERT-BiLSTM-Linear-CRF | |
| BERT(Prompt) | EntLM approach, from "Template-free Prompt Tuning for Few-shot NER" |
| **Datasets** | |
| CoNLL2003 | yuanxiaosc/BERT-for-Sequence-Labeling-and-Text-Classification |
| OntoNotes5 | LDC2013T19 |
| CCKS2019 Subtask 1 | TIANCHI (NER on Chinese medical documents) |
| NCBI-disease | BioBERT (NER on English medical documents; fetched via its download.sh) |
| **Training Tricks** | |
| Gradient Accumulation | |
| Learning Rate Warmup | |
| **Misc** | |
| Tokenizer from datasets | See myutils.py |
| NER Metrics | seqeval: A Python framework for sequence labeling evaluation |

You can easily add your own models and datasets into this framework.
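As a flavor of the simplest entry in the table, here is a minimal BiLSTM-Linear sketch in plain PyTorch, using the embedding/hidden sizes listed under Model Parameters below. The class name and signatures are illustrative; the repo's actual implementation may differ.

```python
import torch
from torch import nn

class BiLSTMLinear(nn.Module):
    """Token-level tagger: embeddings -> BiLSTM -> per-token tag logits."""
    def __init__(self, vocab_size, num_tags, embed_size=256, hidden_size=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_tags)  # 2x: forward + backward states

    def forward(self, token_ids):
        hidden, _ = self.lstm(self.embed(token_ids))
        # Per-token tag logits; feed to cross-entropy, or to a CRF layer for the -CRF variants.
        return self.out(hidden)

# Smoke test: batch of 2 sentences, 12 tokens each, 9 BIO tags (4 entity types).
logits = BiLSTMLinear(vocab_size=30522, num_tags=9)(torch.randint(0, 30522, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 9])
```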

Requirements

Linux (tested) or Windows (not tested), with NVIDIA GPUs.

Install Dependencies

We recommend using conda to create a Python 3.9 environment. For example:

conda create -n NER python=3.9

Then run the bash script. If you are using Windows, change its extension to .bat.

./install_dependencies.sh

How To Prepare Datasets

For copyright and related reasons, the datasets cannot be redistributed with this repository. You should obtain access to them yourself and place them, in the specified format, into the 'assert/raw_datasets' folder (see the repository for the expected layout).
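For orientation (this is the standard distribution format, not necessarily the exact layout this repo expects), CoNLL2003 files contain one token per line, with the NER tag in the last column and blank lines separating sentences:

```
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
```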

Experiments

Hyper Parameters

| Optimizer | Weight Decay | Warmup Ratio | Batch Size | Gradient Accumulation | Clip Grad Norm | Random Seed |
| --- | --- | --- | --- | --- | --- | --- |
| AdamW | 5e-3 | 0.2 | 1 | 32 | 1.0 | 233 |
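As a hedged illustration of how these settings (gradient accumulation, warmup ratio, and gradient clipping) typically combine in a PyTorch training loop; the toy model and data loader are stand-ins, not this repo's code:

```python
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

# Toy stand-ins so the loop runs as-is; substitute the real NER model/dataloader.
model = nn.Linear(16, 4)
loader = [(torch.randn(1, 16), torch.randint(0, 4, (1,))) for _ in range(64)]

accum_steps = 32                        # "Gradient Accumulation" above (effective batch = 1 * 32)
total_steps = len(loader) // accum_steps
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-3)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.2 * total_steps),   # "Warmup Ratio" = 0.2
    num_training_steps=total_steps,
)
criterion = nn.CrossEntropyLoss()

for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps  # scale so accumulated grads match a large batch
    loss.backward()
    if (step + 1) % accum_steps == 0:            # update only every accum_steps micro-batches
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # "Clip Grad Norm" = 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```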

Training Epochs:

| Dataset | Full Data | Few Shot |
| --- | --- | --- |
| CoNLL2003 | 12 | 30 |
| OntoNotes5 (Chinese) | 12 | 30 |
| CCKS2019 | 12 | 30 |
| NCBI-disease | 20 | 30 |

Learning Rates:

| Model | CoNLL2003 | OntoNotes5 | CCKS2019 | NCBI-disease |
| --- | --- | --- | --- | --- |
| BiLSTM-Linear | 0.001 | 0.001 | NA | NA |
| BiLSTM-Linear-CRF | 0.001 | 0.001 | NA | NA |
| BERT-Linear | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
| BERT-Linear-CRF | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
| BERT-BiLSTM-Linear | 0.0001 | 3e-5 | NA | NA |
| BERT-BiLSTM-Linear-CRF | 0.0001 | 1e-5 | NA | NA |
| BERT(Prompt) | 3e-5 | 3e-5 | 0.0001 | 0.0001 |

Model Parameters

| BERT Model | Embedding Size (for models without BERT) | LSTM Hidden Size | LSTM Layers |
| --- | --- | --- | --- |
| bert-base-uncased (CoNLL2003, NCBI-disease) | 256 | 256 | 2 |
| bert-base-chinese (OntoNotes5, CCKS2019) | 256 | 256 | 2 |

Results

Full Data Results

General datasets (CoNLL2003 and OntoNotes5 (Chinese)).

| Dataset | Model | Overall Span-Based Micro F1 | Avg. Training Time Per Epoch (Quadro RTX 8000) |
| --- | --- | --- | --- |
| CoNLL2003 | BiLSTM-Linear | 0.6517 | 13.98s |
| CoNLL2003 | BiLSTM-Linear-CRF | 0.6949 | 44.07s |
| CoNLL2003 | BERT-Linear | 0.8984 | 81.81s |
| CoNLL2003 | BERT-Linear-CRF | 0.8978 | 120.94s |
| CoNLL2003 | BERT-BiLSTM-Linear | 0.8819 | 117.37s |
| CoNLL2003 | BERT-BiLSTM-Linear-CRF | 0.8874 | 130.85s |
| CoNLL2003 | BERT(Prompt) | 0.9231 | 99.70s |
| OntoNotes5 (Chinese) | BiLSTM-Linear | 0.6380 | 160.55s |
| OntoNotes5 (Chinese) | BiLSTM-Linear-CRF | 0.7033 | 319.87s |
| OntoNotes5 (Chinese) | BERT-Linear | 0.7403 | 413.20s |
| OntoNotes5 (Chinese) | BERT-Linear-CRF | 0.7536 | 595.71s |
| OntoNotes5 (Chinese) | BERT-BiLSTM-Linear | 0.7511 | 590.53s |
| OntoNotes5 (Chinese) | BERT-BiLSTM-Linear-CRF | 0.7616 | 800.23s |
| OntoNotes5 (Chinese) | BERT(Prompt) | 0.7376 | 485.56s |
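The span-based micro F1 reported here comes from seqeval, which scores whole entity spans rather than individual tags. A minimal example:

```python
from seqeval.metrics import f1_score

# Gold: one PER span and one LOC span; the prediction recovers only the PER span.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]

# Entity-level scoring: precision 1/1, recall 1/2 -> F1 = 2/3.
print(f1_score(y_true, y_pred))  # 0.666...
```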

Medical datasets, evaluated with both a general BERT and a domain-specific medical BERT.

| Dataset | BERT | Model | Overall Span-Based Micro F1 |
| --- | --- | --- | --- |
| CCKS2019 Subtask 1 | bert-base-chinese | BERT-Linear | 0.8057 |
| CCKS2019 Subtask 1 | bert-base-chinese | BERT-Linear-CRF | 0.8120 |
| CCKS2019 Subtask 1 | bert-base-chinese | BERT-Prompt | 0.7685 |
| CCKS2019 Subtask 1 | medbert-base-chinese | BERT-Linear | 0.8201 |
| CCKS2019 Subtask 1 | medbert-base-chinese | BERT-Linear-CRF | 0.8222 |
| CCKS2019 Subtask 1 | medbert-base-chinese | BERT-Prompt | 0.7933 |
| NCBI-disease | bert-base-uncased | BERT-Linear | 0.8721 |
| NCBI-disease | bert-base-uncased | BERT-Linear-CRF | 0.8779 |
| NCBI-disease | bert-base-uncased | BERT-Prompt | 0.8320 |
| NCBI-disease | biobert-base-cased-v1.2 | BERT-Linear | 0.8730 |
| NCBI-disease | biobert-base-cased-v1.2 | BERT-Linear-CRF | 0.8775 |
| NCBI-disease | biobert-base-cased-v1.2 | BERT-Prompt | 0.8549 |

Few Shot Results

The few-shot splits sample 1% of each training set with a fixed random seed, and reuse the hyperparameters from the full-data experiments.
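A minimal sketch of this kind of seeded sampling; the function is illustrative rather than the repo's actual code, with seed 233 taken from the hyperparameter table:

```python
import random

def sample_few_shot(trainset, ratio=0.01, seed=233):
    """Draw a reproducible few-shot subset: same seed -> same samples."""
    rng = random.Random(seed)  # isolated RNG; does not disturb global random state
    k = max(1, int(len(trainset) * ratio))
    return rng.sample(trainset, k)

# CoNLL2003 has 6973 training sentences -> a 1% sample is the 69 reported below.
print(len(sample_few_shot(list(range(6973)))))  # 69
```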

Few Shot Test on CoNLL2003:

69 samples were drawn (out of 6,973 total). Entity counts in the few-shot dataset:

{'MISC': 51, 'ORG': 51, 'PER': 59, 'LOC': 90}
All models use bert-base-uncased.

| Model | Overall Span-Based F1 On Full Test Set |
| --- | --- |
| BERT-Linear | 0.6778 |
| BERT-Linear-CRF | 0.6773 |
| BERT-Prompt | 0.7524 |
| BERT-BiLSTM-Linear | 0.0371 |
| BERT-BiLSTM-Linear-CRF | 0.0295 |

Few Shot Test on CCKS2019:

10 samples were drawn (out of 1,000 total). Entity counts in the few-shot dataset:

{'手术': 9, '影像检查': 5, '疾病和诊断': 45, '解剖部位': 48, '实验室检验': 19, '药物': 10}

(手术 = surgery, 影像检查 = imaging examination, 疾病和诊断 = disease and diagnosis, 解剖部位 = anatomical site, 实验室检验 = laboratory test, 药物 = medication)
Overall span-based F1 on the full test set:

| Model | bert-base-chinese | medbert-base-chinese |
| --- | --- | --- |
| BERT-Linear | 0.4392 | 0.4730 |
| BERT-Linear-CRF | 0.4790 | 0.5374 |
| BERT-Prompt | 0.0039 | 0.4334 |

Few Shot Test on NCBI-disease:

54 samples were drawn (out of roughly 5,400 total), containing 41 disease entities.

Overall span-based F1 on the full test set (the second column matches the biobert model used in the full-data NCBI-disease experiments):

| Model | bert-base-uncased | biobert-base-cased-v1.2 |
| --- | --- | --- |
| BERT-Linear | 0.6450 | 0.6617 |
| BERT-Linear-CRF | 0.6521 | 0.6778 |
| BERT-Prompt | 0.5553 | 0.5979 |

Acknowledgement And Citations

People And Organizations

Third-Party Libraries