This project is mainly based on PyTorch. It evaluates common NER paradigm models on the different kinds of Chinese NER datasets (flat, nested, and discontinuous), covering the series of NER models listed below.

Mainly tested on the following Chinese NER datasets:

Standard sequence-labeling NER data is processed into the following format:
```json
{
  "text": ["吴", "重", "阳", ",", "中", "国", "国", "籍", ","],
  "label": ["B-NAME", "I-NAME", "I-NAME", "O", "B-CONT", "I-CONT", "I-CONT", "I-CONT", "O"]
}
```
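As an illustration (not code from this repo), a minimal sketch of decoding the BIO format above back into typed entity spans:

```python
def bio_to_entities(tokens, labels):
    """Decode BIO-tagged tokens into (entity_type, text) pairs."""
    entities, start, ent_type = [], None, None
    # Append an "O" sentinel so an entity ending at the last token is flushed.
    for i, label in enumerate(labels + ["O"]):
        if label == "O" or label.startswith("B-"):
            if start is not None:  # close the currently open entity
                entities.append((ent_type, "".join(tokens[start:i])))
                start, ent_type = None, None
        if label.startswith("B-"):  # open a new entity
            start, ent_type = i, label[2:]
    return entities

tokens = ["吴", "重", "阳", ",", "中", "国", "国", "籍", ","]
labels = ["B-NAME", "I-NAME", "I-NAME", "O",
          "B-CONT", "I-CONT", "I-CONT", "I-CONT", "O"]
print(bio_to_entities(tokens, labels))
# → [('NAME', '吴重阳'), ('CONT', '中国国籍')]
```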
Reading-comprehension-style NER (MRC-NER) data is processed into the following format:
```json
{
  "context": "图 为 马 拉 维 首 都 利 隆 圭 政 府 办 公 大 楼 。 ( 本 报 记 者 温 宪 摄 )",
  "end_position": [4, 15],
  "entity_label": "NS",
  "impossible": false,
  "qas_id": "3820.1",
  "query": "按照地理位置划分的国家,城市,乡镇,大洲",
  "span_position": ["2;4", "7;15"],
  "start_position": [2, 7]
}
```
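A minimal sketch of recovering the entity strings from such an example, assuming (as the sample above suggests) that `start_position`/`end_position` are inclusive indices over the whitespace-split `context`:

```python
def mrc_entity_strings(example):
    """Extract entity surface strings from one MRC-NER example,
    treating start/end as inclusive token indices."""
    tokens = example["context"].split()
    return ["".join(tokens[s:e + 1])
            for s, e in zip(example["start_position"], example["end_position"])]

example = {
    "context": "图 为 马 拉 维 首 都 利 隆 圭 政 府 办 公 大 楼 。 ( 本 报 记 者 温 宪 摄 )",
    "start_position": [2, 7],
    "end_position": [4, 15],
}
print(mrc_entity_strings(example))
# → ['马拉维', '利隆圭政府办公大楼']
```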
Requirements: python==3.8, transformers>=4.12.3, torch==1.8.0. Alternatively, install them with:

```shell
pip install -r requirements.txt
```
You can start training by running:

```shell
bash script/train.sh
bash script/mrc_train.sh
```
Best F1 scores on the test sets:

model | MSRA | OntoNotes |
---|---|---|
BERT-Softmax | 0.9553 | 0.8181 |
BERT-BiLSTM-Softmax | 0.9566 | 0.8177 |
BERT-BiLSTM-LabelSmooth | 0.9549 | 0.8215 |
BERT-Crf | 0.9562 | 0.8218 |
BERT-BiLSTM-Crf | 0.9561 | 0.8227 |
BERT-BiLSTM-Crf-LabelSmooth | 0.9547 | 0.8216 |
BERT-BiLSTM-Crf-LEBERT | 0.9518 | 0.8094 |
BERT-BiLSTM-Softmax-LEBERT | 0.9544 | 0.8196 |
MRC | 0.942 | 0.812 |
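The LabelSmooth variants above replace the one-hot cross-entropy target with a smoothed distribution. A minimal pure-Python sketch of the loss (the epsilon value and the uniform smoothing scheme here are illustrative assumptions, not necessarily this repo's exact settings):

```python
import math

def smoothed_cross_entropy(log_probs, target, num_classes, epsilon=0.1):
    """Cross-entropy with label smoothing: the target distribution puts
    1 - epsilon on the gold class and epsilon/(C-1) on every other class."""
    off_mass = epsilon / (num_classes - 1)
    loss = 0.0
    for c in range(num_classes):
        q = 1.0 - epsilon if c == target else off_mass
        loss -= q * log_probs[c]
    return loss

# With a uniform prediction over 3 classes, the loss equals log(3)
# regardless of smoothing, since the target masses sum to 1.
uniform = [math.log(1 / 3)] * 3
print(smoothed_cross_entropy(uniform, target=0, num_classes=3))
# → 1.0986... (i.e. log 3)
```

With epsilon = 0, this reduces to the ordinary cross-entropy; the smoothing discourages the model from becoming over-confident on the gold label.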
GPU: RTX 3060 Ti, 8 GB

On speed, taking the MSRA dataset (41,728 training samples) as an example, approximate total training times are listed below. Overall, CRF is noticeably slower.
model | time | batch_size |
---|---|---|
BERT-Softmax | 6min 14s | 24 |
BERT-BiLSTM-Softmax | 6min 46s | 24 |
BERT-Crf | 8min 06s | 24 |
BERT-BiLSTM-Crf | 8min 20s | 24 |
MRC | 50min 10s | 4 |