A model that takes both characters and words as input for the Chinese NER task.
Models and results can be found in our NAACL 2019 paper "An Encoding Strategy Based Word-Character LSTM for Chinese NER". It achieves state-of-the-art performance on most of the datasets.
Most of the code is written with reference to Jie Yang's "NCRF++". To learn more about "NCRF++", please refer to the paper "NCRF++: An Open-source Neural Sequence Labeling Toolkit".
Python 3.6
PyTorch 0.4.0
If you want to use TensorBoard with our code, you should also install the following:
tensorboardX 1.2
tensorflow 1.6.0
CoNLL format (BIOES tag scheme preferred), with one character and its label per line. Sentences are separated by a blank line.
美 B-LOC
国 E-LOC
的 O
华 B-PER
莱 I-PER
士 E-PER
我 O
跟 O
他 O
谈 O
笑 O
风 O
生 O
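A minimal sketch of a loader for this format (the function name `read_conll` is illustrative and not part of this repository): each non-blank line holds a character and its label, and a blank line closes the current sentence.

```python
def read_conll(path):
    """Read CoNLL-format data: one `char label` pair per line,
    sentences separated by a blank line."""
    sentences, chars, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # blank line: close the current sentence, if any
                if chars:
                    sentences.append((chars, labels))
                    chars, labels = [], []
                continue
            char, label = line.split()
            chars.append(char)
            labels.append(label)
    if chars:  # file may not end with a blank line
        sentences.append((chars, labels))
    return sentences
```

Each returned item is a `(characters, labels)` pair for one sentence, so the two sentences in the example above would come back as two entries.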
Character embeddings: gigword_chn.all.a2b.uni.ite50.vec
Word embeddings: ctb.50d.vec
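Assuming the embedding files use the common whitespace-separated text format (one token per line followed by its vector components), a hedged loader sketch could look like this (`load_embeddings` is an illustrative name, not a function in this repository):

```python
import numpy as np

def load_embeddings(path, dim=50):
    """Load pretrained embeddings from a whitespace-separated text file:
    one token per line followed by `dim` float components."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:
                continue  # skip a possible header line or malformed rows
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table
```

Both files above are 50-dimensional, so `dim=50` matches their vectors.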
Put each dataset in the data directory, then simply run the corresponding .py file.
For example, to run the Weibo experiment, just run: python3 weibo.py
@inproceedings{liu-etal-2019-encoding,
    title = "An Encoding Strategy Based Word-Character {LSTM} for {C}hinese {NER}",
    author = "Liu, Wei and Xu, Tongge and Xu, Qinghua and Song, Jiayu and Zu, Yueran",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1247",
    pages = "2379--2389"
}