# Check Out Our New NER Toolkit 🚀🚀🚀
**AutoNER** trains named entity taggers with distant supervision, with no line-by-line annotations required.
Details about AutoNER can be accessed at: https://arxiv.org/abs/1809.03599
## Benchmarks

Performance on the BC5CDR dataset (precision, recall, and F1 in %):

| Method | Precision | Recall | F1 |
| ------ | --------- | ------ | -- |
| Supervised Benchmark | 88.84 | 85.16 | 86.96 |
| Dictionary Match | 93.93 | 58.35 | 71.98 |
| Fuzzy-LSTM-CRF | 88.27 | 76.75 | 82.11 |
| AutoNER | 88.96 | 81.00 | 84.80 |
## Required Inputs

- **Raw text**: `data/BC5CDR/raw_text.txt`
- **Core dictionary (with type info)**: `data/BC5CDR/dict_core.txt`
- **Full dictionary (without type info)**: `data/BC5CDR/dict_full.txt`
- **Pre-trained word embedding**: `embedding/bio_embedding.txt`, which can be downloaded from our group's server. For example:

```
curl http://dmserv4.cs.illinois.edu/bio_embedding.txt -o embedding/bio_embedding.txt
```

Since the embedding encoding step consumes quite a lot of memory, we also provide the encoded file in `autoner_train.sh`.
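To make the expected layouts concrete, here is a small illustrative excerpt of the two dictionary files. The entries are made up, and the layout is an assumption: tab-separated `Type<TAB>surface` lines for the core dictionary, and one high-quality phrase per line for the full dictionary.

```
# dict_core.txt -- assumed layout: tab-separated type and tokenized surface form
Chemical	indomethacin
Disease	renal failure

# dict_full.txt -- assumed layout: one tokenized high-quality phrase per line
indomethacin
renal failure
prostaglandin
```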
## Development & Test Sets

`data/BC5CDR/truth_dev.ck` and `data/BC5CDR/truth_test.ck` are the development and test sets. Each line has three columns: the token, its `Tie or Break` label, and its entity type.

- `I` means `Break`.
- `O` means `Tie`.
- `<s>` and `<eof>` mark the start and the end of a sentence.
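For orientation, a sketch of that three-column layout (tab-separated here). The tokens and entity types are invented, and the type marker used for non-entity tokens is an assumption; only the column order, the `I`/`O` labels, and the `<s>`/`<eof>` markers follow the description above.

```
<s>	I	None
indomethacin	I	Chemical
induced	I	None
renal	I	Disease
failure	O	Disease
<eof>	I	None
```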
## Dependencies

This project is based on `python>=3.6`. The dependent packages are listed below:
```
numpy==1.13.1
tqdm
torch-scope>=0.5.0
pytorch==0.4.1
```
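If you manage dependencies with pip, a minimal setup could look like the following. Note that PyTorch's pip package is named `torch`, and a 0.4.1 wheel must be available for your platform.

```
# Install the pinned dependencies into the current environment.
pip install numpy==1.13.1 tqdm "torch-scope>=0.5.0" torch==0.4.1
```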
## Commands

To train an AutoNER model, please run:

```
./autoner_train.sh
```

To apply the trained AutoNER model, please run:

```
./autoner_test.sh
```

You can specify the parameters in the bash files; the variable names are self-explanatory.
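As an illustration of what those parameters typically cover, a configuration block at the top of `autoner_train.sh` might look like this. All variable names below are hypothetical; check the script itself for the actual ones.

```
# Hypothetical variables -- the actual names are defined in autoner_train.sh.
MODEL_NAME=BC5CDR
RAW_TEXT=data/BC5CDR/raw_text.txt
DICT_CORE=data/BC5CDR/dict_core.txt
DICT_FULL=data/BC5CDR/dict_full.txt
EMBEDDING=embedding/bio_embedding.txt
GPU_ID=0
```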
## Citation

Please cite the following two papers if you are using our tool. Thanks!
```
@inproceedings{shang2018learning,
  title = {Learning Named Entity Tagger using Domain-Specific Dictionary},
  author = {Shang, Jingbo and Liu, Liyuan and Ren, Xiang and Gu, Xiaotao and Ren, Teng and Han, Jiawei},
  booktitle = {EMNLP},
  year = {2018}
}

@article{shang2018automated,
  title = {Automated Phrase Mining from Massive Text Corpora},
  author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {2018},
  publisher = {IEEE}
}
```