Chinese-NRE

Update: We have released the manually annotated financial relation extraction dataset FinRE in data/FinRE. It contains 44 relations (bidirectional) and more than 18,000 instances. Feel free to download the dataset, and please cite our paper if you use it in your work.

Source code for the ACL 2019 paper "Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge". Some of the code in this repository is based on the excellent open-source project https://github.com/jiesutd/LatticeLSTM.

Requirements

Python 3.6
PyTorch 0.4.1

Datasets

Three datasets are used in our paper:

FinRE: A manually labeled financial news RE dataset, now released in data/FinRE.
SanWen: A Chinese literature NER-RE dataset. Its source is https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset.
ACE 2005: A benchmark RE dataset. Under the LDC terms, we are not allowed to share the dataset with third parties. If you have an LDC license, please obtain the dataset (LDC2006T06) and convert it to the data format described below yourself.

The train.txt, dev.txt and test.txt files shipped in this project are all from SanWen.

Data Format

Input Format

data/SanWen/train.txt, dev.txt, test.txt: One instance per line, with 4 columns separated by tab characters. The first and second columns are the head and tail entities, the third column is the relation label, and the last column is the text:

[head]  [tail]  [relation]    text

For example (one line):

 湖底    卵石    Located     连湖底的卵石颜色也可分辨
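As an illustration of the instance format above, the following is a minimal sketch of a line parser. The names Instance and parse_instance are hypothetical and are not taken from this repository; the sketch only assumes the four tab-separated columns described above.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    """One relation-extraction instance (hypothetical helper type)."""
    head: str
    tail: str
    relation: str
    text: str


def parse_instance(line: str) -> Instance:
    """Split one tab-separated line into head, tail, relation and text."""
    # maxsplit=3 keeps any stray tab inside the sentence in the text column.
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    head, tail, relation, text = (field.strip() for field in fields)
    return Instance(head, tail, relation, text)


# The example line from this README, written with explicit tab separators.
sample = "湖底\t卵石\tLocated\t连湖底的卵石颜色也可分辨\n"
inst = parse_instance(sample)

# Character offsets of the two entities inside the sentence.
print(inst.relation, inst.text.find(inst.head), inst.text.find(inst.tail))
```

Looking up the entity offsets with str.find is only a convenience for inspection; it returns the first occurrence, which may not be the intended mention when an entity string appears more than once in the sentence.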

data/SanWen/relation2id.txt: One relation per line, with 2 columns separated by a tab character. The first column is the label and the second is the corresponding ID:

[relation]  [ID]
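The two-column layout above could be read into a dictionary roughly as follows. This is a sketch under stated assumptions: load_relation2id is a hypothetical helper, the file is assumed to be UTF-8 encoded, and the label and ID pairs in the demo are made up for illustration rather than copied from the real file.

```python
import io
from typing import Dict, TextIO


def load_relation2id(handle: TextIO) -> Dict[str, int]:
    """Read 'label<TAB>id' lines into a label -> integer ID mapping."""
    relation2id: Dict[str, int] = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        relation, rel_id = line.rsplit("\t", 1)
        if relation in relation2id:
            raise ValueError(f"line {line_no}: duplicate relation {relation!r}")
        relation2id[relation] = int(rel_id)
    return relation2id


# Illustrative content only; these IDs are NOT the real SanWen IDs.
demo = io.StringIO("Unknown\t0\nLocated\t1\n")
mapping = load_relation2id(demo)
print(mapping)

# With the real file (encoding is an assumption):
# with open("data/SanWen/relation2id.txt", encoding="utf-8") as f:
#     mapping = load_relation2id(f)
```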

Pre-trained Character Embeddings

data/vec.txt: One character per line. In each line, the first column is the character and the remaining columns are the values of its embedding vector.
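A file in this layout might be loaded as sketched below. The helper name load_embeddings is hypothetical, and the sketch assumes whitespace-separated columns and no word2vec-style header line (a first line holding only the vocabulary size and dimension); if the real file has such a header, it would need to be skipped first. The 3-dimensional vectors in the demo are made up.

```python
import io
from typing import Dict, List, TextIO


def load_embeddings(handle: TextIO) -> Dict[str, List[float]]:
    """Read 'key v1 v2 ... vd' lines into a key -> vector mapping."""
    table: Dict[str, List[float]] = {}
    dim = None
    for line_no, line in enumerate(handle, start=1):
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        key = parts[0]
        values = [float(v) for v in parts[1:]]
        if dim is None:
            dim = len(values)  # the first vector fixes the dimension
        elif len(values) != dim:
            raise ValueError(
                f"line {line_no}: expected {dim} values, got {len(values)}"
            )
        table[key] = values
    return table


# Toy data with made-up values, not taken from the real vec.txt.
demo = io.StringIO("湖 0.1 0.2 0.3\n底 -0.4 0.5 0.6\n")
char_vectors = load_embeddings(demo)
print(len(char_vectors), len(char_vectors["湖"]))
```

Since data/sense.txt is described as having a similar layout, the same sketch would presumably apply to it, with sense keys in place of characters.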

Pre-trained Word-Sense Embeddings

data/sense.txt: Same format as the character embeddings, but with one word sense per line. For example:

释放#1 0.304095 ...
释放#2 -0.175496 ...
夏天 -0.230772 ...

Here, A#n denotes the n-th sense of word A (where A is a polysemous word). The word-sense embeddings can be trained with the SAT (Sememe Attention over Target) approach.
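The A#n naming convention can be taken apart with a few lines of code. The sketch below uses a hypothetical helper, split_sense_key, and assumes that a key without a "#n" suffix (such as 夏天 in the example) is a word with a single sense.

```python
from typing import Optional, Tuple


def split_sense_key(key: str) -> Tuple[str, Optional[int]]:
    """Split 'word#n' into (word, n); return (key, None) if there is no suffix."""
    word, sep, index = key.rpartition("#")
    if sep and word and index.isdigit():
        return word, int(index)
    # No '#n' suffix: treat the whole key as a word with a single sense.
    return key, None


for key in ("释放#1", "释放#2", "夏天"):
    print(key, split_sense_key(key))
```

Using rpartition rather than split means that only the last "#" is treated as the separator, so a word that itself contains "#" would still be handled.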

Word-Sense Map

data/sense_map.txt: Records all senses of each polysemous word, matching the keys of the word-sense embeddings. One word per line: the first column is the word, and the remaining columns (if any) are its senses. For example:

释放 释放#1 释放#2
夏天

The sense_map file can be built from HowNet.
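To show how the sense map relates to the sense embeddings, here is a small sketch. Both helpers, load_sense_map and embedding_keys, are hypothetical, and the columns are assumed to be whitespace-separated. The fallback to the word itself for entries with no listed senses is inferred from the 夏天 example, not confirmed from the code.

```python
import io
from typing import Dict, List, TextIO


def load_sense_map(handle: TextIO) -> Dict[str, List[str]]:
    """Read 'word sense1 sense2 ...' lines into a word -> senses mapping."""
    sense_map: Dict[str, List[str]] = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue
        sense_map[parts[0]] = parts[1:]  # empty list when no senses are listed
    return sense_map


def embedding_keys(word: str, sense_map: Dict[str, List[str]]) -> List[str]:
    """Keys to look up in the sense embeddings for a given word.

    Assumption: a word with no listed senses is stored under its own
    surface form, as 夏天 is in the sense.txt example.
    """
    return sense_map.get(word) or [word]


demo = io.StringIO("释放 释放#1 释放#2\n夏天\n")
senses = load_sense_map(demo)
print(embedding_keys("释放", senses))
print(embedding_keys("夏天", senses))
```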

Data Preparation

You can download the pre-trained character embeddings vec.txt, pre-trained word-sense embeddings sense.txt and word-sense map sense_map.txt from Tsinghua Cloud or Google Drive. Then put them in place following the folder structure:

MG-Lattice
|-- ...
|-- data
    |
    |-- sense.txt
    |
    |-- vec.txt
    |
    |-- sense_map.txt
    |
    |-- DATASET_NAME_1
    |   |
    |   |-- train.txt
    |   |-- dev.txt
    |   |-- test.txt
    |   |-- relation2id.txt
    |
    |-- DATASET_NAME_2
        |-- ...
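Before training, it can be handy to verify that the files are where the layout above expects them. The following is a minimal sketch, not part of the repository: missing_files is a hypothetical helper, and the per-dataset file names are passed in explicitly so that the check works with whatever names your dataset folder uses.

```python
from pathlib import Path
from typing import List, Sequence

# Files expected directly under the data folder.
SHARED_FILES = ("vec.txt", "sense.txt", "sense_map.txt")


def missing_files(
    data_dir: str, dataset: str, dataset_files: Sequence[str]
) -> List[Path]:
    """Return every expected file that does not exist on disk."""
    root = Path(data_dir)
    expected = [root / name for name in SHARED_FILES]
    expected += [root / dataset / name for name in dataset_files]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    names = ["train.txt", "dev.txt", "test.txt", "relation2id.txt"]
    for path in missing_files("data", "SanWen", names):
        print("missing:", path)
```

An empty result means that every expected file was found; nothing is printed in that case.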

How to run

Arguments are set in configure.py; the default values are for the SanWen dataset. The full list of options is:

-- savemodel         path to save the model
-- loadmodel         path to load the model
-- savedset          path to load the data settings

-- public_path       the parent path of the dataset                        (data/)
-- dataset           the folder name of the dataset                        (SanWen/)
-- train_file        training dataset                                      (train.txt)
-- dev_file          development dataset                                   (dev.txt)
-- test_file         test dataset                                          (test.txt)
-- relation2id       map from relation to ID                               (relation2id.txt)
-- char_emb_file     pre-trained character embeddings                      (vec.txt)
-- sense_emb_file    pre-trained sense embeddings                          (sense.txt)
-- word_sense_map    record of polysemous words                            (sense_map.txt)
-- max_length        the max length of the input

-- Encoder           specify which encoder to use
-- Optimizer         specify which optimizer to use
-- lr                learning rate
-- weights_mode      mode to set weights for each class in the loss function
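The path-related options above presumably combine into full file paths that match the folder structure: dataset files under public_path/dataset, and the embedding files directly under public_path. The sketch below illustrates that reading with the default values listed above. resolve_paths is a hypothetical helper, and how main.py actually joins these values is an assumption.

```python
import os
from typing import Dict

# Default values as listed in the option table above.
DEFAULTS = {
    "public_path": "data/",
    "dataset": "SanWen/",
    "train_file": "train.txt",
    "dev_file": "dev.txt",
    "test_file": "test.txt",
    "relation2id": "relation2id.txt",
    "char_emb_file": "vec.txt",
    "sense_emb_file": "sense.txt",
    "word_sense_map": "sense_map.txt",
}


def resolve_paths(cfg: Dict[str, str]) -> Dict[str, str]:
    """Join the path options into full file paths (assumed layout)."""
    dataset_dir = os.path.join(cfg["public_path"], cfg["dataset"])
    paths = {
        key: os.path.join(dataset_dir, cfg[key])
        for key in ("train_file", "dev_file", "test_file", "relation2id")
    }
    # Embedding files sit next to the dataset folders, not inside them.
    for key in ("char_emb_file", "sense_emb_file", "word_sense_map"):
        paths[key] = os.path.join(cfg["public_path"], cfg[key])
    return paths


for name, path in resolve_paths(DEFAULTS).items():
    print(f"{name:16s} {path}")
```

To point the sketch at another dataset, only the dataset entry (and, if needed, the file names) has to change.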

With appropriate configuration and data preparation, you can run the model by:

python main.py

Citation

If you use the code, please cite the paper:

@inproceedings{li2019chinese,
 title={Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge},
 author={Li, Ziran and Ding, Ning and Liu, Zhiyuan and Zheng, Hai-Tao and Shen, Ying},
 booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
 pages={4377--4386},
 year={2019}
}
