Submitted Abstract:

Title:

Semi-supervised relation extraction using word vectors and syntax patterns

Author:

Harrison Pielke-Lombardo

Keywords:

NLP, ontology, machine learning

Significance:

The biomedical literature is the largest source of biomedical knowledge, yet we lack methods to reliably extract relational information from its unstructured text. Significant advances in named entity recognition and syntactic parsing let us identify many important features of text, but determining the relationships asserted between entities remains an open problem. Co-occurrence and other methods have been attempted, but significant barriers remain: a lack of gold standard data for training and evaluating models, the difficulty of generating the potentially complex semantic and syntactic patterns that imply relationships, and an unknown number of potential relations between entities. I present a semi-supervised, word-embedding-based method that has the potential to address each of these challenges.

Research State:

In Process

Submission Type:

Talk and poster

Abstract:

Many tasks could benefit from effective automatic relation extraction, from identifying features in clinical notes to finding relevant reference material to building knowledge graphs. However, significant barriers stand in the way of reliably identifying relations in the literature: a lack of gold standard data, the many syntactic patterns that can imply a relation, and an unknown number of potential relations between any two concepts. I present a semi-supervised, word-embedding-based method that has the potential to address each of these challenges. The method uses bootstrapping, which requires only a small number of seed relations to start the training process; relations matched in each round are then used to reseed the algorithm. Within a sentence, each pair of concepts has a context from which I generate extraction patterns. The context is formed from the tokens along the dependency path between the two concepts, which is computed using SyntaxNet. Those tokens are then converted into word embeddings from BioSentVec (Chen et al., 2018) trained using Word2Vec (Mikolov et al., 2013). Word embeddings capture both syntactic and semantic information about a word and are efficient to compute with. The word embeddings for a context are combined into a context embedding, and context embeddings are clustered into extraction patterns. New sentences are matched to a pattern by the cosine similarity of their context embeddings. The method can be evaluated on the few gold standard test sets that exist. One of these is the dataset from the BioCreative VI competition (Islamaj Dogan et al., 2017), which focuses on finding relations between chemicals and protein targets and on protein-protein interactions. Outside the biomedical domain, there are also social media-based datasets on which this method could be applied and evaluated.
Other potential applications include determining disease status from clinical notes, identifying potential reference materials for researchers and clinicians, and expanding knowledge bases.
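
The matching and reseeding steps described in the abstract can be illustrated with a minimal sketch. It is not the submitted implementation and rests on several assumptions: the dependency-path tokens are taken as already extracted (no parser is called), the toy 3-dimensional vectors in TOY_VECTORS are made up and stand in for pretrained embeddings, token vectors are combined by simple averaging, and a single centroid stands in for the clustering step. The function names, the threshold of 0.95, and the entity pairs are all hypothetical.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for pretrained word embeddings.
TOY_VECTORS = {
    "inhibits": np.array([0.9, 0.1, 0.0]),
    "blocks": np.array([0.8, 0.2, 0.1]),
    "suppresses": np.array([0.85, 0.15, 0.05]),
    "observed": np.array([0.1, 0.1, 0.9]),
    "with": np.array([0.0, 0.3, 0.9]),
}


def context_embedding(path_tokens, vectors):
    """Average the vectors of the dependency-path tokens (assumed combiner)."""
    known = [vectors[t] for t in path_tokens if t in vectors]
    if not known:
        return None  # no token on the path has a vector
    return np.mean(known, axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def bootstrap(seeds, candidates, vectors, threshold=0.95, max_rounds=5):
    """Grow a set of matched relations from a few seeds.

    seeds and candidates map an entity pair to the list of tokens on the
    dependency path between the two entities.
    """
    matched = dict(seeds)
    remaining = dict(candidates)
    for _ in range(max_rounds):
        embeddings = [context_embedding(t, vectors) for t in matched.values()]
        embeddings = [e for e in embeddings if e is not None]
        if not embeddings:
            break
        # One centroid as the extraction pattern (stand-in for clustering).
        pattern = np.mean(embeddings, axis=0)
        newly_matched = {}
        for pair, tokens in remaining.items():
            emb = context_embedding(tokens, vectors)
            if emb is not None and cosine_similarity(emb, pattern) >= threshold:
                newly_matched[pair] = tokens
        if not newly_matched:
            break  # nothing new to reseed with
        matched.update(newly_matched)  # reseed with the new matches
        for pair in newly_matched:
            del remaining[pair]
    return matched


seeds = {("aspirin", "COX1"): ["inhibits"]}
candidates = {
    ("drugA", "protB"): ["blocks"],
    ("drugC", "protD"): ["suppresses"],
    ("geneE", "geneF"): ["observed", "with"],
}
result = bootstrap(seeds, candidates, TOY_VECTORS)
print(sorted(result))
```

With these toy vectors, the two candidates whose path tokens lie close to the seed context are absorbed in the first round and the unrelated candidate is left out. A real pipeline would need several patterns per relation and a safeguard against semantic drift as the seed set grows.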

tuh8888 commented 5 years ago

[x] Title
[x] Background/introduction
[x] Method
[x] Results
[x] Conclusion