PheneBank aims to automatically extract and validate a database of human phenotype-disease associations from the scientific literature. This package provides code, data, and models for the following three purposes:
The model is trained to support 9 categories of entities:
Map an entity to its corresponding concept in any of the following 5 ontologies:
Given an input text, extract its entities and map each to its corresponding concept in the ontologies (a pipeline containing both previous stages).
Download the following:
To get started with the pipeline, first obtain the required data and decompress it in the project directory.
Then import pipeline into your project:
from pipeline import pipe
pp = pipe()
input_text = "Risk factors for recurrent respiratory infections in preschool children in China."
Find entities in an input text:
pp.tag(input_text)
The output looks like the following (formatted for clarity): a list of tuples, one tuple per sentence. Each tuple contains two lists: the words and their corresponding tags.
[
(['Risk', 'factors', 'for', 'recurrent', 'respiratory', 'infections', 'in', 'China.'],
['O', 'O', 'O', 'B-Phenotype', 'I-Phenotype', 'I-Phenotype', 'O', 'O'])
]
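The tags above follow the B-/I-/O scheme, so consecutive words can be grouped back into entity mentions. The following is a minimal sketch of that grouping, assuming the output has exactly the shape shown above; `bio_to_spans` is a hypothetical helper and not part of the package.

```python
from typing import List, Tuple


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged words into (entity text, entity type) pairs.

    Hypothetical helper: assumes tags look like 'B-Phenotype',
    'I-Phenotype' or 'O', as in the example output above.
    """
    spans = []
    current_words: List[str] = []
    current_type = None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if any.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the entity currently being built.
            current_words.append(word)
        else:
            # 'O' (or an I- tag that does not continue the current entity):
            # close the open entity and ignore this word.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


# Example data copied from the output shown above.
tagged = [
    (['Risk', 'factors', 'for', 'recurrent', 'respiratory', 'infections', 'in', 'China.'],
     ['O', 'O', 'O', 'B-Phenotype', 'I-Phenotype', 'I-Phenotype', 'O', 'O'])
]
for words, tags in tagged:
    print(bio_to_spans(words, tags))
# prints [('recurrent respiratory infections', 'Phenotype')]
```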
Find entities in the text and harmonise (map) them to their corresponding ontologies:
pp.tag_harmonise(input_text)
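The output contains one list per sentence, and each sentence is a list of tuples. Each tuple has three parts: the word or entity span, its tag (Null if not an entity), and the list of corresponding concept IDs, each paired with a numeric value as in the example below ([] if no mapping was found).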
[
[
('Risk', 'Null', []),
('factors', 'Null', []),
('for', 'Null', []),
('recurrent respiratory infections', 'Phenotype', [('HP:0002205', 1.0)]),
('in', 'Null', []),
('China', 'Null', [])
]
]
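A common next step is to keep only the entities that were mapped to a concept. The sketch below does that on data shaped like the output above; `mapped_entities` is a hypothetical helper, and it assumes that non-entities carry the string 'Null' as their tag and that each mapping is a (concept ID, number) pair, as in the example.

```python
from typing import List, Tuple


def mapped_entities(harmonised) -> List[Tuple[str, str, List[str]]]:
    """Return (text, tag, concept IDs) for every entity with at least one mapping.

    Hypothetical helper: `harmonised` is assumed to be a list of sentences,
    each a list of (text, tag, mappings) tuples as shown above.
    """
    results = []
    for sentence in harmonised:
        for text, tag, mappings in sentence:
            # Skip plain words and entities without any mapping.
            if tag == 'Null' or not mappings:
                continue
            # Keep only the concept IDs, dropping the paired number.
            concept_ids = [concept_id for concept_id, _ in mappings]
            results.append((text, tag, concept_ids))
    return results


# Example data copied from the output shown above.
harmonised = [
    [
        ('Risk', 'Null', []),
        ('factors', 'Null', []),
        ('for', 'Null', []),
        ('recurrent respiratory infections', 'Phenotype', [('HP:0002205', 1.0)]),
        ('in', 'Null', []),
        ('China', 'Null', []),
    ]
]
print(mapped_entities(harmonised))
# prints [('recurrent respiratory infections', 'Phenotype', ['HP:0002205'])]
```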
data directory.
utils/project_config.py.
Use the ontology_embedding.py script under grounding to create a new semantic embedding.
You can use the following command in the "embeddings" directory to binarise the ontology embedding:
$ ./convertvec txt2bin [embedding.txt] [embedding.bin]
(convertvec script from https://github.com/marekrei/convertvec)
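For readers who want to see what binarising an embedding involves, here is a minimal pure-Python sketch of the widely used word2vec binary layout: a text header with the vocabulary size and dimension, then each word followed by a space, its values as 32-bit floats, and a newline. It is an assumption that convertvec produces exactly this layout, and little-endian byte order is also assumed; the function names and the toy vectors are hypothetical.

```python
import os
import struct
import tempfile
from typing import Dict, List


def write_word2vec_binary(vectors: Dict[str, List[float]], path: str) -> None:
    """Write vectors in the assumed word2vec binary layout."""
    if not vectors:
        raise ValueError("no vectors to write")
    dim = len(next(iter(vectors.values())))
    with open(path, "wb") as fh:
        # Header line: "<vocabulary size> <dimension>\n"
        fh.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
        for word, values in vectors.items():
            fh.write(word.encode("utf-8") + b" ")
            fh.write(struct.pack(f"<{dim}f", *values))  # little-endian float32
            fh.write(b"\n")


def read_word2vec_binary(path: str) -> Dict[str, List[float]]:
    """Read vectors back from the layout written above."""
    vectors = {}
    with open(path, "rb") as fh:
        vocab_size, dim = map(int, fh.readline().split())
        for _ in range(vocab_size):
            word_bytes = bytearray()
            while True:
                ch = fh.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline that ends the previous record
                    word_bytes.extend(ch)
            values = struct.unpack(f"<{dim}f", fh.read(4 * dim))
            vectors[word_bytes.decode("utf-8")] = list(values)
    return vectors


# Round trip with toy vectors (values chosen to be exact in float32).
toy = {"HP:0002205": [0.5, -1.25, 2.0], "HP:0000001": [1.0, 0.0, -0.5]}
with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "embedding.bin")
    write_word2vec_binary(toy, bin_path)
    restored = read_word2vec_binary(bin_path)
print(restored == toy)
```

Values that are not exactly representable as 32-bit floats will differ slightly after a round trip, which is expected for this format.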
The tagging stage relies on Anago, a Bidirectional LSTM-CRF for Sequence Labeling: https://github.com/Hironsan/anago
M.T. Pilehvar, D. Smedley, A. Bernard, and N. Collier: PheneBank: a literature-based database of phenotypes. Bioinformatics, Volume 38, Issue 4, 2022.