hammerlab / t-cell-relation-extraction

Literature mining for T cell relations

T Cell Relation Extraction (TCRE)

This repository contains the scripts and analysis needed to extract relationships between T cells, cytokines, and transcription factors from a large PMC corpus using Data Programming. In short, the goal of this research is to identify relations like the one below, which are often referenced as a small part of larger cell signaling networks:

Information Flow

The relations are identified by a weakly supervised classifier. Its training signal comes from several sources: distant supervision from immuneXpresso, heuristics, text patterns, and standard supervised classifiers trained on a small manually labeled data split. Snorkel is used to develop a generative model on top of the classifications from these sources, and the weak labels from that model are then fed into a noise-aware classifier (trained on ~50k examples per relation). A high-level overview of this information flow is shown below:
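As a rough illustration of the flow described above, the sketch below runs a few labeling functions over candidate relations, combines their votes into a probabilistic label, and scores a prediction with a noise-aware loss. It is not the code in this repository and does not use the Snorkel API. Every name in it (`Candidate`, the `lf_*` functions, `KNOWN_PAIRS`, the vote encoding) is a hypothetical chosen for illustration, and a plain vote average stands in for Snorkel's generative model, which instead learns how accurate each source is.

```python
import math
from dataclasses import dataclass

# Vote encoding is an assumption made for this sketch.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

@dataclass
class Candidate:
    """A hypothetical candidate relation: one sentence plus two entity mentions."""
    sentence: str
    cell_type: str
    cytokine: str

# Hypothetical stand-in for a distant supervision source of known pairs.
KNOWN_PAIRS = {("Th17", "IL-17")}

def lf_distant_supervision(c: Candidate) -> int:
    """Vote positive when the pair appears in the known-pairs set."""
    return POSITIVE if (c.cell_type, c.cytokine) in KNOWN_PAIRS else ABSTAIN

def lf_secretion_pattern(c: Candidate) -> int:
    """Text pattern: vote positive when a secretion verb is present."""
    s = c.sentence.lower()
    return POSITIVE if "secrete" in s or "produce" in s else ABSTAIN

def lf_negation(c: Candidate) -> int:
    """Heuristic: vote negative when the sentence is negated."""
    s = c.sentence.lower()
    return NEGATIVE if "did not" in s or "no effect" in s else ABSTAIN

LABELING_FUNCTIONS = [lf_distant_supervision, lf_secretion_pattern, lf_negation]

def soft_label(c: Candidate) -> float:
    """Fraction of non-abstaining votes that are positive (0.5 if all abstain).

    This unweighted average is a simplification, not a generative model.
    """
    votes = [v for v in (lf(c) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return 0.5
    return sum(1 for v in votes if v == POSITIVE) / len(votes)

def noise_aware_loss(p: float, q: float, eps: float = 1e-12) -> float:
    """Cross-entropy between a soft label p and a predicted probability q."""
    q = min(max(q, eps), 1.0 - eps)
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

candidates = [
    Candidate("Th17 cells secrete IL-17 upon stimulation.", "Th17", "IL-17"),
    Candidate("Th1 cells did not secrete IL-4.", "Th1", "IL-4"),
    Candidate("Treg cells were analyzed alongside IL-2.", "Treg", "IL-2"),
]
soft_labels = [soft_label(c) for c in candidates]

for c, p in zip(candidates, soft_labels):
    print(f"{c.cell_type} -> {c.cytokine}: soft label {p:.2f}")
```

The end classifier is trained against the soft labels rather than hard 0/1 targets, so candidates on which the sources disagree or abstain contribute a weaker, more uncertain signal than candidates on which they agree.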

Resources

This Summary Notebook contains a rolling account of many details such as how documents were selected, what labeling functions were developed, tokenization challenges, controlled vocabularies, preliminary classification performance results, etc.

An early draft of a pre-print is also available at Extracting T Cell Function and Differentiation Characteristics from the Biomedical Literature.

Additional links: