Welcome to the repository of HYFA (Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation).
Overview of HYFA
HYFA processes gene expression from a number of collected tissues (e.g. accessible tissues) and infers the transcriptomes of uncollected tissues.
HYFA Workflow
- The model receives as input a variable number of gene expression samples $x^{(k)}_i$ corresponding to the collected tissues $k \in \mathcal{T}(i)$ of a given individual $i$. The samples $x^{(k)}_i$ are fed through an encoder that computes low-dimensional representations $e^{(k)}_{ij}$ for each metagene $j \in 1 .. M$. A metagene is a latent, low-dimensional representation that captures certain gene expression patterns of the high-dimensional input sample.
- These representations are then used as hyperedge features in a message passing neural network that operates on a hypergraph. In the hypergraph representation, each hyperedge labelled with $e^{(k)}_{ij}$ connects an individual $i$ with metagene $j$ and tissue $k$ if tissue $k$ was collected for individual $i$, i.e. $k \in \mathcal{T}(i)$. Through message passing, HYFA learns factorised representations of individual, tissue, and metagene nodes.
- To infer the gene expression of an uncollected tissue $u$ of individual $i$, the corresponding factorised representations are fed through a multilayer perceptron (MLP) that predicts low-dimensional features $e^{(u)}_{ij}$ for each metagene $j \in 1 .. M$. Finally, HYFA passes these latent representations through a decoder that recovers the uncollected gene expression sample $\hat{x}^{(u)}_{i}$.
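The hypergraph construction described above can be illustrated with a minimal sketch. It is not taken from the HYFA code base: the helper name `build_hyperedges`, the dictionary layout, and the toy donors are assumptions made purely for illustration. Each hyperedge is a triple (individual, metagene, tissue), and one is created for every metagene of every collected tissue.

```python
from typing import Dict, List, Tuple

Hyperedge = Tuple[str, int, str]  # (individual i, metagene j, tissue k)


def build_hyperedges(collected: Dict[str, List[str]], n_metagenes: int) -> List[Hyperedge]:
    """Enumerate hyperedges (i, j, k) for every collected tissue k in T(i).

    Hypothetical helper for illustration; not part of the HYFA code base.
    """
    edges = []
    for individual, tissues in collected.items():
        for tissue in tissues:
            for metagene in range(n_metagenes):
                edges.append((individual, metagene, tissue))
    return edges


# Toy example: T(i) for two donors, with M = 3 metagenes.
collected_tissues = {
    "INDIVIDUAL1": ["heart", "lung"],
    "INDIVIDUAL2": ["kidney"],
}
hyperedges = build_hyperedges(collected_tissues, n_metagenes=3)
print(len(hyperedges))  # 3 collected samples x 3 metagenes = 9 hyperedges
```

In the model itself each hyperedge would additionally carry the feature vector $e^{(k)}_{ij}$ produced by the encoder; the sketch only enumerates the connectivity.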
git clone https://github.com/rvinas/HYFA.git
pip install -r requirements.txt
The installation typically takes a few minutes.
To download the processed GTEx data, please follow these steps:
wget -O data/GTEx_data.csv.zip https://figshare.com/ndownloader/files/40208074
wget -O data/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
unzip data/GTEx_data.csv.zip -d data
To download the pre-trained model, please run this command:
wget -O data/normalised_model_default.pth https://figshare.com/ndownloader/files/40208551
Prepare your dataset:
- train_gtex.py loads the gene expression dataset from a CSV file (GTEX_FILE). As in the example below, each row is one sample, the first column holds the donor identifier, and the gene columns hold expression values. An extra column, tissue, denotes the tissue from which the sample was collected. The combination of donor and tissue identifier is unique.
- Donor metadata is loaded from a separate CSV file (METADATA_FILE; see function GTEx_metadata in train_gtex.py). Rows correspond to donors and columns to covariates. By default, the script expects at least two columns: AGE (integer) and SEX (integer).

Example of gene expression CSV file:
, GENE1, GENE2, GENE3, tissue
INDIVIDUAL1, 0.0, 0.1, 0.2, heart
INDIVIDUAL1, 0.0, 0.1, 0.2, lung
INDIVIDUAL1, 0.0, 0.1, 0.2, breast
INDIVIDUAL2, 0.0, 0.1, 0.2, kidney
INDIVIDUAL3, 0.0, 0.1, 0.2, kidney
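
The requirement that each donor and tissue combination be unique can be checked before training. The following is a minimal sketch using only the Python standard library; the helper name `find_duplicate_samples` is hypothetical, and the inline CSV text imitates the format of the example rather than real GTEx data.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple


def find_duplicate_samples(csv_text: str) -> List[Tuple[str, str]]:
    """Return (donor, tissue) pairs that occur more than once.

    Hypothetical helper; assumes the first column holds donor identifiers
    and a column named 'tissue' holds the tissue label.
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    tissue_idx = header.index("tissue")
    counts = Counter(
        (row[0].strip(), row[tissue_idx].strip()) for row in reader if row
    )
    return [pair for pair, n in counts.items() if n > 1]


valid_csv = """, GENE1, GENE2, GENE3, tissue
INDIVIDUAL1, 0.0, 0.1, 0.2, heart
INDIVIDUAL1, 0.0, 0.1, 0.2, lung
INDIVIDUAL2, 0.0, 0.1, 0.2, kidney
"""
# Appending a second heart sample for INDIVIDUAL1 violates uniqueness.
invalid_csv = valid_csv + "INDIVIDUAL1, 0.3, 0.1, 0.2, heart\n"

print(find_duplicate_samples(valid_csv))
print(find_duplicate_samples(invalid_csv))
```

The first call should report no duplicates, while the second should flag the repeated INDIVIDUAL1 heart sample.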
Example of metadata CSV file:
, AGE, SEX
INDIVIDUAL1, 34, 0
INDIVIDUAL2, 55, 1
INDIVIDUAL3, 49, 1
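
A metadata file in this format can be sanity-checked with a short standard-library sketch. The function `parse_metadata` below is a hypothetical helper, not the GTEx_metadata function from train_gtex.py; it only assumes that the first column holds donor identifiers and that AGE and SEX are integer-valued.

```python
import csv
import io
from typing import Dict


def parse_metadata(csv_text: str) -> Dict[str, Dict[str, int]]:
    """Parse donor metadata and require integer AGE and SEX columns.

    Hypothetical helper for illustration; not part of the HYFA code base.
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    for required in ("AGE", "SEX"):
        if required not in header:
            raise ValueError(f"missing required column: {required}")
    age_idx, sex_idx = header.index("AGE"), header.index("SEX")
    metadata = {}
    for row in reader:
        if not row:
            continue
        # int() raises ValueError if a value is not an integer literal.
        metadata[row[0].strip()] = {
            "AGE": int(row[age_idx]),
            "SEX": int(row[sex_idx]),
        }
    return metadata


metadata_csv = """, AGE, SEX
INDIVIDUAL1, 34, 0
INDIVIDUAL2, 55, 1
INDIVIDUAL3, 49, 1
"""
metadata = parse_metadata(metadata_csv)
print(metadata["INDIVIDUAL2"])
```

A missing column or a non-integer value raises `ValueError`, which surfaces formatting problems before the training script is run.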
See the notebook hyfa_tutorial.ipynb for an overview of the data format and main features of HYFA.
Run the script train_gtex.py to train HYFA. This uses the default hyperparameters from config/default.yaml. After training, the model is stored in your current working directory. We recommend training the model on a GPU machine (training takes between 15 and 30 minutes on an NVIDIA TITAN Xp).
Once the model is trained, evaluate your results via the notebook evaluate_GTEx_v8_normalised.ipynb.
- hyfa_tutorial.ipynb: Tutorial of the main features of HYFA
- train_gtex.py: Main script to train the multi-tissue imputation model on normalised GTEx data
- evaluate_GTEx_v8_normalised.ipynb: Analysis of multi-tissue imputation quality on normalised data (i.e. model trained via train_gtex.py)
- evaluate_GTEx_v9_signatures_normalised.ipynb: Analysis of cell-type signature imputation (i.e. fine-tunes model on GTEx-v9)
- src/data.py: Data object encapsulating multi-tissue gene expression
- src/dataset.py: Dataset that takes care of processing the data
- src/data_utils.py: Data utilities
- src/hnn.py: Hypergraph neural network
- src/hypergraph_layer.py: Message passing on hypergraph
- src/hnn_utils.py: Hypergraph model utilities
- src/metagene_encoders.py: Model transforming gene expression to metagene values
- src/metagene_decoders.py: Model transforming metagene values to gene expression
- src/train_utils.py: Train/eval loops
- src/distribions.py: Count data distributions
- src/losses.py: Loss functions for different data likelihoods
- src/pathway_utils.py: Utilities to retrieve KEGG pathways
- src/ct_signature_utils.py: Utilities for inferring cell-type signatures

If you use this code for your research, please cite our paper:
@article{vinas2023hypergraph,
title={Hypergraph factorization for multi-tissue gene expression imputation},
author={Vi{\~n}as, Ramon and Joshi, Chaitanya K and Georgiev, Dobrik and Lin, Phillip and Dumitrascu, Bianca and Gamazon, Eric R and Li{\`o}, Pietro},
journal={Nature Machine Intelligence},
pages={1--15},
year={2023},
publisher={Nature Publishing Group UK London}
}