Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation

Welcome to the repository of Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation — HYFA.

Overview of HYFA

HYFA processes gene expression from a number of collected tissues (e.g. accessible tissues) and infers the transcriptomes of uncollected tissues.

HYFA Workflow

The model receives as input a variable number of gene expression samples $x^{(k)}_i$ corresponding to the collected tissues $k \in \mathcal{T}(i)$ of a given individual $i$. The samples $x^{(k)}_i$ are fed through an encoder that computes low-dimensional representations $e^{(k)}_{ij}$ for each metagene $j \in 1 .. M$. A metagene is a latent, low-dimensional representation that captures certain gene expression patterns of the high-dimensional input sample.

These representations are then used as hyperedge features in a message passing neural network that operates on a hypergraph. In the hypergraph representation, each hyperedge labelled with $e^{(k)}_{ij}$ connects an individual $i$ with metagene $j$ and tissue $k$ if tissue $k$ was collected for individual $i$, i.e. $k \in \mathcal{T}(i)$. Through message passing, HYFA learns factorised representations of individual, tissue, and metagene nodes.
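
The hypergraph structure described above can be illustrated with a minimal sketch. It only enumerates the hyperedges as (individual, tissue, metagene) triples and does not reproduce the data structures in src/; the function name `build_hyperedges` and the toy identifiers are hypothetical.

```python
from itertools import product


def build_hyperedges(collected, num_metagenes):
    """Return (individual, tissue, metagene) triples for all collected samples.

    `collected` maps each individual i to its set of collected tissues T(i).
    """
    hyperedges = []
    for individual, tissues in collected.items():
        # One hyperedge per (collected tissue, metagene) pair of this individual.
        for tissue, metagene in product(sorted(tissues), range(num_metagenes)):
            hyperedges.append((individual, tissue, metagene))
    return hyperedges


# Toy example (hypothetical identifiers): two individuals, M = 3 metagenes.
collected = {"INDIVIDUAL1": {"heart", "lung"}, "INDIVIDUAL2": {"kidney"}}
hyperedges = build_hyperedges(collected, num_metagenes=3)
print(len(hyperedges))  # (2 + 1) collected tissues x 3 metagenes = 9
```

Each triple would carry the encoder output $e^{(k)}_{ij}$ as its feature vector; tissues that were not collected for an individual contribute no hyperedges.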

To infer the gene expression of an uncollected tissue $u$ of individual $i$, the corresponding factorised representations are fed through a multilayer perceptron (MLP) that predicts low-dimensional features $e^{(u)}_{ij}$ for each metagene $j \in 1 .. M$. Finally, HYFA passes these latent representations through a decoder that recovers the uncollected gene expression sample $\hat{x}^{(u)}_i$.
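
As a rough sketch of the inference step, the snippet below assembles one MLP input per metagene from the three node representations. Combining them by concatenation is an assumption made for illustration, and the name `mlp_inputs` and the placeholder embedding values are hypothetical; see src/hnn.py for how the model actually combines them.

```python
def mlp_inputs(individual_emb, tissue_emb, metagene_embs):
    """Build one MLP input vector per metagene for a single (individual, tissue) query."""
    # Assumption: the three factorised representations are concatenated.
    return [individual_emb + tissue_emb + metagene_emb for metagene_emb in metagene_embs]


# Placeholder node representations (hypothetical values, 2 dimensions each).
individual_emb = [0.1, 0.2]                            # individual i
tissue_emb = [0.3, 0.4]                                # uncollected tissue u
metagene_embs = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]   # M = 3 metagenes

inputs = mlp_inputs(individual_emb, tissue_emb, metagene_embs)
print(len(inputs), len(inputs[0]))  # 3 query vectors of length 6
```

An MLP applied to each of these vectors would yield the features $e^{(u)}_{ij}$, which the decoder then maps back to gene expression space.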

Installation

Clone this repository: git clone https://github.com/rvinas/HYFA.git
Install the dependencies via the following command: pip install -r requirements.txt

The installation typically takes a few minutes.

Data download

To download the processed GTEx data, please follow these steps:

wget -O data/GTEx_data.csv.zip https://figshare.com/ndownloader/files/40208074
wget -O data/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
unzip data/GTEx_data.csv.zip -d data

To download the pre-trained model, please run this command:

wget -O data/normalised_model_default.pth https://figshare.com/ndownloader/files/40208551
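
Because wget -O can leave an empty file behind when a download fails, a quick check before training may save time. The following is a minimal sketch that looks for the three files named in the commands above; the helper `missing_downloads` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_FILES = [
    "GTEx_data.csv.zip",
    "GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt",
    "normalised_model_default.pth",
]


def missing_downloads(data_dir="data"):
    """Return the expected files that are absent or empty in data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = data_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


print(missing_downloads())  # an empty list means all three files are present
```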

Running the model

Prepare your dataset:
- By default, the script train_gtex.py loads a dataset from a CSV file (GTEX_FILE) with the following format:
  - Columns are genes and rows are samples.
  - Entries correspond to normalised gene expression values.
  - The first row contains gene identifiers.
  - The first column contains donor identifiers. The file might contain multiple rows per donor.
  - An extra column tissue denotes the tissue from which the sample was collected. The combination of donor and tissue identifier is unique.
- The metadata is loaded from a separate CSV file (METADATA_FILE; see function GTEx_metadata in train_gtex.py). Rows correspond to donors and columns to covariates. By default, the script expects at least two columns: AGE (integer) and SEX (integer).
Example of gene expression CSV file:
```
 , GENE1, GENE2, GENE3, tissue
 INDIVIDUAL1, 0.0, 0.1, 0.2, heart
 INDIVIDUAL1, 0.0, 0.1, 0.2, lung
 INDIVIDUAL1, 0.0, 0.1, 0.2, breast
 INDIVIDUAL2, 0.0, 0.1, 0.2, kidney
 INDIVIDUAL3, 0.0, 0.1, 0.2, kidney
```
Example of metadata CSV file:
```
, AGE, SEX
INDIVIDUAL1, 34, 0
INDIVIDUAL2, 55, 1
INDIVIDUAL3, 49, 1
```
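The format rules above can be checked before training with a short script. This is a minimal sketch using only the standard library; the function `check_format` is hypothetical (it is not part of train_gtex.py), and requiring a metadata row for every donor is an assumption.

```python
import csv
import io

EXPRESSION_CSV = """\
,GENE1,GENE2,GENE3,tissue
INDIVIDUAL1,0.0,0.1,0.2,heart
INDIVIDUAL1,0.0,0.1,0.2,lung
INDIVIDUAL2,0.0,0.1,0.2,kidney
"""
METADATA_CSV = """\
,AGE,SEX
INDIVIDUAL1,34,0
INDIVIDUAL2,55,1
"""


def check_format(expression_file, metadata_file):
    """Return a list of format problems (empty if none were found)."""
    problems = []
    rows = [r for r in csv.reader(expression_file, skipinitialspace=True) if r]
    header, samples = rows[0], rows[1:]
    if "tissue" not in header:
        return ["expression file has no 'tissue' column"]
    tissue_col = header.index("tissue")
    seen = set()
    for row in samples:
        key = (row[0], row[tissue_col])  # (donor, tissue) must be unique
        if key in seen:
            problems.append(f"duplicate donor/tissue pair: {key}")
        seen.add(key)
    meta = [r for r in csv.reader(metadata_file, skipinitialspace=True) if r]
    for column in ("AGE", "SEX"):
        if column not in meta[0]:
            problems.append(f"metadata file has no '{column}' column")
    # Assumption: every donor with expression data needs a metadata row.
    meta_donors = {row[0] for row in meta[1:]}
    for donor in sorted({d for d, _ in seen} - meta_donors):
        problems.append(f"donor without metadata: {donor}")
    return problems


problems = check_format(io.StringIO(EXPRESSION_CSV), io.StringIO(METADATA_CSV))
print(problems)  # []
```

To check real files, pass open file handles (for example `open(path, newline="")`) instead of the `io.StringIO` objects.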
See the notebook hyfa_tutorial.ipynb for an overview of the data format and main features of HYFA.
Run the script train_gtex.py to train HYFA. This uses the default hyperparameters from config/default.yaml. After training, the model will be stored in your current working directory. We recommend training the model on a GPU machine (training takes between 15 and 30 minutes on an NVIDIA TITAN Xp).
Once the model is trained, evaluate your results via the notebook evaluate_GTEx_v8_normalised.ipynb.

Quick reference of main files

hyfa_tutorial.ipynb: Tutorial of the main features of HYFA.
train_gtex.py: Main script to train the multi-tissue imputation model on normalised GTEx data
evaluate_GTEx_v8_normalised.ipynb: Analysis of multi-tissue imputation quality on normalised data (i.e. model trained via train_gtex.py)
evaluate_GTEx_v9_signatures_normalised.ipynb: Analysis of cell-type signature imputation (i.e. model fine-tuned on GTEx-v9)

Data

src/data.py: Data object encapsulating multi-tissue gene expression
src/dataset.py: Dataset that takes care of processing the data
src/data_utils.py: Data utilities

Model

src/hnn.py: Hypergraph neural network
src/hypergraph_layer.py: Message passing on hypergraph
src/hnn_utils.py: Hypergraph model utilities
src/metagene_encoders.py: Model transforming gene expression to metagene values
src/metagene_decoders.py: Model transforming metagene values to gene expression

Training

src/train_utils.py: Train/eval loops
src/distribions.py: Count data distributions
src/losses.py: Loss functions for different data likelihoods

Other utils

src/pathway_utils.py: Utilities to retrieve KEGG pathways
src/ct_signature_utils.py: Utilities for inferring cell-type signatures

Citation

If you use this code for your research, please cite our paper:

@article{vinas2023hypergraph,
  title={Hypergraph factorization for multi-tissue gene expression imputation},
  author={Vi{\~n}as, Ramon and Joshi, Chaitanya K and Georgiev, Dobrik and Lin, Phillip and Dumitrascu, Bianca and Gamazon, Eric R and Li{\`o}, Pietro},
  journal={Nature Machine Intelligence},
  pages={1--15},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
