Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation

License: MIT. Python 3.8+.

Welcome to the repository of HYFA (Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation).

Overview of HYFA

HYFA processes gene expression from a number of collected tissues (e.g. accessible tissues) and infers the transcriptomes of uncollected tissues.

HYFA Workflow

  1. The model receives as input a variable number of gene expression samples $x^{(k)}_i$ corresponding to the collected tissues $k \in \mathcal{T}(i)$ of a given individual $i$. The samples $x^{(k)}_i$ are fed through an encoder that computes low-dimensional representations $e^{(k)}_{ij}$ for each metagene $j \in \{1, \dots, M\}$. A metagene is a latent, low-dimensional representation that captures certain gene expression patterns of the high-dimensional input sample.
  2. These representations are then used as hyperedge features in a message passing neural network that operates on a hypergraph. In the hypergraph representation, each hyperedge labelled with $e^{(k)}_{ij}$ connects an individual $i$ with metagene $j$ and tissue $k$ if tissue $k$ was collected for individual $i$, i.e. $k \in \mathcal{T}(i)$. Through message passing, HYFA learns factorised representations of individual, tissue, and metagene nodes.
  3. To infer the gene expression of an uncollected tissue $u$ of individual $i$, the corresponding factorised representations are fed through a multilayer perceptron (MLP) that predicts low-dimensional features $e^{(u)}_{ij}$ for each metagene $j \in \{1, \dots, M\}$. HYFA finally processes these latent representations through a decoder that recovers the uncollected gene expression sample $\hat{x}^{(u)}_i$.
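
To make the hypergraph in step 2 concrete, the following is a minimal sketch of how the hyperedges could be enumerated: one (individual, tissue, metagene) triple for every collected tissue and every metagene. The function name `build_hyperedges` and the toy cohort are hypothetical and chosen for illustration; this is not the repository's implementation, and the hyperedge features $e^{(k)}_{ij}$ produced by the encoder are left out.

```python
from itertools import product


def build_hyperedges(collected, n_metagenes):
    """Return one (individual, tissue, metagene) triple per hyperedge.

    `collected` maps each individual i to its collected tissues T(i).
    """
    edges = []
    for individual, tissues in collected.items():
        # Each collected tissue is linked to all M metagenes.
        for tissue, metagene in product(tissues, range(n_metagenes)):
            edges.append((individual, tissue, metagene))
    return edges


# Hypothetical toy cohort (assumption, not GTEx data).
collected = {
    "INDIVIDUAL1": ["heart", "lung"],
    "INDIVIDUAL2": ["kidney"],
}
hyperedges = build_hyperedges(collected, n_metagenes=3)
print(len(hyperedges))  # (2 + 1) collected samples x 3 metagenes = 9
```

Tissues that were not collected for an individual contribute no hyperedges, which is why the model needs step 3 to predict their features.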

Installation

  1. Clone this repository: git clone https://github.com/rvinas/HYFA.git
  2. Install the dependencies via the following command: pip install -r requirements.txt

The installation typically takes a few minutes.

Data download

To download the processed GTEx data, please follow these steps:

wget -O data/GTEx_data.csv.zip https://figshare.com/ndownloader/files/40208074
wget -O data/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
unzip data/GTEx_data.csv.zip -d data

To download the pre-trained model, please run this command:

wget -O data/normalised_model_default.pth https://figshare.com/ndownloader/files/40208551
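
Since `wget -O` writes to the exact path given, it can help to confirm that all three downloads landed in `data/` before training. The snippet below is a small sketch that only checks the file names used in the commands above; the helper `missing_downloads` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_FILES = [
    "GTEx_data.csv.zip",
    "GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt",
    "normalised_model_default.pth",
]


def missing_downloads(data_dir="data"):
    """Return the expected files that are not present in `data_dir`."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_downloads()
    print("missing:", missing or "none")
```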

Running the model

  1. Prepare your dataset:

    • By default, the script train_gtex.py loads a dataset from a CSV file (GTEX_FILE) with the following format:
      • Columns are genes and rows are samples.
      • Entries correspond to normalised gene expression values.
      • The first row contains gene identifiers.
      • The first column contains donor identifiers. The file might contain multiple rows per donor.
      • An extra column tissue denotes the tissue from which the sample was collected. The combination of donor and tissue identifier is unique.
    • The metadata is loaded from a separate CSV file (METADATA_FILE; see function GTEx_metadata in train_gtex.py). Rows correspond to donors and columns to covariates. By default, the script expects at least two columns: AGE (integer) and SEX (integer).

    Example of gene expression CSV file:

     , GENE1, GENE2, GENE3, tissue
     INDIVIDUAL1, 0.0, 0.1, 0.2, heart
     INDIVIDUAL1, 0.0, 0.1, 0.2, lung
     INDIVIDUAL1, 0.0, 0.1, 0.2, breast
     INDIVIDUAL2, 0.0, 0.1, 0.2, kidney
     INDIVIDUAL3, 0.0, 0.1, 0.2, kidney

    Example of metadata CSV file:

    , AGE, SEX
    INDIVIDUAL1, 34, 0
    INDIVIDUAL2, 55, 1
    INDIVIDUAL3, 49, 1

    See the notebook hyfa_tutorial.ipynb for an overview of the data format and main features of HYFA.

  2. Run the script train_gtex.py to train HYFA. This uses the default hyperparameters from config/default.yaml. After training, the model will be stored in your current working directory. We recommend training the model on a GPU machine (training takes between 15 and 30 minutes on an NVIDIA TITAN Xp).

  3. Once the model is trained, evaluate your results via the notebook evaluate_GTEx_v8_normalised.ipynb.
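
The gene expression format described in step 1 can be checked before training. The following is a minimal sketch using only the standard library; the function `check_expression_csv` is hypothetical and not part of the repository. It assumes the layout listed in step 1: a header row with gene identifiers, donor identifiers in the first column, a `tissue` column, numeric expression values, and unique donor and tissue combinations.

```python
import csv
import io


def check_expression_csv(handle):
    """Validate the expression layout and return the (donor, tissue) pairs."""
    reader = csv.reader(handle, skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    if "tissue" not in header:
        raise ValueError("missing 'tissue' column")
    tissue_col = header.index("tissue")
    seen = set()
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"wrong number of fields: {row}")
        key = (row[0].strip(), row[tissue_col].strip())
        if key in seen:
            raise ValueError(f"duplicate donor/tissue pair: {key}")
        seen.add(key)
        # Every remaining entry must parse as a number.
        for col, value in enumerate(row):
            if col not in (0, tissue_col):
                float(value)
    return seen


# Toy example mirroring the format above (values are placeholders).
example = """\
,GENE1,GENE2,GENE3,tissue
INDIVIDUAL1,0.0,0.1,0.2,heart
INDIVIDUAL1,0.0,0.1,0.2,lung
INDIVIDUAL2,0.0,0.1,0.2,kidney
"""
pairs = check_expression_csv(io.StringIO(example))
print(len(pairs))  # 3
```

A row whose donor identifier is followed by a period instead of a comma ends up with too few fields, so the check raises a `ValueError` rather than silently misreading the sample.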

Quick reference of main files

Data

Model

Training

Other utils

Citation

If you use this code for your research, please cite our paper:

@article{vinas2023hypergraph,
  title={Hypergraph factorization for multi-tissue gene expression imputation},
  author={Vi{\~n}as, Ramon and Joshi, Chaitanya K and Georgiev, Dobrik and Lin, Phillip and Dumitrascu, Bianca and Gamazon, Eric R and Li{\`o}, Pietro},
  journal={Nature Machine Intelligence},
  pages={1--15},
  year={2023},
  publisher={Nature Publishing Group UK London}
}