
DEEP*HLA

DEEP*HLA is an HLA allelic imputation method based on a multi-task convolutional neural network implemented in Python.

DEEP*HLA receives pre-phased SNV data and outputs genotype dosages of binary HLA alleles.
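As a rough illustration of what a genotype dosage of a binary HLA allele means, the sketch below adds up per-haplotype allele probabilities into a per-individual dosage between 0 and 2. The function name, the input layout, and the probabilities are assumptions made up for this example; they are not DEEP*HLA's internal code or output format.

```python
def allele_dosage(hap1_probs, hap2_probs):
    """Sum per-haplotype allele probabilities into dosages in the range 0..2."""
    alleles = sorted(set(hap1_probs) | set(hap2_probs))
    return {a: hap1_probs.get(a, 0.0) + hap2_probs.get(a, 0.0) for a in alleles}


# Hypothetical probabilities for the two haplotypes of one individual
hap1 = {"HLA_A*24:02": 0.9, "HLA_A*02:01": 0.1}
hap2 = {"HLA_A*02:01": 1.0}

dosages = allele_dosage(hap1, hap2)
for allele, dosage in dosages.items():
    print(f"{allele}\t{dosage:.2f}")
```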

In DEEP*HLA, HLA imputation is performed in two steps:

(1) model training with an HLA reference panel

(2) imputation with a trained model.

Publication/Citation

The study of DEEP*HLA is described in the manuscript.

Please cite this paper if you use DEEP*HLA or any material in this repository.

Requirements

DEEP*HLA was tested on the versions in parentheses, so we do not guarantee that it will work on different versions.

Installation

Clone this repository as follows.

git clone https://github.com/tatsuhikonaito/DEEP-HLA
cd ./DEEP-HLA

Usage

0. Original file formats

DEEP*HLA needs files in its own formats for the model configuration (.model.json) and the HLA information (.hla.json).

1. Model training

Run train.py on a command-line interface as follows.

Sample files should contain only the MHC region extracted for HLA imputation (typically chr6:29-34 Mb or chr6:24-36 Mb). In addition, the strands must be consistent between the sample and reference data.
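One way to check that a sample .bim file is restricted to the MHC region is sketched below. It relies on the standard PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2). The function name, the default 29-34 Mb window, and the example variants are assumptions for illustration. If your file codes the chromosome as "chr6" rather than "6", pass that value as `chrom`.

```python
import io


def filter_bim_to_mhc(lines, chrom="6", start_bp=29_000_000, end_bp=34_000_000):
    """Yield .bim lines whose variant lies in chrom:start_bp-end_bp (inclusive)."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # PLINK .bim columns: chrom, variant ID, cM, bp position, allele 1, allele 2
        if fields[0] == chrom and start_bp <= int(fields[3]) <= end_bp:
            yield line


# Made-up variants standing in for a real .bim file
example_bim = io.StringIO(
    "6\trs1\t0\t28999999\tA\tG\n"
    "6\trs2\t0\t29500000\tC\tT\n"
    "6\trs3\t0\t34000001\tG\tA\n"
    "7\trs4\t0\t30000000\tT\tC\n"
)
kept = [line.split()[1] for line in filter_bim_to_mhc(example_bim)]
print(kept)
```

This only filters the variant list; the matching genotype or haplotype files would need to be subset with the same variants, and strand consistency has to be checked separately.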

HLA reference data are currently supported only in Beagle-phased format.

$ python train.py --ref REFERENCE (.bgl.phased/.bim) --sample SAMPLE (.bim) --model MODEL (.model.json) --hla HLA (.hla.json) --model-dir MODEL_DIR
Arguments and options
| Option name | Description | Required | Default |
| --- | --- | --- | --- |
| --ref | HLA reference data (.bgl.phased and .bim format). | Yes | None |
| --sample | Sample SNP data of the MHC region (.bim format). | Yes | None |
| --model | Model configuration (.model.json format). | Yes | None |
| --hla | HLA information of the reference data (.hla.json format). | Yes | None |
| --model-dir | Directory for saving trained models. | No | {BASE_DIR}/model |
| --num-epoch | Number of epochs to train. | No | 100 |
| --patience | Patience for early stopping. If you prefer no early stopping, specify the same value as --num-epoch. | No | 16 |
| --val-split | Fraction of the data held out for validation. | No | 0.1 |
| --max-digit | Maximum resolution of alleles to impute ("2-digit", "4-digit", or "6-digit"). | No | "4-digit" |
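To make the interaction between --patience and --num-epoch concrete, here is a minimal sketch of a common early-stopping rule: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The function and the loss values are hypothetical, and the exact rule used inside train.py may differ.

```python
def epochs_run(val_losses, patience=16):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop early
    return len(val_losses)  # ran all epochs


# Made-up validation losses: the best value is reached at epoch 3
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]
stopped_at = epochs_run(losses, patience=3)
print(stopped_at)
```

Under this rule, a patience equal to the total number of epochs can never trigger, because the first epoch always counts as an improvement. That matches the advice to set --patience to the same value as --num-epoch when early stopping is not wanted.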
Outputs

2. Imputation

After you have finished training a model, run impute.py as follows.

Phased sample data are supported in Beagle-phased format and Oxford haps format (SHAPEIT, Eagle, etc.).
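For readers unfamiliar with the Oxford haps layout, the sketch below parses a single row: five leading columns (chromosome, variant ID, position, allele coded 0, allele coded 1) followed by two 0/1 haplotype columns per sample. The function name and the example row are made up for illustration, and this is not the parser used by impute.py.

```python
def parse_haps_line(line):
    """Split one .haps row into variant info and per-sample haplotype pairs."""
    fields = line.split()
    chrom, snp_id, pos, allele0, allele1 = fields[:5]
    codes = fields[5:]
    if len(codes) % 2 != 0:
        raise ValueError("expected two haplotype columns per sample")
    alleles = (allele0, allele1)
    # Translate the 0/1 codes into alleles, two haplotypes per sample
    pairs = [
        (alleles[int(codes[i])], alleles[int(codes[i + 1])])
        for i in range(0, len(codes), 2)
    ]
    return {"chrom": chrom, "id": snp_id, "pos": int(pos), "haplotypes": pairs}


# Hypothetical row with two samples
row = parse_haps_line("6 rs123 29910000 A G 0 1 1 1")
print(row["haplotypes"])
```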

$ python impute.py --sample SAMPLE (.bgl.phased (.haps)/.bim/.fam) --model MODEL (.model.json) --hla HLA (.hla.json) --model-dir MODEL_DIR --out OUT
Arguments and options
| Option name | Description | Required | Default |
| --- | --- | --- | --- |
| --sample | Sample SNP data of the MHC region (.bgl.phased or .haps, .bim, and .fam format). | Yes | None |
| --phased-type | File format of sample phased file ("bgl" or "haps"). | No | "bgl" |
| --model | Model configuration (.model.json and .bim format). | Yes | None |
| --hla | HLA information of the reference data (.hla.json format). | Yes | None |
| --model-dir | Directory where trained models are saved. | No | {BASE_DIR}/model |
| --out | Prefix of output files. | Yes | None |
| --max-digit | Maximum resolution of alleles to impute ("2-digit", "4-digit", or "6-digit"). | No | "4-digit" |
| --mc-dropout | Whether to calculate uncertainty by Monte Carlo dropout (True or False). | No | False |
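The general idea behind Monte Carlo dropout can be sketched as follows: dropout stays active at prediction time, the same input is passed through the model many times, and the spread of the outputs serves as an uncertainty estimate. The toy one-layer model, its weights, and the function names below are all hypothetical; this shows the technique in general, not how impute.py implements --mc-dropout.

```python
import random
import statistics


def toy_forward(x, rng, weights=(0.2, 0.5, 0.3), drop_p=0.5):
    """Hypothetical one-layer model; each input is dropped with probability drop_p."""
    # Inverted dropout: surviving inputs are rescaled by 1 / (1 - drop_p)
    kept = [
        w * xi / (1 - drop_p) if rng.random() >= drop_p else 0.0
        for w, xi in zip(weights, x)
    ]
    return sum(kept)


def mc_dropout_predict(forward, x, n_passes=50, seed=0):
    """Run `forward` repeatedly with dropout on; return the mean and variance."""
    rng = random.Random(seed)
    outputs = [forward(x, rng) for _ in range(n_passes)]
    return statistics.mean(outputs), statistics.pvariance(outputs)


mean, variance = mc_dropout_predict(toy_forward, [1.0, 1.0, 1.0])
print(f"mean={mean:.3f} variance={variance:.3f}")
```

A larger variance across passes indicates a less certain prediction for that input.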
Outputs

Example

Here, we demonstrate practical usage with the Pan-Asian reference panel as an example.

The trained models have already been stored in Pan-Asian/model, so you can skip the model training process.

0. Data preparation

First, download the Pan-Asian reference panel data and the example data from the SNP2HLA download site.

Pre-phase the example data with any phasing software (SHAPEIT, Eagle, Beagle, etc.) to generate a 1958BC.haps (or .bgl.phased) file.

Put them into the Pan-Asian directory.

DEEP-HLA/
 └ Pan-Asian/
   ├ Pan-Asian_REF.bgl.phased
   ├ Pan-Asian_REF.bim
   ├ Pan-Asian_REF.config.json
   ├ Pan-Asian_REF.info.json
   ├ 1958BC.haps (or .bgl.phased)
   ├ 1958BC.bim
   ├ 1958BC.fam
   └ model/
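Before running the scripts, it can help to confirm that the input files are in place. The helper below is a hypothetical convenience check, not part of DEEP*HLA; the file names are taken from the directory listing above, and the .json files are left out of the list for brevity. The demonstration runs against a temporary directory instead of a real Pan-Asian folder.

```python
from pathlib import Path
import tempfile

EXPECTED = [
    "Pan-Asian_REF.bgl.phased",
    "Pan-Asian_REF.bim",
    "1958BC.bim",
    "1958BC.fam",
]


def missing_inputs(directory, expected=EXPECTED, phased_stem="1958BC"):
    """Return the names of expected input files that are absent from `directory`."""
    root = Path(directory)
    missing = [name for name in expected if not (root / name).is_file()]
    # The phased sample file may be in either supported format
    phased = [root / f"{phased_stem}{ext}" for ext in (".haps", ".bgl.phased")]
    if not any(path.is_file() for path in phased):
        missing.append(f"{phased_stem}.haps (or .bgl.phased)")
    return missing


# Demonstration: create every expected file except the last one
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED[:-1]:
        (Path(tmp) / name).touch()
    report = missing_inputs(tmp)
print(report)
```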

1. Model training

We have already uploaded a trained model, so you can skip this step.

To train the model yourself, run train.py as follows. The files in the Pan-Asian/model directory will be overwritten.

$ python train.py --ref Pan-Asian/Pan-Asian_REF --sample Pan-Asian/1958BC --model Pan-Asian/Pan-Asian_REF --hla Pan-Asian/Pan-Asian_REF --model-dir Pan-Asian/model

2. Imputation

Run impute.py as follows.

$ python impute.py --sample Pan-Asian/1958BC --phased-type haps --model Pan-Asian/Pan-Asian_REF --hla Pan-Asian/Pan-Asian_REF --model-dir Pan-Asian/model --out Pan-Asian/1958BC

3. Imputation of amino acid polymorphisms

Run impute_aa.py as follows.

$ python impute_aa.py --dosage Pan-Asian/1958BC --aa-table Pan-Asian/Pan-Asian_REF --out Pan-Asian/1958BC
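A common way to derive amino acid dosages from HLA allele dosages is to sum, for each residue at a given position, the dosages of all alleles that carry it. The sketch below illustrates that idea only. The allele names, residues, dosages, and the dictionary layout are placeholders invented for this example; the actual structure of the amino acid table file and the logic in impute_aa.py may differ.

```python
def amino_acid_dosage(allele_dosages, residue_by_allele):
    """Sum allele dosages per amino acid residue at a single position."""
    totals = {}
    for allele, dosage in allele_dosages.items():
        residue = residue_by_allele.get(allele)
        if residue is None:
            continue  # allele missing from the lookup table
        totals[residue] = totals.get(residue, 0.0) + dosage
    return totals


# Placeholder alleles and residues for one hypothetical position
allele_dosages = {"ALLELE_1": 1.0, "ALLELE_2": 0.7, "ALLELE_3": 0.3}
residue_by_allele = {"ALLELE_1": "S", "ALLELE_2": "R", "ALLELE_3": "R"}

aa_dosages = amino_acid_dosage(allele_dosages, residue_by_allele)
print(aa_dosages)
```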

4. Other HLA reference panels

Please follow the application process to obtain the two reference panels used in our study.

Their related files for imputation (.model.json, .hla.json, and .aa_table.pickle) may be provided upon request.

Reference

DEEP*HLA uses MGDA-UB (Multiple Gradient Descent Algorithm - Upper Bound) for multi-task learning; that part of the source code is a modified version of MultiObjectiveOptimization.

Contact

For any questions, please contact Tatsuhiko Naito (tnaito@sg.med.osaka-u.ac.jp).

One advantage of DEEP*HLA is that model training can be done elsewhere, even without the sample genotypes. We may be able to tailor a DEEP*HLA model to your SNP data using our own or publicly available reference panels. If you are interested, please email us.