hwanglab / divine

Divine: Prioritizing Genes for Rare Mendelian Disease in Whole Exome Sequencing Data
Divine_workflow

Divine is designed to make routine molecular diagnosis with high-throughput whole exome sequencing (WES) data more efficient. It integrates a patient's phenotype(s) and the genetic variants from WES data with 30 sources of prior biological knowledge (e.g., Human Phenotype Ontology, Gene Ontology, pathway databases, protein-protein interaction networks) to prioritize potential disease-causing genes.

Website

https://github.com/hwanglab/divine

Tutorial

A tutorial is available that covers everything from installation to case studies.

Input

Algorithm

Output

Developer note

Setup

Prerequisite:

Python modules to be installed

Divine requires several Python modules. During setup, any that are missing are installed automatically.

Install

Download the Divine source code from GitHub:

$ git clone https://github.com/hwanglab/divine.git

Option 1: fresh install

A fresh install downloads 21 GB of database files, so please be patient.

$ setup.py --install

Option 2: upgrade an existing installation (Divine database, resources, examples, and third-party Python modules)

$ setup.py --install --update_db

Configuration

Uninstall

Option 1: uninstall only the Python modules installed as dependencies

$ setup.py --uninstall

Option 2: also uninstall the resource files

$ setup.py --uninstall --remove_db

Usage

Input: a text file containing HPO IDs

First, visit http://compbio.charite.de/phenomizer, https://hpo.jax.org/app, or https://mseqdr.org/search_phenotype.php. Enter the patient's phenotype terms, a description, or any keywords you consider important, and find the best-matching HPO IDs. Paste the HPO IDs, in the format HP:XXXXXXX (e.g., HP:0002307), into a text file, one per line, and save it (e.g., as P0001.hpo).

For example, an HPO file looks like this:

$ cat P0001.hpo
#my_patient_ID
HP:0002307
HP:0000639
HP:0001252
HP:0100543
HP:0002120
HP:0000708
HP:0001344
HP:0008872
HP:0000510
HP:0001513
HP:0006979
HP:0000752
HP:0012469
HP:0000577
HP:0001010
HP:0006887
HP:0002650
HP:0005469
HP:0002312
HP:0010808
HP:0002136
HP:0200085
HP:0002311
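
A malformed ID is easy to miss in a hand-edited file, so it can help to check the file before running Divine. The sketch below is not part of Divine. `read_hpo_ids` is a hypothetical helper, and it assumes that blank lines and lines starting with `#` (like `#my_patient_ID` above) can be skipped as comments.

```python
import re

# Pattern for the HP:XXXXXXX format described above (seven digits).
HPO_ID = re.compile(r"^HP:\d{7}$")


def read_hpo_ids(lines):
    """Return (valid_ids, rejected_lines) from the lines of an HPO file.

    Assumption: blank lines and lines starting with '#' are skipped.
    """
    valid, rejected = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if HPO_ID.match(line):
            valid.append(line)
        else:
            rejected.append(line)
    return valid, rejected


if __name__ == "__main__":
    sample = ["#my_patient_ID", "HP:0002307", "HP:0000639", "hp:12", ""]
    ids, bad = read_hpo_ids(sample)
    print("valid:", ids)
    print("rejected:", bad)
```

To check a real file, pass an open file handle instead of the list, for example `read_hpo_ids(open("P0001.hpo"))`.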

Input: a VCF file

Any VCF file that follows the standard format (e.g., https://samtools.github.io/hts-specs/VCFv4.2.pdf) is accepted. To generate a VCF file from FASTQ reads, refer to a variant caller such as GATK, samtools, or freebayes.
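
A quick header check can catch files that do not follow the standard layout. The sketch below is not part of Divine. `vcf_sample_ids` is a hypothetical helper that relies only on the header layout in the VCF specification: a first line starting with `##fileformat=VCF`, then a tab-delimited `#CHROM` line whose columns after `FORMAT` are the sample IDs.

```python
def vcf_sample_ids(lines):
    """Return the sample IDs declared on the #CHROM header line of a VCF.

    Raises ValueError if the ##fileformat line or the #CHROM line is missing.
    """
    first = True
    for raw in lines:
        line = raw.rstrip("\r\n")
        if first:
            if not line.startswith("##fileformat=VCF"):
                raise ValueError("first line is not a ##fileformat=VCF line")
            first = False
        if line.startswith("#CHROM"):
            columns = line.split("\t")
            # Columns 0-7 are fixed, column 8 is FORMAT, the rest are samples.
            return columns[9:]
    raise ValueError("no #CHROM header line found")


if __name__ == "__main__":
    header = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP0002",
    ]
    print(vcf_sample_ids(header))
```

For a multi-sample VCF, the returned list is also a convenient way to look up the exact proband ID to pass with `-p`.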

Output

Known disease matching by patient phenotypes

hpo_to_diseases.tsv: From an input HPO file, Divine ranks the diseases the patient is most likely to have. The output format is:

$ head -n3 hpo_to_disease.tsv
#disease_ID    genes    score[FunSimMax]
OMIM:101600    FGFR1,FGFR2    0.000911782
OMIM:101200    FGFR2    0.000674322
:
:
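
For downstream scripting, the disease table can be loaded into plain Python tuples. This is a minimal sketch, not code from Divine. `read_disease_scores` is a hypothetical helper, and it assumes the file is tab-delimited, that header lines start with `#`, and that multiple genes are comma-separated as in the sample above.

```python
import csv
import io


def read_disease_scores(handle):
    """Parse (disease_ID, [genes], score) tuples from a tab-separated handle.

    Assumptions: tab-delimited columns, '#'-prefixed header lines,
    comma-separated gene lists.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        disease_id, genes, score = fields[:3]
        rows.append((disease_id, genes.split(","), float(score)))
    return rows


if __name__ == "__main__":
    sample = io.StringIO(
        "#disease_ID\tgenes\tscore[FunSimMax]\n"
        "OMIM:101600\tFGFR1,FGFR2\t0.000911782\n"
        "OMIM:101200\tFGFR2\t0.000674322\n"
    )
    for disease_id, genes, score in read_disease_scores(sample):
        print(disease_id, genes, score)
```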

Gene rank

$ head -n6 gene_rank.tsv
#gene   predicted_score seed_score      gt_dmg_score    pheno_score     contain_known_pathogenic
FGFR2   0.00925436      1.60806e-06     0.00545116      0.00117607      NA
EVC     0.00634702      1.08469e-06     0.00981397      0.000439834     NA
VPS13B  0.00477042      8.25496e-07     0.00322986      0.00102016      NA
CCT4    0.0045629       7.92439e-07     0.00648582      0.000487018     NA
ZMYND11|ENSP00000452959 0.00405376      6.66455e-07     0.00745641      0.000356124     NA
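
To pull the top candidates out of the gene ranking programmatically, something like the following may be enough. It is a sketch under assumptions, not Divine code. `top_genes` is a hypothetical helper, and the file is assumed to be tab-delimited with the header line shown above (so the first column name is literally `#gene`).

```python
import csv
import io


def top_genes(handle, n=3):
    """Return the n highest (gene, predicted_score) pairs from gene_rank.tsv.

    Assumption: tab-delimited, with the header row shown above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    scored = [(row["#gene"], float(row["predicted_score"])) for row in reader]
    # Sort explicitly rather than relying on the file order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    sample = io.StringIO(
        "#gene\tpredicted_score\tseed_score\tgt_dmg_score\tpheno_score\tcontain_known_pathogenic\n"
        "EVC\t0.00634702\t1.08469e-06\t0.00981397\t0.000439834\tNA\n"
        "FGFR2\t0.00925436\t1.60806e-06\t0.00545116\t0.00117607\tNA\n"
    )
    for gene, score in top_genes(sample, n=2):
        print(gene, score)
```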

Gene enrichment from known diseases matched by patient phenotypes

$ head -n6 diseases_rank.tsv

Annotated VCF files

Prioritized genes and Microsoft Excel file

Log files

Cases

$ divine.py -q dir_to_the_hpo/P0001.hpo -o dir_to_output/P0001
$ divine.py -v dir_to_the_vcf/P0002.vcf -o dir_to_output/P0002
$ divine.py -q dir_to_the_hpo/P0003.hpo -v dir_to_the_vcf/P0003.vcf -o dir_to_output/P0003
$ divine.py -q dir_to_the_hpo/proband.hpo -v dir_to_the_vcf/multisample.vcf -f dir_to_ped/pedigree.ped -p proband_id -o dir_to_output/proband_id
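
The four cases above differ only in which inputs are supplied, so a batch script can assemble the command from whatever is available. The sketch below is hypothetical and not part of Divine. `build_divine_command` uses only the `-q`, `-v`, `-f`, `-p`, and `-o` options shown above, and it assumes that at least one of the HPO or VCF inputs is needed, as in every case listed.

```python
import shlex


def build_divine_command(out_dir, hpo=None, vcf=None, ped=None, proband_id=None):
    """Build the divine.py argument list for the inputs that are available.

    Assumption: at least one of hpo or vcf must be given.
    """
    if hpo is None and vcf is None:
        raise ValueError("provide an HPO file, a VCF file, or both")
    argv = ["divine.py"]
    for flag, value in (("-q", hpo), ("-v", vcf), ("-f", ped), ("-p", proband_id)):
        if value is not None:
            argv += [flag, value]
    argv += ["-o", out_dir]
    return argv


if __name__ == "__main__":
    argv = build_divine_command(
        "dir_to_output/P0003",
        hpo="dir_to_the_hpo/P0003.hpo",
        vcf="dir_to_the_vcf/P0003.vcf",
    )
    # Print a shell-safe command line; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Returning a list rather than a string avoids quoting problems when paths contain unusual characters.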

Examples

The resource package includes demo samples that can be run with the following scripts:

$ cd $DIVINE/gcn/bin/prioritize/examples
$ ./runme_angelman.sh #when only HPO data is available
$ ./runme_pfeisffer_noHpo.sh #when only VCF is available 
$ ./runme_pfeisffer.sh #when both HPO and VCF are available
$ ./runme_millerSyndrome.sh #when both HPO and VCF are available
$ ./runme_trio.sh #analyze family samples (a PED file is required, and its sample IDs must match those in the VCF file)
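
Since the trio analysis needs the PED sample IDs to match those in the VCF file, a pre-flight check along these lines may save a failed run. It is a sketch, not Divine code. `ped_ids_missing_from_vcf` is a hypothetical helper that assumes the common whitespace-delimited PED layout (family ID, individual ID, father, mother, sex, phenotype) and takes the VCF sample IDs as a plain list.

```python
def ped_ids_missing_from_vcf(ped_lines, vcf_sample_ids):
    """Return PED individual IDs (second column) not found among the VCF samples.

    Assumption: standard PED columns; blank and '#' lines are skipped.
    """
    vcf_ids = set(vcf_sample_ids)
    missing = []
    for raw in ped_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        individual_id = line.split()[1]
        if individual_id not in vcf_ids:
            missing.append(individual_id)
    return missing


if __name__ == "__main__":
    ped = [
        "FAM1 father 0 0 1 1",
        "FAM1 mother 0 0 2 1",
        "FAM1 proband father mother 1 2",
    ]
    # Sample IDs would normally come from the #CHROM line of the VCF file.
    print(ped_ids_missing_from_vcf(ped, ["father", "proband"]))
```

An empty result means every individual in the pedigree has a matching sample column in the VCF file.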

Help

usage: divine.py [-h] [-q HPO_QUERY] [-v VCF] [-o OUT_DIR] [-c VCF_FILTER_CFG]
                 [-f PED] [-p PROBAND_ID] [-d EXP_TAG] [-i INDEL_FIDEL]
                 [-K TOP_K_DISEASE] [-r GO_SEED_K] [-e REF_EXON_ONLY]
                 [-C CADD] [-j COSMIC] [-D DBLINK] [-H HGMD] [-k VKNOWN]
                 [-t CAPKIT] [--reuse]

Divine (v0.1.2) [author:hongc2<at>ccf.org]

optional arguments:
  -h, --help            show this help message and exit
  -q HPO_QUERY, --hpo HPO_QUERY
                        Input patient HPO file. A file contains HPO IDs (e.g.,
                        HP:0002307), one entry per line. Refer to
                        http://compbio.charite.de/phenomizer or
                        https://mseqdr.org/search_phenotype.php
  -v VCF, --vcf VCF     input vcf file
  -o OUT_DIR, --out_dir OUT_DIR
                        output directory without white space. If not exist,
                        the directory will be created.
  -c VCF_FILTER_CFG, --vcf_filter_cfg VCF_FILTER_CFG
                        vcf filter configuration file [None]
  -f PED, --family_fn PED
                        family pedigree file [None]
  -p PROBAND_ID, --proband_id PROBAND_ID
                        proband sample ID [None]
  -d EXP_TAG, --exp_tag EXP_TAG
                        specify experiment tag without white space. The tag
                        will be contained in the output file name.[None]
  -i INDEL_FIDEL, --indel INDEL_FIDEL
                        the level of fidelity of indell call in VCF, [1]:low
                        (e.g., samtools), 2:high (GATK haplotype caller)
  -K TOP_K_DISEASE      focus on top-K disease associated with the input
                        phenotypes [0], set 0 to consider all
  -r GO_SEED_K, --go_seed_k GO_SEED_K
                        the number of top-k diseases for GO enrichment [3];
                        set to 0 to disable
  -e REF_EXON_ONLY, --ref_exon_only REF_EXON_ONLY
                        the annotation process only runs on RefSeq coding
                        regions 0:No, [1]:Yes
  -C CADD, --cadd CADD  use CADD prediction score, 0:No, [1]:Yes
  -j COSMIC, --cosmic COSMIC
                        enable COSMIC, [0]:No, 1:Yes
  -D DBLINK, --dblink DBLINK
                        enable dblink, [0]:No, 1:Yes
  -H HGMD, --hgmd HGMD  enable HGMD (requires a license), [0]:No, 1:Yes
  -k VKNOWN, --vknown VKNOWN
                        apply variant-level pathogenic annotation (e.g.,
                        either ClinVar or HGMD) to prioritization strategy,
                        0:No, [1]:Yes
  -t CAPKIT             capture kit symbol [None],SureSelect_V6,SeqCapEZ_Exome
  --reuse               Reuse previous annotation file (divine.vcf) if it is
                        available [False]

FAQ

[fltr]
excl=LowDP:LowGQ:LowGQX:LowQual:IndelGap:SnpGap
#incl=PASS
[infoflag]
excl=DB137
[infoval]
kgaf=yes
espaf=yes
exacaf=yes
splice_dist=20
hgmd_filter=2
regulome=no
#min_depth=10
[reg]
incl=CodingExonic:NonCodingExonic:CodingIntronic:NonCodingIntronic
[freq]
incl=0.01
[freq_cli]
incl=0.05
[gid]
min=0.1
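
The block above is INI-style, so it can be inspected with Python's standard `configparser`. This is a sketch only. Whether Divine parses the file this way is an assumption, `load_filter_config` is a hypothetical helper, and no meaning is attached to the values beyond reading them. The delimiter is restricted to `=` because the values themselves contain `:`.

```python
import configparser

SAMPLE = """\
[fltr]
excl=LowDP:LowGQ:LowGQX:LowQual:IndelGap:SnpGap
#incl=PASS
[freq]
incl=0.01
"""


def load_filter_config(text):
    """Parse INI-style filter settings; lines starting with '#' are comments."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    return parser


if __name__ == "__main__":
    cfg = load_filter_config(SAMPLE)
    # Colon-separated values are split into a list of labels.
    print(cfg["fltr"]["excl"].split(":"))
    print(cfg.getfloat("freq", "incl"))
```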
$ $DIVINE/gcn/bin/prioritize/divine.py -q dir_to_the_hpo/P0005.hpo \
   -v dir_to_the_vcf/P0005.vcf --reuse -d rev -o dir_to_the_output/P0005

$ your_divine_command 2>&1 | tee divine_err.log

Change Log

License

GNU GENERAL PUBLIC LICENSE https://www.gnu.org/licenses/gpl-3.0.en.html

Disclaimer

Not intended for direct clinical application. Divine suggests an order in which genes should be inspected, to make molecular diagnosis more efficient. Validation is the responsibility of the user. Neither the Divine developers nor any integrated software module is responsible for clinical actions that may result from the use of this software. By using this tool, the user assumes all responsibility for any information that may be generated.

Reference

Contact