mvPPT

mvPPT (Pathogenicity Prediction Tool for missense variants) is a comprehensive pathogenicity prediction tool.
The mvPPT website is at http://www.mvppt.club/, and the scores can also be downloaded from Google Drive.

Training data

Three training sets were built from different combinations of variants from ClinVar, HGMD, UniProt, and the Genome Aggregation Database (gnomAD).

Annotation

Variants were annotated with ANNOVAR. The annotated features fall into three groups:

A) AFs, AAFs, and GFs of variants estimated from 125,748 exomes in gnomAD (version 2.1.1);
B) genomic context of the variant, i.e., region/gene-based information from GeVIR, VIRLoF, oe mis upper, HIP, CCRs, InterPro domain, and amino acid sequence before and after mutation;
C) pathogenicity likelihood scores assessed by different component tools, including MutationAssessor, SIFT, PROVEAN, GERP++ RS, phyloP, phastCons, and SiPhy.

We annotated the datasets with ANNOVAR using dbNSFP (v4.1a, see URLs) to generate some of the required features and prediction scores from different component tools, including InterPro domain, MutationAssessor, phyloP, GERP, phastCons, PROVEAN, and SiPhy. Mutations located in InterPro domains were recorded as 1 and the rest were recorded as 0. AFs, GFs, and AAFs of each variant in different populations were obtained from the gnomAD exomes database. If a variant was not present in the database, its AFs, AAFs, HomFs, and HetFs were assigned 0 and its WtFs were assigned 1. The GeVIR, VIRLoF, oe mis upper, HIP, and CCR scores were downloaded from their respective websites (see URLs). One-hot encoding was applied to the amino acid sequence, representing each amino acid as a binary vector of length 20 with a single non-zero value. All features were selected to provide complementary information, and they either did not require training or their training data are publicly available, so that it could be excluded from our data.
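The two encoding rules above (one-hot amino acids, default frequencies for variants absent from gnomAD) can be sketched as follows. This is a minimal illustration, not the code used in mvPPT: the residue ordering, the function names, and the frequency field names are assumptions, and the sketch encodes only the reference and alternate residue because the number of residues actually encoded is not stated here.

```python
# The 20 standard amino acids in one-letter code. The ordering is an
# assumption; any fixed ordering gives a valid one-hot encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical field names for the frequency features of one variant.
# Values follow the rule in the text: 0 for AF/AAF/HomF/HetF, 1 for WtF.
ABSENT_DEFAULTS = {"AF": 0.0, "AAF": 0.0, "HomF": 0.0, "HetF": 0.0, "WtF": 1.0}


def one_hot(residue):
    """Return a length-20 binary vector with a single 1 at the residue's index."""
    if len(residue) != 1 or residue not in AMINO_ACIDS:
        raise ValueError(f"not a standard amino acid: {residue!r}")
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec


def encode_substitution(ref, alt):
    """Concatenate the encodings of the residue before and after mutation."""
    return one_hot(ref) + one_hot(alt)


def fill_absent(record):
    """Return the gnomAD record, or the default frequencies if the variant is absent.

    `record` is None when the variant was not found in the database.
    """
    return dict(ABSENT_DEFAULTS) if record is None else record
```

For example, `encode_substitution("R", "H")` yields 40 values with exactly two of them set to 1, and `fill_absent(None)` yields a wild-type frequency of 1 with all other frequencies at 0.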

The MVP, REVEL, PrimateAI, FATHMM-XF, ClinPred, MetaSVM/MetaLR, PolyPhen2, and VEST4 scores were obtained from dbNSFP v4.1a. The M-CAP (version 1.4), MISTIC, CAPICE, ReVe, and CADD (version 1.6) scores were downloaded from their respective websites.

Training

mvPPT was trained using the Python package LightGBM (version 2.3.1), and parameters were tuned by Bayesian optimization (version 1.2.0). The random state was set to 1 throughout the model training process.
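A tuning loop of this kind can be sketched as below. This is only an illustration under stated assumptions: the tuned parameters, their bounds, the 5-fold cross-validated ROC AUC objective, and the helper names `cast_params` and `tune` are all hypothetical and are not taken from the mvPPT source. Only the use of LightGBM, the `bayesian-optimization` package, and a random state of 1 comes from the text.

```python
def cast_params(num_leaves, learning_rate, n_estimators):
    """Turn the continuous values proposed by the optimizer into valid
    LightGBM arguments (integer-valued parameters must be integers)."""
    return {
        "num_leaves": max(2, int(round(num_leaves))),
        "learning_rate": float(learning_rate),
        "n_estimators": max(1, int(round(n_estimators))),
        "random_state": 1,  # fixed random state, as stated in the text
    }


def tune(X, y, init_points=5, n_iter=20):
    """Search the (assumed) parameter space and fit a final model on X, y."""
    # Third-party imports are kept local so cast_params can be used
    # without LightGBM or bayesian-optimization installed.
    import lightgbm as lgb
    from bayes_opt import BayesianOptimization
    from sklearn.model_selection import cross_val_score

    def objective(num_leaves, learning_rate, n_estimators):
        # Assumed objective: mean ROC AUC over 5 cross-validation folds.
        params = cast_params(num_leaves, learning_rate, n_estimators)
        model = lgb.LGBMClassifier(**params)
        return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

    optimizer = BayesianOptimization(
        f=objective,
        # Hypothetical bounds, chosen only for illustration.
        pbounds={
            "num_leaves": (8, 128),
            "learning_rate": (0.01, 0.3),
            "n_estimators": (50, 500),
        },
        random_state=1,
    )
    optimizer.maximize(init_points=init_points, n_iter=n_iter)

    best = cast_params(**optimizer.max["params"])
    return lgb.LGBMClassifier(**best).fit(X, y)
```

Calling `tune(X, y)` with a feature matrix and binary labels returns a fitted classifier; rounding inside `cast_params` is needed because the optimizer proposes real-valued points even for integer parameters.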

Environments

mvPPT was built in the following environment:

python 3.7.4
sklearn 0.22.1
numpy 1.17.3
scipy 1.4.1
pandas 0.25.3
matplotlib 3.1.2
lightGBM 2.3.1
bayesian-optimization 1.1.0

How to predict scores yourself

runtime environment (we recommend conda)

conda create -n mvppt python==3.7
conda activate mvppt
conda install --file=requirements_conda.txt

annotation
- You need to apply for ANNOVAR and add the scripts to annovar.
- Because of their size, only some databases are uploaded to annodb.
- Unzip the files in annodb.
- You can download ensGene and dbnsfp35a with ANNOVAR.
- We provide only a demo of gnomad211exoms_allpop and p6b; you have to build the full files yourself from gnomad_v211 and dbnsfp41a.
- For gnomad_AAF.txt.gz, join the split parts and decompress the result:
```
cat xaa xab > gnomad_AAF.txt.gz
gunzip gnomad_AAF.txt.gz
```
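Where `cat` and `gunzip` are not available, the same join-and-decompress step can be sketched with the Python standard library. The function name `join_and_decompress` is hypothetical; the part names `xaa` and `xab` are the ones used in the commands above.

```python
import gzip
import shutil
from pathlib import Path


def join_and_decompress(parts, output):
    """Concatenate split gzip chunks in order, then decompress them to `output`."""
    output = Path(output)
    joined = output.with_name(output.name + ".gz")

    # Equivalent of: cat xaa xab > gnomad_AAF.txt.gz
    with open(joined, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)

    # Equivalent of: gunzip gnomad_AAF.txt.gz
    with gzip.open(joined, "rb") as src, open(output, "wb") as dst:
        shutil.copyfileobj(src, dst)
    joined.unlink()  # gunzip removes the compressed file as well
    return output
```

A call such as `join_and_decompress(["xaa", "xab"], "gnomad_AAF.txt")` leaves `gnomad_AAF.txt` in place. The parts must be passed in their original order, since the chunks are only valid gzip data once joined.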
Then annotate your VCF file:
```
bash src/anno.sh filename.vcf
python3 src/annoseq.py filenameGeneUniqAnno.txt filenameTotalGeneUniqAnnoEnsSeq.txt
```

predict

python3 src/predict.py filenameTotalGeneUniqAnnoEnsSeq.txt
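The annotation and prediction commands can be chained from one small driver, sketched below. It assumes that the intermediate file names follow the pattern shown in the commands above (the input name without `.vcf`, plus `GeneUniqAnno.txt` and `TotalGeneUniqAnnoEnsSeq.txt`) and that it is run from the repository root. The names `build_commands` and `run_pipeline` are hypothetical and not part of mvPPT.

```python
import subprocess
from pathlib import Path


def build_commands(vcf):
    """Return the three commands for one VCF file, as argument lists."""
    vcf = Path(vcf)
    if vcf.suffix != ".vcf":
        raise ValueError(f"expected a .vcf file, got {vcf.name}")
    stem = str(vcf.with_suffix(""))  # "filename.vcf" -> "filename"
    # Assumed naming convention, inferred from the example commands.
    annotated = stem + "GeneUniqAnno.txt"
    with_seq = stem + "TotalGeneUniqAnnoEnsSeq.txt"
    return [
        ["bash", "src/anno.sh", str(vcf)],
        ["python3", "src/annoseq.py", annotated, with_seq],
        ["python3", "src/predict.py", with_seq],
    ]


def run_pipeline(vcf):
    """Run annotation and prediction in order, stopping at the first failure."""
    for cmd in build_commands(vcf):
        subprocess.run(cmd, check=True)
```

With `check=True`, a failing step raises `subprocess.CalledProcessError`, so prediction is never run on a missing or partial annotation file.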
