tongshiyuan / mvPPT

A comprehensive prediction tool, mvPPT (Pathogenicity Prediction Tool for missense variants).
MIT License
mvPPT

The mvPPT website is at http://www.mvppt.club/
Precomputed scores are also available on Google Drive.

Training data

Three training sets were built from different combinations of variants from ClinVar, HGMD, UniProt, and the Genome Aggregation Database (gnomAD).

Annotation

Variants were annotated with ANNOVAR.

We annotated the datasets with ANNOVAR using dbNSFP (v4.1a, see URLs) to generate some of the required prediction scores from the component tools, including InterPro domain, MutationAssessor, phyloP, GERP, phastCons, PROVEAN, and SiPhy. Variants located in InterPro domains were recorded as 1 and all others as 0. The AFs, GFs, and AAFs of each variant in different populations were obtained from the gnomAD exomes database. If a variant was not present in the database, its AFs, AAFs, HomFs, and HetFs were set to 0 and its WtFs were set to 1. The GeVIR, VIRLoF, oe mis upper, HIP, and CCR scores were downloaded from their respective websites (see URLs). One-hot encoding was applied to the amino acid sequence, representing each amino acid as a binary vector of length 20 with a single non-zero value. All features were selected to provide complementary information, and they either did not require training or their training data are publicly available, which allows those data to be excluded from ours.
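The three encodings described above (one-hot amino acids, the binary InterPro flag, and the defaults for variants absent from gnomAD) can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the field names `AF`, `AAF`, `HomF`, `HetF`, and `WtF`, the residue ordering, and the assumption that a missing InterPro annotation appears as `None`, an empty string, or `.` are all hypothetical.

```python
# Illustrative sketch only; names and field layout are assumptions, not the repo's code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def one_hot_amino_acid(residue):
    """Return a length-20 binary vector with a single 1 at the residue's index."""
    index = AMINO_ACIDS.index(residue.upper())  # ValueError for non-standard residues
    return [1 if i == index else 0 for i in range(len(AMINO_ACIDS))]


def interpro_flag(domain_annotation):
    """Return 1 if the variant has an InterPro domain annotation, else 0.

    Assumes a missing annotation is None, "" or "." (an assumption).
    """
    return 0 if domain_annotation in (None, "", ".") else 1


# Defaults for variants absent from gnomAD: frequencies 0, wild-type frequency 1.
ABSENT_DEFAULTS = {"AF": 0.0, "AAF": 0.0, "HomF": 0.0, "HetF": 0.0, "WtF": 1.0}


def fill_absent_frequencies(record):
    """Return a copy of `record` with missing or None frequency fields filled in."""
    filled = dict(record)
    for key, default in ABSENT_DEFAULTS.items():
        if filled.get(key) is None:
            filled[key] = default
    return filled
```

For example, `one_hot_amino_acid("A")` yields a vector whose first element is 1 and whose other 19 elements are 0, and `fill_absent_frequencies({})` returns the full set of defaults.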

The MVP, REVEL, PrimateAI, FATHMM-XF, ClinPred, MetaSVM/MetaLR, PolyPhen2, and VEST4 scores were obtained from dbNSFP v4.1a. The M-CAP (version 1.4), MISTIC, CAPICE, ReVe, and CADD (version 1.6) scores were downloaded from their respective websites.

Training

mvPPT was trained using the Python package LightGBM (version 2.3.1), and parameters were tuned by Bayesian optimization (version 1.2.0). The random state was set to 1 throughout model training.
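A rough sketch of what such a tuning loop might look like is shown below. It assumes the `lightgbm`, `bayesian-optimization` (imported as `bayes_opt`), and `scikit-learn` packages are installed. The search space in `PBOUNDS`, the choice of 5-fold cross-validated ROC AUC as the objective, and the function names are hypothetical; the README does not say which parameters were tuned or how they were scored. Only the fixed random state of 1 comes from the text.

```python
SEED = 1  # the README states the random state was 1 throughout training

# Hypothetical search space; the tuned parameters are not listed in the README.
PBOUNDS = {
    "num_leaves": (15, 255),
    "learning_rate": (0.01, 0.3),
    "min_child_samples": (5, 100),
}


def to_lgb_params(num_leaves, learning_rate, min_child_samples):
    """Convert the optimizer's continuous suggestions into LightGBM parameters."""
    return {
        "num_leaves": int(round(num_leaves)),            # must be an integer
        "learning_rate": float(learning_rate),
        "min_child_samples": int(round(min_child_samples)),
        "random_state": SEED,
    }


def tune(X, y, init_points=5, n_iter=20):
    """Search PBOUNDS for the parameters with the best cross-validated ROC AUC."""
    # Imported lazily so to_lgb_params can be used without these packages.
    import lightgbm as lgb
    from bayes_opt import BayesianOptimization
    from sklearn.model_selection import cross_val_score

    def objective(**suggestion):
        model = lgb.LGBMClassifier(**to_lgb_params(**suggestion))
        return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

    optimizer = BayesianOptimization(f=objective, pbounds=PBOUNDS, random_state=SEED)
    optimizer.maximize(init_points=init_points, n_iter=n_iter)
    return to_lgb_params(**optimizer.max["params"])
```

The optimizer proposes floating-point values, so integer-valued parameters such as `num_leaves` are rounded before they reach LightGBM. Fixing the same seed in both the optimizer and the model keeps repeated runs comparable.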

Environments

The environments of mvPPT built in our study:

How to predict scores yourself

  1. runtime environment (We recommend conda)
conda create -n mvppt python==3.7
conda activate mvppt
conda install --file=requirements_conda.txt
  2. annotation

    • You need to apply for ANNOVAR and add its scripts to annovar
    • Because of the data size, we uploaded some of the databases to annodb
    • You need to unzip the files in annodb
    • You can download ensGene and dbnsfp35a with ANNOVAR
    • We provide only a demo of gnomad211exoms_allpop and p6b; you have to obtain the full data yourself from gnomad_v211 and dbnsfp41a
    • For gnomad_AAF.txt.gz, join the split parts and decompress:
      cat xaa xab > gnomad_AAF.txt.gz
      gunzip gnomad_AAF.txt.gz
    • Then run the annotation:
      bash src/anno.sh filename.vcf
      python3 src/annoseq.py filenameGeneUniqAnno.txt filenameTotalGeneUniqAnnoEnsSeq.txt
  3. predict

    python3 src/predict.py filenameTotalGeneUniqAnnoEnsSeq.txt
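
The annotation and prediction commands above can be chained from a small Python driver, sketched here under some assumptions. It assumes it is run from the repository root with the VCF in the current directory, that `src/anno.sh` writes `<name>GeneUniqAnno.txt`, and that the second argument to `src/annoseq.py` is its output file. These file names are inferred from the command pattern shown above and are not confirmed by the scripts themselves. `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_commands(vcf_path):
    """Return the annotation and prediction commands as argument lists."""
    stem = Path(vcf_path).stem  # "filename.vcf" -> "filename"
    annotated = f"{stem}GeneUniqAnno.txt"                 # assumed output of src/anno.sh
    with_sequence = f"{stem}TotalGeneUniqAnnoEnsSeq.txt"  # assumed output of src/annoseq.py
    return [
        ["bash", "src/anno.sh", str(vcf_path)],
        ["python3", "src/annoseq.py", annotated, with_sequence],
        ["python3", "src/predict.py", with_sequence],
    ]


def run_pipeline(vcf_path):
    """Run each step in order; check=True stops at the first failing step."""
    for command in build_commands(vcf_path):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("filename.vcf")` would then execute the same three commands listed in the steps above, one after another.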