pathoscore evaluates variant pathogenicity tools and scores.
evaluating scores is hard because logic can be circular and benign and pathogenic sets are hard to curate and evaluate.
pathoscore
is software and datasets that facilitate applying evaluating pathogenicity scores.
The sections below describe the tools.
Annotate a vcf with some scores (which can be bed or vcf). Note that this tool is a simple wrapper around vcfanno so a user can instead use to run vcfanno directly.
python pathoscore.py annotate \
--scores exac-ccrs.bed.gz:exac_ccr:14:max \
--scores mpc.regions.clean.sorted.bed.gz:mpc_regions:5:max \
--exclude /data/gemini_install/data/gemini_data/ExAC.r0.3.sites.vep.tidy.vcf.gz \
--conf combined-score.conf \
testing-denovos.vcf.gz
The individual flags are described here:
The scores
format is path:name:column:op
where:
name becomes the new name in the INFO field.
column indicates the column number (or INFO name) to pull from the scores VCF.
op is a vcfanno
operation.
multiple annotations for the same file can be used as such:
python pathoscore.py annotate --prefix benign \
--scores score-sets/GRCh37/aloft/aloft.txt.gz:aloft_het,aloft_lof,aloft_rec:5,6,7:max,max,max \
truth-sets/GRCh37/clinvar/clinvar-benign.20170905.vcf.gz
can be a population VCF that is used to filter would-be pathogenic variants (as we know that common variants can't be pathogenic). This can also be a set of regions to exclude, and for user convenience we curated gene sets that the user can filter on such as autosomal dominant genes from Berg et al. (2013) and haploinsufficient genes from Dang et al. (2008).
an optional vcfanno conf file so users can specify exactly how to annotate if they feel comfortable doing so.
This can also be used to specify vcfanno [[postannotation]]
blocks, for example, to combine scores.
An example conf
to combine 2 scores looks like:
[[postannotation]]
name="combined"
op="lua:exac_ccr+10\*cadd"
fields=["exac_ccr", "cadd"]
type="Float"
python pathoscore.py evaluate \
-s MPC \
-s exac_ccr \
-i mpc_regions \
-s combined \
--goi listofgenesofinterest \
pathogenic.vcf.gz \
benign.vcf.gz
This will take the output(s) from annotate
and create ROC curves and score distribution plots.
It assumes that the first VCF contains pathogenic variants and the 2nd contains benign variants.
It uses the columns specified via -s
and -i
as the scores.
-i
indicates that lower scores are more constrained where as
-s
is for fields where higher scores are more constrained.
--goi
is to provide a newline delimited file of genes of interest for a clinical utility calculation. More information is provided in the wiki.
An example ROC curve for the Clinvar truth-set looks like this:
The point in the plot shows the max J Statistic which can be summarized as the point in each curve where the vertical distance to the Y=X line is maximized. This has its highest possible value at an FPR of 0 so there is an implicit penalty for having a high TPR at a high-ish FPR.
We also report the full distrubtion of J statistics:
finally, we report the proportion of benign and pathogenic variants scored in a truth-set:
These plots, along with the score-distributions for each method for pathogenic and benign, are aggregated into a single HTML report.
Download a vcfanno binary for your system and make it available as
vcfanno
on your $PATH
Then run:
pip install -r requirements.txt
Then you should be able to run the evaluation scripts.
Part of pathoscore
is to provide curated truth sets that can be used for evaluation.
These are kept in truth-sets/
. Each set has a benign and/or a pathogenic set.
Pull-requests for recipes that add new truth sets are welcomed. These should include a make.sh
script that, when run will pull from the original data source and make a benign and/or pathogenic
vcf that is bgzipped and tabixed and made as small as possible (see the clinvar example for how
to remove unneeded fields from the INFO field).
All truth-sets should be annotated with bcftools csq
so that it's possible to choose to score only
functional variants.
Currently we have:
Pathogenic
or Likely-Pathogenic
and variants with uncertainty are removed.Benign
or Likely-Benign
and variants with uncertainty are removed.These are from Kaitlin Samocha's paper on mis-sense contraint.
control
in her source file.Some alleged pathogenic variants may appear at high allele frequencies in population databases, and some users may understandably find those variants suspect. If you would like to filter out variants on allele frequency in a population set. An example conf file is provided in the repo called af.conf. If you have additional filtering parameters you'd like to specify you can also use a conf file for that as detailed in vcfanno's repo.
And then you can run the pathoscore script as below:
python pathoscore.py annotate --scores score-sets/GRCh37/MPC/mpc.txt.gz:MPC:5:max --scores score-sets/GRCh37/REVEL/revel.txt.gz:REVEL:7:max truth-sets/GRCh37/samocha/samocha.pathogenic.vcf.gz --prefix neurodev --conf af.conf
Just make sure that you don't use a file more than once in the conf file, write everything you want to do for each file in a list as shown above. Additionally, don't use any fields like --scores or --exclude to perform things on a file that is already referenced in the conf file you provide to pathoscore. It will not work.
For user convenience, under scripts/gnomad, there are make scripts for generating vt normalized, decomposed and BCSQ annotated ExAC v1 and gnomAD VCF files, so that you can filter by allele frequency in those population datasets.