Problem Statement: Connecting VCF processing to clinical evidence efficiently can be challenging
What tools exist out there?
bcftools / pysam:
Handles VCF processing, filtering, etc.
[sensitive to incorrectly formatted VCF headers / limited scope]
vrs-python:
able to translate into fully justified allele using coordinate info!
Can even annotate your VCF w/ VRS IDs and write VRS objects
[Not connected to clinical evidence]
metakb:
clinical evidence in a meta knowledgebase
Contains CIViC and MOA data
has VRS IDs and specific study API
[no connection to VCFs]
Solution: vrs_anvil annotate
Outline tool here (Walsh’s flow diagram) — this could be used as a reference for all other pieces
What it does
organizes settings in a manifest to pull VCFs
Gets allele VRS ID per variant using vrs-python w/ threading, multiprocessing, and caching
identifies hits against a local metakb cache (the hit rate is low, so no need to query the API every time)
writes to logs and metrics file
packaged in a CLI!
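The steps above can be sketched roughly as follows. This is a minimal illustration of the threading + caching + local-cache-lookup pattern only, not the vrs_anvil implementation. Assumptions: placeholder_allele_id is a hypothetical stand-in for the vrs-python translation step (it hashes the coordinates and does NOT produce real VRS IDs), and known_ids is a hypothetical in-memory stand-in for the local metakb cache.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def placeholder_allele_id(chrom, pos, ref, alt):
    """Hypothetical stand-in for vrs-python translation; NOT a real VRS ID."""
    key = f"{chrom}-{pos}-{ref}-{alt}"
    return "placeholder:" + hashlib.sha256(key.encode()).hexdigest()[:16]


@lru_cache(maxsize=None)
def cached_allele_id(chrom, pos, ref, alt):
    # Repeated variants (common across samples) skip recomputation.
    return placeholder_allele_id(chrom, pos, ref, alt)


def find_hits(variants, known_ids, max_workers=4):
    """Return (variant, allele_id) pairs whose ID is in the local cache."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so IDs line up with variants.
        ids = list(pool.map(lambda v: cached_allele_id(*v), variants))
    return [(v, i) for v, i in zip(variants, ids) if i in known_ids]


# Toy data for illustration only.
variants = [
    ("chr7", 140753336, "A", "T"),
    ("chr1", 12345, "G", "C"),
    ("chr7", 140753336, "A", "T"),  # duplicate, served from the cache
]
known_ids = {placeholder_allele_id("chr7", 140753336, "A", "T")}
hits = find_hits(variants, known_ids)
```

Because hits are checked against a local set, only the few matching IDs would ever need a follow-up query for study details.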
How we might use it: (1000G proof of concept)
Gather VCFs required through Terra
Create manifest.yaml
Run nohup vrs_anvil annotate --scatter &
Analysis: 1000-figures.ipynb
Get the study IDs associated with each metakb cache hit
Get % samples with variant match
Visualize number of samples per patient
Give a few example descriptions of study hits
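The first two analysis steps could look something like the sketch below. The record layout (sample_id, allele_id, study_ids) and all IDs shown are hypothetical toy values, not the actual vrs_anvil metrics format or real 1000G results.

```python
from collections import defaultdict


def summarize_hits(hits, all_samples):
    """Collect study IDs per allele and the % of samples with any hit.

    hits: iterable of (sample_id, allele_id, study_ids) tuples
          (hypothetical layout, assumed for this sketch).
    all_samples: every sample that was processed, with or without hits.
    """
    studies_by_allele = defaultdict(set)
    samples_with_hit = set()
    for sample_id, allele_id, study_ids in hits:
        studies_by_allele[allele_id].update(study_ids)
        samples_with_hit.add(sample_id)
    total = len(all_samples)
    pct = 100.0 * len(samples_with_hit) / total if total else 0.0
    return dict(studies_by_allele), pct


# Toy data for illustration only.
all_samples = ["S1", "S2", "S3", "S4"]
hits = [
    ("S1", "allele:a", ["study:1"]),
    ("S1", "allele:b", ["study:2"]),
    ("S3", "allele:a", ["study:1", "study:3"]),
]
studies_by_allele, pct_with_match = summarize_hits(hits, all_samples)
```

The denominator is the full sample list rather than only samples with hits, so samples with no match still count toward the percentage.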
Discussion: Pros and Cons and when to use
vrs-python VCFAnnotator: direct from source, need translation only, need annotated VCFs
metakb api: already have set of identifiers and just want study results
vrs_anvil: need threading / multiprocessing baked in, want to connect to metakb, error handling, organized file runs [LOOK AT ISSUES resolved to see features]