gks-anvil / vrs_anvil_toolkit

Extract clinical variant interpretations from VCF using GA4GH VRS IDs
MIT License

feature/presentation-outline #74

Closed bwalsh closed 7 months ago

bwalsh commented 7 months ago
  1. Problem Statement: Connecting VCF processing to clinical evidence efficiently can be challenging
  2. What tools exist out there?
    • bcftools / pysam:
      • Handle VCF processing, filtering, etc.
      • [sensitive to incorrectly formatted VCF headers / limited scope]
    • vrs-python:
      • Able to translate a variant into a fully justified allele using coordinate info!
      • Can even annotate your VCF w/ VRS IDs and write VRS objects
      • [Not connected to clinical evidence]
    • metakb:
      • Clinical evidence in a meta-knowledgebase
      • Contains CIViC and MOA data
      • Has VRS IDs and an API for retrieving specific studies
      • [no connection to VCFs]
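
To make the "coordinate info" in item 2 concrete, here is a minimal sketch that pulls CHROM, POS, REF, and ALT out of a single VCF data line and joins them into a gnomAD-style `chrom-pos-ref-alt` string, one per ALT allele. The function name `vcf_line_to_gnomad` and the example record are illustrative assumptions. Passing these strings on to vrs-python for translation is assumed and not shown here.

```python
def vcf_line_to_gnomad(line: str) -> list[str]:
    """Return one 'chrom-pos-ref-alt' string per ALT allele in a VCF data line.

    Illustrative helper only: symbolic alleles such as <DEL> are not handled.
    """
    # The first five tab-separated VCF columns are CHROM, POS, ID, REF, ALT.
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    # Multi-allelic sites list ALT alleles comma-separated; "." means no ALT.
    return [f"{chrom}-{pos}-{ref}-{a}" for a in alt.split(",") if a != "."]


# Made-up record for illustration.
record = "chr7\t140753336\t.\tA\tT,G\t.\tPASS\t."
print(vcf_line_to_gnomad(record))
```

Splitting multi-allelic sites up front keeps every downstream step working on exactly one allele at a time.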
  3. Solution: vrs_anvil annotate
    • Outline tool here (Walsh’s flow diagram); this could be used as a reference for all other pieces
    • What it does
      • Organizes settings in a manifest to pull VCFs
      • Gets the allele VRS ID per variant using vrs-python w/ threading, multiprocessing, and caching
      • Identifies hits against a local metakb cache (the hit rate is low, so no need to query the API every time)
      • Writes to logs and a metrics file
      • Packaged in a CLI!
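
The "What it does" steps can be sketched roughly as follows. This is not the vrs_anvil implementation: `placeholder_id` is a hypothetical stand-in for the vrs-python call and does not produce real VRS IDs, the local cache is a plain dict, and the study name is made up. It only shows the shape of the flow: compute an ID per variant in a thread pool with memoization, look each ID up in a local cache, and collect simple metrics.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def placeholder_id(variant: str) -> str:
    """Hypothetical stand-in for the vrs-python call; NOT a real VRS ID."""
    return "demo:" + hashlib.sha256(variant.encode()).hexdigest()[:12]


def annotate(variants, cache, max_workers=4):
    """Return (hits, metrics) for variants looked up in a local evidence cache."""
    # Compute IDs concurrently; repeated variants are served from lru_cache.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        ids = list(pool.map(placeholder_id, variants))
    # Keep only the variants whose ID is present in the local cache.
    hits = {v: cache[i] for v, i in zip(variants, ids) if i in cache}
    metrics = {"total": len(variants), "unique_hits": len(hits)}
    return hits, metrics


known = "7-140753336-A-T"
# Hypothetical local cache: ID -> list of study identifiers.
local_cache = {placeholder_id(known): ["hypothetical-study-1"]}
hits, metrics = annotate([known, "1-100-C-G", known], local_cache)
print(hits)
print(metrics)
```

Checking a local cache first fits the observation above: when hits are rare, most variants never need a remote query.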
  4. How we might use it: (1000G proof of concept)
    • Gather VCFs required through Terra
    • Create manifest.yaml
    • Run nohup vrs_anvil annotate --scatter &
    • Analysis: 1000-figures.ipynb
      • Get the study IDs associated with each metakb cache hit
      • Get % samples with variant match
      • Visualize number of samples per patient
      • Give a few example descriptions of study hits
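
As a small illustration of the "% samples with variant match" step, the sketch below assumes the annotation results have already been collected into a mapping from sample name to the set of matched IDs. That layout, the sample names, and the IDs are all hypothetical, not the notebook's actual data model.

```python
def percent_samples_with_match(sample_hits: dict[str, set[str]]) -> float:
    """Percentage of samples that have at least one evidence-cache hit."""
    if not sample_hits:
        return 0.0
    # A sample counts as matched when its set of hit IDs is non-empty.
    matched = sum(1 for ids in sample_hits.values() if ids)
    return 100.0 * matched / len(sample_hits)


# Hypothetical per-sample results: sample name -> matched IDs.
sample_hits = {
    "sample_A": {"id1"},
    "sample_B": set(),
    "sample_C": {"id1", "id2"},
    "sample_D": set(),
}
print(percent_samples_with_match(sample_hits))  # 50.0
```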
  5. Discussion: Pros and Cons and when to use
    • vrs-python VCFAnnotator: straight from the source; use when you need translation only or annotated VCFs
    • metakb API: use when you already have a set of identifiers and just want study results
    • vrs_anvil: use when you need threading / multiprocessing built in, a path to metakb, error handling, and organized file runs [LOOK AT resolved ISSUES to see features]
  6. Further work:
    • Cohort Allele Frequency data
    • ???
  7. Wanna experiment? Use this? Contribute?