ohsu-comp-bio / vrs_anvil_toolkit

Extract clinical variant interpretations from VCF using GA4GH VRS IDs
MIT License

GREGoR Processing #92

Open bwalsh opened 1 month ago

bwalsh commented 1 month ago

GREGoR next steps

Use Case:

As a GREGoR analyst, in order to discover genotype-to-phenotype associations, I would like to compare cohort allele frequency (CAF) objects for cohorts from the GREGoR Consortium with CAF objects from the gnomAD consortium.
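To make the use case concrete, here is a minimal sketch of what comparing two CAF records for the same allele might look like. The `Caf` class, its field names (`focus_allele_id`, `focus_allele_count`, `locus_allele_count`), the `compare_caf` helper, and the counts and VRS ID are all assumptions for illustration; the actual CAF schema is still being clarified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caf:
    """Hypothetical, simplified cohort allele frequency record."""
    cohort: str
    focus_allele_id: str      # VRS allele ID (placeholder value below)
    focus_allele_count: int   # observations of the focus allele
    locus_allele_count: int   # total alleles observed at the locus

    @property
    def frequency(self) -> float:
        # Guard against an empty cohort rather than dividing by zero.
        if self.locus_allele_count == 0:
            return 0.0
        return self.focus_allele_count / self.locus_allele_count


def compare_caf(a: Caf, b: Caf) -> dict:
    """Compare the frequency of one allele across two cohorts."""
    if a.focus_allele_id != b.focus_allele_id:
        raise ValueError("CAF objects describe different alleles")
    return {
        "allele": a.focus_allele_id,
        a.cohort: a.frequency,
        b.cohort: b.frequency,
        "difference": a.frequency - b.frequency,
    }


# Made-up counts, purely for illustration.
gregor = Caf("GREGoR", "ga4gh:VA.example", 3, 200)
gnomad = Caf("gnomAD", "ga4gh:VA.example", 10, 100000)
result = compare_caf(gregor, gnomad)
print(result)
```

A real comparison would likely need a statistical test and cohort-level filters rather than a raw difference; this only shows the shape of the inputs and output.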

Test Driven Development:

Fixtures:

Methods:

Method Construction

Acceptance:

jsstevenson commented 1 month ago

:+1:

I suspect that searches run against GREGoR will require different software than searches against gnomAD, so we can probably break that up conceptually. The latter could even be more of a stretch goal if necessary: if we're running out of time and just need a demo, we could always construct the gnomAD CAF objects manually and leave it as a proof of concept. It could also be generalizable beyond this project, though I am not sure how much additional work we'd need to do on top of the existing gnomad utils.

From Tuesday's discussion, I think a parquet file/flat file encompassing just the patient ID/VRS data and maybe some quality parameters for filtering would be the fixture against which a gregor search variation method would run. At least, this is what I've been working on since we spoke, so someone can speak up if I'm running off in the wrong direction.
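As a rough illustration of that fixture, the sketch below uses an in-memory CSV in place of a parquet file so that it runs with only the standard library. The column names (`patient_id`, `vrs_id`, `gq`), the `search_variation` function name, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical flat-file fixture: one row per patient/allele observation.
FIXTURE = """patient_id,vrs_id,gq
P001,ga4gh:VA.aaa,99
P002,ga4gh:VA.aaa,12
P002,ga4gh:VA.bbb,87
P003,ga4gh:VA.bbb,45
"""


def search_variation(handle, vrs_id, min_gq=0):
    """Return sorted patient IDs carrying vrs_id with quality >= min_gq."""
    reader = csv.DictReader(handle)
    patients = {
        row["patient_id"]
        for row in reader
        if row["vrs_id"] == vrs_id and int(row["gq"]) >= min_gq
    }
    return sorted(patients)


# Patients with the allele, keeping only calls with quality of at least 20.
hits = search_variation(io.StringIO(FIXTURE), "ga4gh:VA.aaa", min_gq=20)
print(hits)
```

With a parquet fixture the same filter would run over a dataframe or table instead of a CSV reader, but the inputs (a VRS ID plus quality thresholds) and the output (matching patient IDs) would stay the same.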

bwalsh commented 1 month ago

Notes 9/17: @jsstevenson - can you provide the gs:// path to the vcf(s) you are testing with?

https://github.com/ga4gh/va-spec/

https://github.com/broadinstitute/gnomad_methods - a search exists here

https://github.com/genomicmedlab/gregor - james' work (ignore for now - experimental)

@bwalsh TODO google storage api + tabix: skip to offset

Assumption: schema clarifications for CAF and others to be forthcoming from AlexW and GA4GH discussion

bwalsh commented 1 month ago

Re. remote indexing, the following works in the AnVIL env:

# set this to the remote vcf you have access to
export MY_OBJECT=gs://xxxxxx.vcf.gz

# get the auth token, tabix reads from GCS_OAUTH_TOKEN
export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)

# read the remote object, validate we can list headers
tabix -H $MY_OBJECT | grep -q '#CHROM' && echo 'remote access worked' || echo 'remote access failed'
# >> remote access worked

# assuming MY_OBJECT points at chrY, let's get alleles in the SRY gene
# see https://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000184895;r=Y:2786855-2787682;t=ENST00000383070

tabix -p vcf   $MY_OBJECT chrY:2,786,855-2,787,682 | wc -l
# >> 2

# assuming MY_OBJECT points at chr17, let's get alleles in the BRCA1 gene
# see https://useast.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000012048;r=17:43044295-43170245

tabix -p vcf   $MY_OBJECT chr17:43,044,295-43,170,245  | wc -l
# >> 2592
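
If the same region query needs to be driven from Python, one option is a thin wrapper around the `tabix` command shown above. This is only a sketch: the `tabix_command` and `count_records` helpers are hypothetical names, and `count_records` assumes `tabix` is on the PATH and that the caller supplies a valid token, so it is defined here but not run.

```python
import os
import subprocess


def tabix_command(vcf_uri, contig, start, end):
    """Build the tabix argument list for a 1-based, inclusive region."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return ["tabix", vcf_uri, f"{contig}:{start}-{end}"]


def count_records(vcf_uri, contig, start, end, token):
    """Run tabix against a remote VCF and count the returned records.

    Requires tabix on PATH and a valid OAuth token; not exercised here.
    """
    # tabix reads the token from GCS_OAUTH_TOKEN, as in the shell example.
    env = dict(os.environ, GCS_OAUTH_TOKEN=token)
    out = subprocess.run(
        tabix_command(vcf_uri, contig, start, end),
        env=env, check=True, capture_output=True, text=True,
    )
    return len(out.stdout.splitlines())


# BRCA1 coordinates from the comment above; the gs:// path stays a placeholder.
cmd = tabix_command("gs://xxxxxx.vcf.gz", "chr17", 43044295, 43170245)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with the `gs://` path, and keeping command construction separate from execution makes the region logic testable without network access.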