Open bwalsh opened 1 month ago
:+1:
I suspect that searches run against GREGoR will require different software than searches against gnomAD, so we can probably break that up conceptually. The latter could even be more of a stretch goal if necessary -- if we're running out of time and just need a demo, we could always just manually construct them and leave it as a proof of concept -- but it could also be generalizable beyond this project (I am not sure how much additional work we'd need to do on top of the existing gnomad utils).
From Tuesday's discussion, I think a parquet file/flat file encompassing just the patient ID/VRS data and maybe some quality parameters for filtering would be the fixture against which a gregor search variation method would run. At least, this is what I've been working on since we spoke, so someone can speak up if I'm running off in the wrong direction.
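To make the fixture idea concrete, here is a minimal sketch that uses a tab-separated flat file rather than parquet. The column names (`patient_id`, `vrs_id`, `qual`), the sample rows, and the `search_fixture` helper are all made up for illustration; none of them come from real GREGoR data or from the work in progress described above.

```shell
# Hypothetical fixture: one row per (patient, allele) observation.
# Column names and values are illustrative assumptions, not real GREGoR data.
FIXTURE="$(mktemp)"
printf 'patient_id\tvrs_id\tqual\n' > "$FIXTURE"
printf 'P001\tga4gh:VA.example1\t60\n' >> "$FIXTURE"
printf 'P002\tga4gh:VA.example1\t15\n' >> "$FIXTURE"
printf 'P003\tga4gh:VA.example2\t99\n' >> "$FIXTURE"

# search_fixture FILE VRS_ID MIN_QUAL
# Prints the patient IDs that carry VRS_ID with qual >= MIN_QUAL (header row skipped).
search_fixture() {
  awk -F '\t' -v id="$2" -v min="$3" \
    'NR > 1 && $2 == id && $3 + 0 >= min + 0 { print $1 }' "$1"
}

search_fixture "$FIXTURE" "ga4gh:VA.example1" 30
```

The same query shape (match on the VRS identifier, then filter on a quality column) should carry over to a parquet-backed fixture; only the reader would change.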
Notes 9/17: @jsstevenson - can you provide the gs:// path to the vcf(s) you are testing with?
https://github.com/ga4gh/va-spec/
https://github.com/broadinstitute/gnomad_methods - a search exists here
https://github.com/genomicmedlab/gregor - james' work (ignore for now - experimental)
@bwalsh TODO google storage api + tabix: skip to offset
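On the "skip to offset" TODO: one possible reading (an assumption, not something settled in this thread) is to take a virtual file offset from the tabix index and fetch only the BGZF block it points at with an HTTP range request. A BGZF virtual offset packs the start of the compressed block into its upper 48 bits and the position inside the decompressed block into its lower 16 bits. The helper names below are hypothetical, and the sketch only does the offset arithmetic; it does not call the storage API.

```shell
# Byte offset of the compressed BGZF block within the .vcf.gz (upper 48 bits).
voffset_block_start() {
  echo "$(( $1 >> 16 ))"
}

# Byte offset inside that block once it is decompressed (lower 16 bits).
voffset_within_block() {
  echo "$(( $1 & 65535 ))"
}

# HTTP Range header value covering one BGZF block starting at the given byte.
# A BGZF block is at most 64 KiB, so 65536 bytes is always enough.
block_range() {
  echo "bytes=$1-$(( $1 + 65535 ))"
}

# Example with an arbitrary virtual offset (not taken from a real index).
coffset="$(voffset_block_start 6553700)"
uoffset="$(voffset_within_block 6553700)"
echo "coffset=${coffset} uoffset=${uoffset} range=$(block_range "$coffset")"
```

The resulting range string would go into a `Range:` header on an authenticated object request; how the index itself gets read remotely is left open here.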
Assumption: schema clarifications for CAF and others to be forthcoming from AlexW and GA4GH discussion
Re. remote indexing, the following works in the AnVIL env
# set this to the remote vcf you have access to
export MY_OBJECT=gs://xxxxxx.vcf.gz
# get the auth token, tabix reads from GCS_OAUTH_TOKEN
export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
# read the remote object, validate we can list headers
tabix -H "$MY_OBJECT" | grep -q '#CHROM' && echo 'remote access worked' || echo 'remote access failed'
# >> remote access worked
# assuming MY_OBJECT points at chrY, let's get alleles in the SRY gene
# see https://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000184895;r=Y:2786855-2787682;t=ENST00000383070
tabix -p vcf "$MY_OBJECT" chrY:2,786,855-2,787,682 | wc -l
# >> 2
# assuming MY_OBJECT points at chr17, let's get alleles in the BRCA1 gene
# see https://useast.ensembl.org/Homo_sapiens/Location/View?db=core;g=ENSG00000012048;r=17:43044295-43170245
tabix -p vcf "$MY_OBJECT" chr17:43,044,295-43,170,245 | wc -l
# >> 2592
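The two region queries above follow the same pattern, so they could be wrapped in a small helper. This is a sketch that has not been run against AnVIL: `normalize_region` and `count_region` are hypothetical names, and it assumes `MY_OBJECT` and `GCS_OAUTH_TOKEN` are already exported as shown. Stripping the commas is optional (the queries above ran with them in place) but keeps the region string unambiguous if it is reused by other tools.

```shell
# Remove thousands separators: chr17:43,044,295-43,170,245 -> chr17:43044295-43170245
normalize_region() {
  printf '%s\n' "$1" | tr -d ','
}

# Count VCF records overlapping a region in the remote object.
# Requires tabix on PATH, plus MY_OBJECT and GCS_OAUTH_TOKEN in the environment.
count_region() {
  tabix "$MY_OBJECT" "$(normalize_region "$1")" | wc -l
}

normalize_region 'chr17:43,044,295-43,170,245'
# count_region 'chr17:43,044,295-43,170,245'   # needs remote access, so left commented out
```

Note that `wc -l` counts VCF records in the region, so a multi-allelic site counts once.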
GREGoR next steps
Use Case:
Test Driven Development:
Fixtures:
Methods:
vcf2phenotype
vcf2caf
caf-search
Method Construction
Acceptance: