Closed (bw2 closed this issue 7 years ago)
FYI: I've been working with the new gnomAD release and I had to update https://github.com/macarthur-lab/clinvar/blob/master/src/add_exac_fields.py#L10 since `gnomad.exomes.r2.0.1.sites.vcf.gz` has `AC`, `AN`, `AC_raw`, `AN_raw` rather than `AC_Adj`, `AN_Adj`, `AC`, `AN` (there may be other column changes; those are just the ones I looked at).
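For illustration, the renaming described above could be handled with a small lookup table. This is only a sketch, not the code in `add_exac_fields.py`: the pairing (`AC_Adj` to `AC`, `AC` to `AC_raw`, and likewise for `AN`) is assumed from the order in which the fields are listed, and `EXAC_TO_GNOMAD`, `parse_info`, and `exac_style_fields` are hypothetical names.

```python
# Hypothetical mapping from ExAC INFO field names to their assumed gnomAD r2.0.1
# equivalents. The pairing is inferred from the order the fields are listed in
# the comment above, not taken from add_exac_fields.py.
EXAC_TO_GNOMAD = {
    "AC_Adj": "AC",
    "AN_Adj": "AN",
    "AC": "AC_raw",
    "AN": "AN_raw",
}


def parse_info(info_field):
    """Parse a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def exac_style_fields(gnomad_info):
    """Return gnomAD INFO values keyed by their ExAC-era names."""
    parsed = parse_info(gnomad_info)
    return {
        exac_name: parsed[gnomad_name]
        for exac_name, gnomad_name in EXAC_TO_GNOMAD.items()
        if gnomad_name in parsed
    }
```

With a table like this, downstream code that expects ExAC-style keys could keep working against a gnomAD sites VCF, at least for these four fields.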
Might be tricky to keep compatibility between ExAC and gnomAD, though I don't immediately see a reason why older ExAC data should be used now that gnomAD is out.
Otherwise everything was fine. I haven't had a chance to grab the genomes VCFs to look at those.
(Somewhat related: it'd be nice to specify an output prefix when running `master.py` to keep ClinVar, ExAC, gnomAD, etc. versions nicely separated. I can make a PR.)
@kristjaneerik True, I also don't see any reason to keep an ExAC-v1 table. Could you say more about the extra prefix and how current filenames+directories aren't sufficient? Also, there are a few other things that changed in the VCF and might affect the parsing code (described in https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/ )
I added the gnomAD tables for the March release
@bw2 Thanks for the update! Looks like you only pushed the code this time though, and not the updated tables.
For running `master.py`, did you use the file from `gs://gnomad-public/release-170228/vcf/genomes/gnomad.genomes.r2.0.1.sites.coding.autosomes.vcf.gz` for `-GG`, or some concatenation of the chromosome-specific VCFs, i.e. including non-coding sites?
Re: output prefix: it's a bit better now that there's an `output` directory, but I've been working on various ClinVar datasets and it was a bit of a hassle to move and rename the generated files. I was thinking of adding a flag, e.g. `--output-prefix`, that defaults to `../output/` but which people could set. I'm working on a PR to fix some things about the new `*_ordered` fields and I think I'll incorporate it there; it's only a two-line change.
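A rough sketch of what such a flag might look like, assuming an `argparse`-style command line (I'm not claiming this is how `master.py` parses its arguments). Only the flag name and default come from the proposal above; `build_parser` and `output_path` are hypothetical helpers.

```python
import argparse


def build_parser():
    """Build a parser with the proposed flag (hypothetical, not master.py's parser)."""
    parser = argparse.ArgumentParser(description="Sketch of an output-prefix option")
    parser.add_argument(
        "--output-prefix",
        default="../output/",
        help="Prefix prepended to every generated file name",
    )
    return parser


def output_path(prefix, filename):
    """Join prefix and file name by plain concatenation.

    Concatenation (rather than os.path.join) lets the prefix be either a
    directory such as "../output/" or a directory plus a file-name stem
    such as "runs/gnomad_r2.0.1.".
    """
    return prefix + filename
```

For example, passing `--output-prefix runs/gnomad_r2.0.1.` would place a file named `clinvar_alleles.tsv.gz` at `runs/gnomad_r2.0.1.clinvar_alleles.tsv.gz`, while omitting the flag keeps it under `../output/`.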
This repo currently contains only a `clinvar_alleles_with_exac_v1.single.b37.tsv.gz` table, but we could generate similar tables for the new gnomAD exomes and genomes datasets (http://gnomad.broadinstitute.org/downloads). If anyone's interested in having such tables, or has preferences on table columns/format, please let me know.