generate clinvar_alleles_with_gnomad table(s)

macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.


generate clinvar_alleles_with_gnomad table(s) #30

Closed bw2 closed 7 years ago

bw2 commented 7 years ago

This repo currently contains only a clinvar_alleles_with_exac_v1.single.b37.tsv.gz table, but we could generate similar tables for the new gnomAD exomes and genomes datasets (http://gnomad.broadinstitute.org/downloads). If anyone's interested in having such tables, or has preferences on table columns/format, please let me know.

kristjaneerik commented 7 years ago

FYI: I've been working with the new gnomAD release and I had to update https://github.com/macarthur-lab/clinvar/blob/master/src/add_exac_fields.py#L10 since gnomad.exomes.r2.0.1.sites.vcf.gz has AC, AN, AC_raw, AN_raw rather than AC_Adj, AN_Adj, AC, AN (there may be other column changes, those are just the ones I looked at). Might be tricky to keep compatibility between ExAC and gnomAD, though I don't immediately see a reason why older ExAC data should be used now that gnomAD is out. Otherwise everything was fine. I haven't had a chance to grab the genomes VCFs to look at those.
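One way to absorb the rename without forking the parsing code is to look the INFO keys up through a per-dataset table. The following is only a sketch, not the code in add_exac_fields.py: the helper names and the dataset-independent keys (`ac_adj`, `ac_raw`, ...) are made up for illustration, and the ExAC-to-gnomAD correspondence is assumed from the order in which the fields are listed above.

```python
# Hypothetical helpers, not the code in add_exac_fields.py.
# Assumed correspondence (taken from the order the fields are listed in above):
#   ExAC AC_Adj -> gnomAD AC       ExAC AN_Adj -> gnomAD AN
#   ExAC AC     -> gnomAD AC_raw   ExAC AN     -> gnomAD AN_raw
FIELD_NAMES = {
    "exac": {"ac_adj": "AC_Adj", "an_adj": "AN_Adj", "ac_raw": "AC", "an_raw": "AN"},
    "gnomad": {"ac_adj": "AC", "an_adj": "AN", "ac_raw": "AC_raw", "an_raw": "AN_raw"},
}


def parse_info(info_string):
    """Split a VCF INFO column into a dict; flag-only entries map to True."""
    info = {}
    for entry in info_string.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def allele_counts(info_string, dataset):
    """Return the count fields under dataset-independent keys.

    Values are left as the raw strings from the VCF (they can be
    comma-separated at multi-allelic sites); missing fields become None.
    """
    info = parse_info(info_string)
    return {common: info.get(vcf_key) for common, vcf_key in FIELD_NAMES[dataset].items()}


if __name__ == "__main__":
    print(allele_counts("AC=3;AN=100;AC_raw=5;AN_raw=120", "gnomad"))
    print(allele_counts("AC=5;AN=120;AC_Adj=3;AN_Adj=100", "exac"))
```

With a table like this, downstream code only ever sees the common keys, so supporting another release would mean adding one more entry to `FIELD_NAMES`.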

(Somewhat related: it'd be nice to be able to specify an output prefix when running master.py, to keep ClinVar, ExAC, gnomAD, etc. versions nicely separated. I can make a PR.)

bw2 commented 7 years ago

@kristjaneerik True, I also don't see any reason to keep an ExAC-v1 table. Could you say more about the extra prefix and how current filenames+directories aren't sufficient? Also, there are a few other things that changed in the VCF and might affect the parsing code (described in https://macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/ )

bw2 commented 7 years ago

I added the gnomAD tables for the March release.

kristjaneerik commented 7 years ago

@bw2 Thanks for the update! Looks like you only pushed the code this time though, and not the updated tables.

For running master.py, did you use the file from gs://gnomad-public/release-170228/vcf/genomes/gnomad.genomes.r2.0.1.sites.coding.autosomes.vcf.gz for -GG or some concatenation of the chromosome-specific VCFs, i.e. including non-coding sites?

Re: output prefix: it's a bit better now that there's an output directory, but I've been working with various ClinVar datasets and it was a bit of a hassle to move and rename the generated files. I was thinking of adding a flag, e.g. --output-prefix, that defaults to ../output/ but that people could set. I'm working on a PR to fix some things about the new *_ordered fields and I think I'll incorporate it there; it's only a two-line change.
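For illustration, the flag could look roughly like the sketch below. This is not the actual master.py: only the `--output-prefix` name and its `../output/` default come from the comment above, while `build_parser`, `output_path`, and the example filename are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for master.py's argument parsing."""
    parser = argparse.ArgumentParser(description="Sketch of an --output-prefix option")
    parser.add_argument(
        "--output-prefix",
        default="../output/",
        help="string prepended to every generated file name (default: %(default)s)",
    )
    return parser


def output_path(prefix, filename):
    """Join prefix and file name by plain concatenation.

    Concatenation (rather than os.path.join) lets the prefix be either a
    directory ("../output/") or a directory plus a name stem ("runs/gnomad_").
    """
    return prefix + filename


if __name__ == "__main__":
    args = build_parser().parse_args()
    # "clinvar_alleles.tsv.gz" is a placeholder file name for this example.
    print(output_path(args.output_prefix, "clinvar_alleles.tsv.gz"))
```

Running it with `--output-prefix runs/2017-03/gnomad_` would write to `runs/2017-03/gnomad_clinvar_alleles.tsv.gz`, while omitting the flag keeps the current `../output/` location.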