Feature requests for VCF2DB compatible with GEMINI built-in tools

Phillip-a-richmond commented 6 years ago

Hello, I would like to request that the tables used by GEMINI's built-in analysis tools be added into VCF2DB.

Ideally, all tables that are default loaded with the command:

$ gemini load

that are not inherently third party annotations, would be added into the resulting database.

Examples I have run into so far: depths, sample_genotype_counts

More complex features that would be ideal for our pipeline, but for which we may need to fall back on the standard gemini load: gene_summary, pathways, and gene detailed analyses

Thanks, Phil

brentp commented 6 years ago

I have added the sample_genotype_counts table. I am not sure what you mean by depths table. I don't intend to add the gene tables to vcf2db, but I could be convinced to change my mind given a reasonable use-case.

Phillip-a-richmond commented 6 years ago

Thanks Brent, I'll pull this version and test the GEMINI built-in functions.

Essentially, I am striving to produce a fully functional single variant database for identifying pathogenic variants underlying rare Mendelian genetic diseases (and the Quinlan lab tools are excellent for this). The problem with only using gemini load is the lack of flexibility in the annotations available (e.g. TrAP, FATHMM-XF, in-house variant databases, monthly updated ClinVar VCFs). Annotating after the fact with gemini annotate is prohibitively slow for large annotation databases (especially genome-wide databases) across WGS variant datasets.

VCFAnno+VCF2DB provides flexibility in this respect, and it's fast. However, a vcf2db database lacks some of the tables that gemini load creates by default, like the one you fixed above, and this causes the GEMINI built-in functions to fail. The ideal workflow would go from VCF-->DB and then use GEMINI to query the DB for inheritance patterns, runs of homozygosity, variants within specific gene sets, and harmony between noncoding and coding variants.

Thanks for your help on this, Phil

brentp commented 6 years ago

could you enumerate what is missing for you so I can prioritize?

naumenko-sa commented 6 years ago

Hi Brent!

The difference between gemini load and vcf2db (loaded in bcbio 1.0.7 with vcfanno: [gemini] and by default):

variants table:
pfam_domain = domains?
aaf_gnomad_all = gnomad_af
gnomad_num_het = absent, possible to add?
gnomad_num_hom = absent, possible to add?
cadd_scaled = absent, possible to add?
vep_hgvsc = hgvsc
vep_hgvsp = hgvsp
aaf_esp_aa = af_esp_aa
aaf_esp_ea = af_esp_ea
aaf_esp_all = af_esp_all
is_conserved = absent, possible to add?

variant_impacts table:
vep_canonical = canonical
vep_ccds = ccds
vep_hgvsc = hgvsc
vep_hgvsp = hgvsp
vep_maxentscan_diff = maxentscan_diff
vep_maxentscan_alt = maxentscan_alt
vep_maxentscan_ref = maxentscan_ref
vep_spliceregion = spliceregion

Is there a way for downstream scripts to get the creator of gemini.db (gemini load or vcf2db) to apply different processing logic?
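One way a downstream script might tell the two apart is to look at which column names the variants table actually has. The sketch below is only a heuristic under stated assumptions: it treats the left-hand names in the comparison above as the gemini load names and the right-hand names as the vcf2db names, and the vcf2db names depend on the vcfanno configuration in use. The function names and the choice of signature columns are hypothetical, not part of either tool.

```python
import sqlite3

# Column names taken from the comparison above. Treating them as a
# signature of the loader is an assumption, not a documented contract.
GEMINI_LOAD_ONLY = {"aaf_gnomad_all", "aaf_esp_all", "vep_hgvsc"}
VCF2DB_ONLY = {"gnomad_af", "af_esp_all", "hgvsc"}


def variants_columns(conn):
    """Return the set of column names in the variants table."""
    rows = conn.execute("PRAGMA table_info(variants)").fetchall()
    return {row[1] for row in rows}  # row[1] holds the column name


def guess_loader(conn):
    """Guess which tool built the database from its column names."""
    cols = variants_columns(conn)
    if cols & GEMINI_LOAD_ONLY:
        return "gemini load"
    if cols & VCF2DB_ONLY:
        return "vcf2db"
    return "unknown"


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real database file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE variants (chrom TEXT, gnomad_af REAL, hgvsc TEXT)")
    print(guess_loader(conn))
```

With a real file, `sqlite3.connect("gemini.db")` would replace the in-memory connection. A renamed vcfanno field would defeat the heuristic, so the result is best treated as a hint.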

Is it possible to add gnomad_num_hemi? https://groups.google.com/forum/#!topic/gemini-variation/knRmriYXDW4

Thanks! Sergey

brentp commented 6 years ago

but these are things that you have control over, correct? in most cases, vcf2db.py just pulls what's present in the INFO field. You can change the vcfanno conf if you want different names. Am I missing something?

naumenko-sa commented 6 years ago

Thanks Brent! Yes, you are right, it is not an issue with vcfanno/vcf2db; it is a matter of how bcbio wraps the annotation. SN

Phillip-a-richmond commented 6 years ago

Pulled on January 22nd 2018.

Tested and confirmed to work:

gemini autosomal_dominant
gemini autosomal_recessive
gemini comp_hets
gemini de_novo
gemini db_info
gemini query
gemini x_linked_de_novo
gemini x_linked_dominant
gemini x_linked_recessive
gemini burden
gemini region
gemini stats
gemini lof_sieve
gemini mendel_errors

Tested and failed:

gemini roh

Details: "Depths" table, as referenced from gemini roh.

Example:

$ gemini roh T008.db
LOG: Querying and ordering variants by chromosomal position.
SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants WHERE type = 'snp' AND filter is NULL AND depth >= 20 ORDER BY chrom, end"]
SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants WHERE type = 'snp' AND filter is NULL AND depth >= 20 ORDER BY chrom, end"]
Traceback (most recent call last):
  File "/opt/tools/gemini/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1136, in homozygosity_runs_fn
    run(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 215, in run
    get_homozygosity_runs(args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 162, in get_homozygosity_runs
    gq.run(query, needs_genotypes=True)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 653, in run
    self.result_proxy = res = iter(self._apply_query())
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 906, in _apply_query
    res = self._execute_query()
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 883, in _execute_query
    raise ValueError("The query issued (%s) has a syntax error." % self.query)
ValueError: The query issued (select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants WHERE type = 'snp' AND filter is NULL AND depth >= 20 ORDER BY chrom, end) has a syntax error.

gemini pathways

$ gemini pathways --lof -v 71 T008.db
chrom start end ref alt impact sample genotype gene transcript pathway
Traceback (most recent call last):
  File "/opt/tools/gemini/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 768, in pathway_fn
    tool_pathways.pathways(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_pathways.py", line 155, in pathways
    get_ind_lof_pathways(conn, metadata, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_pathways.py", line 143, in get_ind_lof_pathways
    _report_variant_pathways(res, args, idx_to_sample)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_pathways.py", line 103, in _report_variant_pathways
    pathlist]))
TypeError: sequence item 5: expected string or Unicode, NoneType found
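The TypeError above is the kind Python raises when a string join meets a None in the sequence it is joining. The sketch below reproduces that class of failure and shows one possible guard. It is not GEMINI's code: the row values and the `format_row` helper are made up for illustration, and it runs on Python 3, where the message wording differs slightly from the Python 2.7 message in the log. If the joined fields follow the header order, item 5 would be the impact column, though that is only a guess.

```python
def format_row(fields, missing="None"):
    """Join fields with tabs, substituting a placeholder for None."""
    return "\t".join(missing if f is None else str(f) for f in fields)


if __name__ == "__main__":
    # Hypothetical row in header order; index 5 is deliberately None.
    row = ["chr1", "100", "101", "A", "G", None,
           "sample1", "A/G", "GENE1", "TX1", "pathway1"]
    try:
        "\t".join(row)  # unguarded join, fails on the None at index 5
    except TypeError as exc:
        print("unguarded join fails:", exc)
    print(format_row(row))
```

Whether a placeholder is the right fix depends on why the value is missing in the first place; a database that lacks the expected annotation may call for a different remedy than a formatting guard.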

For our application, the priority would be fixing gemini roh. The pathways-based analysis is a very low priority for us at this time.

Thanks, Phil

robinvanderlee commented 6 years ago

Hi,

Following up on @Phillip-a-richmond's last comment, I tried running gemini roh on a gemini database produced with vcf2db.

I am getting the same errors, indicating that the depth column is missing:

$ gemini roh gemini_db_produced_by_vcf2db.db
LOG: Querying and ordering variants by chromosomal position.
SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants               WHERE type = 'snp'               AND   filter is NULL               AND   depth >= 20 ORDER BY chrom,  end"]
SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants               WHERE type = 'snp'               AND   filter is NULL               AND   depth >= 20 ORDER BY chrom,  end"]
Traceback (most recent call last):
  File "/opt/tools/gemini/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1136, in homozygosity_runs_fn
    run(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 215, in run
    get_homozygosity_runs(args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 162, in get_homozygosity_runs
    gq.run(query, needs_genotypes=True)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 653, in run
    self.result_proxy = res = iter(self._apply_query())
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 906, in _apply_query
    res = self._execute_query()
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 883, in _execute_query
    raise ValueError("The query issued (%s) has a syntax error." % self.query)
ValueError: The query issued (select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants               WHERE type = 'snp'               AND   filter is NULL               AND   depth >= 20 ORDER BY chrom,  end) has a syntax error.

I think perhaps some of the previous confusion stemmed from calling depth a table, whereas it seems to be a column. Would it be possible to add the depth column to the list of annotations that vcf2db builds into the gemini db?
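A quick way to check a given database before running gemini roh is to send SQLite a probe query that uses the same WHERE clause as the failing query in the log. This is a minimal sketch: the `roh_probe_error` helper is hypothetical, and the probe only confirms that the referenced columns exist, not that their contents are what gemini roh expects.

```python
import sqlite3

# WHERE clause copied from the failing query in the log above.
# LIMIT 0 means no rows are read; SQLite still resolves the column names.
ROH_PROBE = (
    "SELECT 1 FROM variants "
    "WHERE type = 'snp' AND filter IS NULL AND depth >= 20 LIMIT 0"
)


def roh_probe_error(conn):
    """Return None if the probe query is accepted, else the SQLite error text."""
    try:
        conn.execute(ROH_PROBE)
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # In-memory stand-in for a database without a depth column.
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE variants (chrom TEXT, type TEXT, "filter" TEXT)')
    print(roh_probe_error(conn))
```

For a real file, `sqlite3.connect("gemini_db_produced_by_vcf2db.db")` would replace the in-memory connection.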

Thanks for all the hard work on these tools! Robin

quinlan-lab / vcf2db