quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License
55 stars 13 forks source link

vcf2db cli error #42

Closed matthdsm closed 6 years ago

matthdsm commented 6 years ago

Hi Brent,

I got the following error using vcf2db:

[login] matdsmet:2018-04-21_NSQ_147 $ vcf2db.py D1514352-joint-gatk-haplotype-joint-annotated-decomposed.vcf.gz D1514352.ped D1514352-joint-gatk-haplotype-joint.db                                      [9:33:35]
/Shared/Software/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'D1514352-joint'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/Shared/Software/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
WARNING: unknown severity for '-|frameshift_variant&start_lost&start_retained_variant|HIGH|HRNR|388697|Transcript|NM_001009931.2|protein_coding|2/3||NM_001009931.2:c.1del|NP_001009931.1:p.Met1?|80/9632|1/8553|1/2850|M/X|Atg/tg|rs34061715&COSM111478|1||-1||deletion|EntrezGene|HGNC:20846|YES||||NP_001009931.1||||rseq_mrna_match|RefSeq|T|T||||||||0.7337|0.8818|0.9544|0.9592|0.8875|||0.9028|0.7227|0.8276|0.9554|0.9063|0.9541|0.9411|0.9142|0.9069|0.9592|EUR||0&1|0&1||||||HC|||||||'. using LOW for [u'frameshift_variant', u'start_lost', u'start_retained_variant']
Please report this on github with the effect-string above

It seems vcf2db has a problem with some data inside the CSQ tag from VEP. I also noticed not all data from the CSQ tag makes it into the db. Especially annoying for us is that for some reason, the canonical column is nowhere to be found, even though it's in the VCF.

I've included a sample vcf, so you might be able to replicate the error. The vcf was created using bcbio v1.0.9

Cheers and thanks for the help M

matthdsm commented 6 years ago

Correction, the missing column only occurs in db's created by bcbio! Ping @chapmanb. Do you have any idea where this could occur? When manually creating a db from the decomposed vcf in the final dir, all data is present, but the pregenerated db's are missing the canonical column...

M

brentp commented 6 years ago

could be that bcbio needs to update either vcf2db or geneimpacts ?

matthdsm commented 6 years ago

Hi Brent,

Thanks for the reply. I don't think so, bcbio has the latest versions available in conda...

# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
geneimpacts               0.3.4                    py27_0    bioconda

M

chapmanb commented 6 years ago

Matthias and Brent; Sorry about the issue and thanks for cc'ing me into the conversation. I'm not sure why you'd see a different between bcbio vcf2db.py runs and those done by hand. bcbio isn't doing much fancy here beyond calling the command:

https://github.com/bcbio/bcbio-nextgen/blob/1083b6768511c20c5d006b0a49758fc6d3bb2167/bcbio/variation/population.py#L67

If you look in log/bcbio-nextgen-commands.log, is the same command being run that you're testing outside? Are there any other differences that could help us explore what is happening? Thanks for the help debugging.

matthdsm commented 6 years ago

Hi Brent, Brad.

I did some more digging, and I think the error is coming from the geneimpacts module. Our HPC cloud has:

#packages
-bash-4.2$ conda list vcf2db
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
-bash-4.2$ conda list geneimpacts
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name                    Version                   Build  Channel
geneimpacts               0.3.4                    py27_0    bioconda

This is the setup that does NOT throw the annotation errors, but lacks the canonicalcolumn.

Our local cloud has:

[login] matdsmet:final $ conda list vcf2db                                                                                                                                                               [8:11:44]
 # packages in environment at /Shared/Software:
#
# Name                    Version                   Build  Channel
vcf2db                    2018.01.23               py27_0    bioconda
[login] matdsmet:final $  conda list geneimpacts                                                                                                                                                         [8:12:27]
# packages in environment at /Shared/Software:
#
# Name                    Version                   Build  Channel
geneimpacts               0.3.1                    py27_0    bioconda

Which DOES throw the annotation errors, but includes all columns.

Hope this helps to pinpoint the bug.

Cheers M

matthdsm commented 6 years ago

Looking at the geneimpacts repo, it probably has something to do with this PR https://github.com/brentp/geneimpacts/pull/13

matthdsm commented 6 years ago

Just tested and can confirm, the missing column only occurs with geneimpacts 0.3.4. I ran a test with 0.3.3, which fixed all errors.

Cheers M

chapmanb commented 6 years ago

cc'ing in Rory (@roryk) on this since it was his PR. Rory, do you know what might be going on here?

roryk commented 6 years ago

Sorry about that folks and thanks for tracking down the problem, I'll take a look.

roryk commented 6 years ago

Thanks, for the reproducible example, I'm fixing this now.

roryk commented 6 years ago

This seems to work fine with geneimpacts 0.3.4 and vcf2db 2017.12.11 and 2018.01.23:

vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
/home/rdk4/local/share/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:219: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
3057 variant_impacts:43730  effects time: 13.7  chunk time:26.3 116.26 variants/second
indexing ... finished in 0.6 seconds...
total time: in 27.4 seconds...
>>> import geneimpacts
>>> geneimpacts.__version__
'0.3.4'
>>>
conda list vcf2db
vcf2db                    2017.12.11               py27_0    bioconda

Same result with:

vcf2db                    2018.01.23               py27_0    bioconda
roryk commented 6 years ago

Oh I see, it doesn't throw the errors with 0.3.4 but is missing the canonical column. I can read.

roryk commented 6 years ago

AFAICT, It looks like vcf2db isn't using the CANONICAL column at all. In that VCF file there are 1756 variants labelled as CANONICAL:

gzip -cd sample1_chr1.vcf.gz | grep -v '^#' | cut -f26 -d'|'  | sort | uniq -c
   1301
   1756 YES

I opened two pull requests, one to have geneimpacts set is_canonical instead of canonical (https://github.com/brentp/geneimpacts/pull/14) and one to have vcf2db.py set the is_canonical column: (https://github.com/quinlan-lab/vcf2db/pull/43) for VEP-annotated variants.

Loading the database like this:

python /n/app/bcbio/dev/rory-dev/vcf2db/vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
gemini query -q "select is_canonical from variants" sample1.db | sort | uniq -c
   1173 0
   1884 1

I'm not super sure where the discrepancy is coming from.

brentp commented 6 years ago

not sure I understand how you go more canonicals in variants that you do in the VCF.

but, I don't see where you are telling the geneimpacts classes to prioritize_canonical.

brentp commented 6 years ago

for your first command, counting in the vcf, I would use:

bcftools query -f "%INFO/CSQ" $vcf | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }'

or something like that to make sure you're not getting missing columns.

roryk commented 6 years ago

Thanks, Brent, looks similar:

bcftools query -f %INFO/CSQ"\n" sample1_chr1.vcf.gz | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }' | sort | uniq -c
   1301
   1756 YES

I was trying to fix the issue where the canonical status isn't populated to the variants table, even though it is set. So variants aren't necessarily prioritized by canonical status, but the status doesn't show up in the table anyway.

brentp commented 6 years ago

what does your vcf header for CSQ look like?

roryk commented 6 years ago

Heya, this is the VCF linked here from further up in the issue:

https://github.com/quinlan-lab/vcf2db/files/1941507/sample1_chr1.vcf.gz
roryk commented 6 years ago

Duh, I see-- some of the higher impact variants are also canonical, which affects what is loaded in the main table. I pushed another change to vcf2db.py to include a prioritize_canonical flag to actually use the canonical prioritization when loading up the database, so now there are two changes. If canonical is set, add it as a field to the main variants table. If it is set, and prioritize_canonical is set, prefer canonical variants above all else. Feel free to thumbs down all of these changes, I just rabbitholed on the original issue.

brentp commented 6 years ago

also, I think your is_canonical should be:

      @property
      def is_canonical(self):                                                                                                                                                   
         return self.effects.get("CANONICAL", "") != "" 

so that it is forced to return a boolean. otherwise, I get an error because it's using "YES"

matthdsm commented 6 years ago

Hi Brent,

Sorry to bother you again, but this issue still isn't fixed. I get the same error using the latest version of vcf2db/geneimpacts.

Are there any logs of whatever I could provide to help debugging? M

brentp commented 6 years ago

@matthdsm can you help me debug. I did this:

brentp@lorax:~/src/vcf2db$ python -c "import geneimpacts; print geneimpacts.__version__"
0.3.5
brentp@lorax:~/src/vcf2db$ python vcf2db.py sample1_chr1.vcf.gz sample1.ped x.db 
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'fam1'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
3057 variant_impacts:43730  effects time: 13.3  chunk time:24.9 122.76 variants/second
indexing ... finished in 0.1 seconds...
total time: in 25.4 seconds...
brentp@lorax:~/src/vcf2db$ echo ".schema variants" | sqlite3 x.db  | grep canon
    is_canonical BOOLEAN, 
    CHECK (is_canonical IN (0, 1)), 
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variants;" | sqlite3 x.db  | sort | uniq -c
   1173 0
   1884 1
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variant_impacts;" | sqlite3 x.db  | sort | uniq -c
 35630 0
   8100 1
brentp@lorax:~/src/vcf2db$ python count-canon.py sample1_chr1.vcf.gz
[35630, 8100]

So count-canon.py shows that what's in the database matches what's in the VCF. I'm probably missing something obvious, but I don't know what is the problem you're encountering.

here is count-canon.py:

from cyvcf2 import VCF
import sys

ch = " Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|REFSEQ_MATCH|SOURCE|GIVEN_REF|USED_REF|BAM_EDIT|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|SpliceRegion".split("|") 

canons = [0, 0]
for v in VCF(sys.argv[1]):
    csq = v.INFO.get("CSQ")
    if csq is None:
        canons[0] += 1
    else:
        for c in csq.split(","):
            d = dict(zip(ch, c.split("|")))
            canons[int(d.get("CANONICAL") == "YES")] += 1

print canons
matthdsm commented 6 years ago

Hi Brent,

I got the following error:

Traceback (most recent call last):
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 920, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 232, in __init__
    self.load()
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 317, in load
    i = self._load(self.cache, create=True, start=1)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 310, in _load
    self.insert(variants, expanded, keys, i, create=create)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 372, in insert
    vilengths, variant_impacts)
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 400, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 434, in __insert
    raise e
sqlalchemy.exc.StatementError: (exceptions.TypeError) Not a boolean value:

This error occurs when using the bioconda version of geneimpacts. I noticed this version doesn't include @roryk's latest commits. Would it work for you to update the latest release to include that commit?

When using a manual install of geneimpacts, this error doesn't occur AND the canonical col is fixed.

I do notice that the output of the SQL query and count-canon.py don't match..

[10:04:15] matdsmet:HSQ_136 $ echo "select is_canonical from variants;" | sqlite3 D1711586-joint-gatk-haplotype.db  | sort | uniq -c 
10974 0
21276 1
[10:04:30] matdsmet:HSQ_136 $ echo "select is_canonical from variant_impacts;" | sqlite3 D1711586-joint-gatk-haplotype.db  | sort | uniq -c 
401535 0
89323 1
[10:05:26] matdsmet:HSQ_136 $ python count-canon.py D1711586-joint-gatk-haplotype-joint-annotated-decomposed.vcf.gz 
[490858, 0]

Thanks for the help M

matthdsm commented 6 years ago

For reference:

[10:10:13] matdsmet:HSQ_136 $ sqlite3 --version
3.23.1 2018-04-10 17:39:29 4bb2294022060e61de7da5c227a69ccd846ba330e31626ebcd59a94efd148b3b
[10:10:19] matdsmet:HSQ_136 $ python --version
Python 2.7.14 :: Anaconda, Inc.
[10:11:26] matdsmet:HSQ_136 $ python -c "import cyvcf2; print cyvcf2.__version__"            
0.8.4
chapmanb commented 6 years ago

Matthias; Thanks for the catch. It looks like the latest release (0.3.5) that's in bioconda is missing the last two commits (it's tagged on afa1af3 https://github.com/brentp/geneimpacts/commits/master), which I guess is providing the fix. Brent, if that makes sense with you, could you be willing to tag a 0.3.6 including those changes and I'll update the bioconda recipe. Thanks again for the help debugging this.

brentp commented 6 years ago

I bumped and tagged v0.3.6. thanks guys.

chapmanb commented 6 years ago

Nice, thank you Brent. I bumped the bioconda package with this update. Matthias -- hope this gets everything working cleanly for you. Thanks again for the help debugging.