Closed matthdsm closed 6 years ago
Correction, the missing column only occurs in db's created by bcbio!
Ping @chapmanb. Do you have any idea where this could occur? When manually creating a db from the decomposed vcf in the final
dir, all data is present, but the pregenerated db's are missing the canonical column...
M
could be that bcbio needs to update either vcf2db or geneimpacts ?
Hi Brent,
Thanks for the reply. I don't think so, bcbio has the latest versions available in conda...
# Name Version Build Channel
vcf2db 2018.01.23 py27_0 bioconda
geneimpacts 0.3.4 py27_0 bioconda
M
Matthias and Brent;
Sorry about the issue and thanks for cc'ing me into the conversation. I'm not sure why you'd see a different between bcbio vcf2db.py
runs and those done by hand. bcbio isn't doing much fancy here beyond calling the command:
If you look in log/bcbio-nextgen-commands.log
, is the same command being run that you're testing outside? Are there any other differences that could help us explore what is happening? Thanks for the help debugging.
Hi Brent, Brad.
I did some more digging, and I think the error is coming from the geneimpacts module. Our HPC cloud has:
#packages
-bash-4.2$ conda list vcf2db
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name Version Build Channel
vcf2db 2018.01.23 py27_0 bioconda
-bash-4.2$ conda list geneimpacts
# packages in environment at /user/data/gent/gvo000/gvo00082/vsc41443/bcbio/anaconda:
#
# Name Version Build Channel
geneimpacts 0.3.4 py27_0 bioconda
This is the setup that does NOT throw the annotation errors, but lacks the canonical
column.
Our local cloud has:
[login] matdsmet:final $ conda list vcf2db [8:11:44]
# packages in environment at /Shared/Software:
#
# Name Version Build Channel
vcf2db 2018.01.23 py27_0 bioconda
[login] matdsmet:final $ conda list geneimpacts [8:12:27]
# packages in environment at /Shared/Software:
#
# Name Version Build Channel
geneimpacts 0.3.1 py27_0 bioconda
Which DOES throw the annotation errors, but includes all columns.
Hope this helps to pinpoint the bug.
Cheers M
Looking at the geneimpacts repo, it probably has something to do with this PR https://github.com/brentp/geneimpacts/pull/13
Just tested and can confirm, the missing column only occurs with geneimpacts 0.3.4. I ran a test with 0.3.3, which fixed all errors.
Cheers M
cc'ing in Rory (@roryk) on this since it was his PR. Rory, do you know what might be going on here?
Sorry about that folks and thanks for tracking down the problem, I'll take a look.
Thanks, for the reproducible example, I'm fixing this now.
This seems to work fine with geneimpacts 0.3.4 and vcf2db 2017.12.11 and 2018.01.23:
vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
/home/rdk4/local/share/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:219: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
(util.ellipses_string(value),))
3057 variant_impacts:43730 effects time: 13.7 chunk time:26.3 116.26 variants/second
indexing ... finished in 0.6 seconds...
total time: in 27.4 seconds...
>>> import geneimpacts
>>> geneimpacts.__version__
'0.3.4'
>>>
conda list vcf2db
vcf2db 2017.12.11 py27_0 bioconda
Same result with:
vcf2db 2018.01.23 py27_0 bioconda
Oh I see, it doesn't throw the errors with 0.3.4 but is missing the canonical column. I can read.
AFAICT, It looks like vcf2db isn't using the CANONICAL column at all. In that VCF file there are 1756 variants labelled as CANONICAL:
gzip -cd sample1_chr1.vcf.gz | grep -v '^#' | cut -f26 -d'|' | sort | uniq -c
1301
1756 YES
I opened two pull requests, one to have geneimpacts set is_canonical
instead of canonical
(https://github.com/brentp/geneimpacts/pull/14) and one to have vcf2db.py
set the is_canonical
column: (https://github.com/quinlan-lab/vcf2db/pull/43) for VEP-annotated variants.
Loading the database like this:
python /n/app/bcbio/dev/rory-dev/vcf2db/vcf2db.py sample1_chr1.vcf.gz ped.ped sample1.db
gemini query -q "select is_canonical from variants" sample1.db | sort | uniq -c
1173 0
1884 1
I'm not super sure where the discrepancy is coming from.
not sure I understand how you go more canonicals in variants that you do in the VCF.
but, I don't see where you are telling the geneimpacts
classes to prioritize_canonical
.
for your first command, counting in the vcf, I would use:
bcftools query -f "%INFO/CSQ" $vcf | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }'
or something like that to make sure you're not getting missing columns.
Thanks, Brent, looks similar:
bcftools query -f %INFO/CSQ"\n" sample1_chr1.vcf.gz | awk 'BEGIN{FS="|"}(NF > 25) { print $26 }' | sort | uniq -c
1301
1756 YES
I was trying to fix the issue where the canonical status isn't populated to the variants table, even though it is set. So variants aren't necessarily prioritized by canonical status, but the status doesn't show up in the table anyway.
what does your vcf header for CSQ look like?
Heya, this is the VCF linked here from further up in the issue:
https://github.com/quinlan-lab/vcf2db/files/1941507/sample1_chr1.vcf.gz
Duh, I see-- some of the higher impact variants are also canonical, which affects what is loaded in the main table. I pushed another change to vcf2db.py to include a prioritize_canonical
flag to actually use the canonical prioritization when loading up the database, so now there are two changes. If canonical is set, add it as a field to the main variants table. If it is set, and prioritize_canonical is set, prefer canonical variants above all else. Feel free to thumbs down all of these changes, I just rabbitholed on the original issue.
also, I think your is_canonical
should be:
@property
def is_canonical(self):
return self.effects.get("CANONICAL", "") != ""
so that it is forced to return a boolean. otherwise, I get an error because it's using "YES"
Hi Brent,
Sorry to bother you again, but this issue still isn't fixed. I get the same error using the latest version of vcf2db/geneimpacts.
Are there any logs of whatever I could provide to help debugging? M
@matthdsm can you help me debug. I did this:
brentp@lorax:~/src/vcf2db$ python -c "import geneimpacts; print geneimpacts.__version__"
0.3.5
brentp@lorax:~/src/vcf2db$ python vcf2db.py sample1_chr1.vcf.gz sample1.ped x.db
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'fam1'. (this warning may be suppressed after 10 occurrences)
(util.ellipses_string(value),))
/data/gemini_install/data/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
(util.ellipses_string(value),))
3057 variant_impacts:43730 effects time: 13.3 chunk time:24.9 122.76 variants/second
indexing ... finished in 0.1 seconds...
total time: in 25.4 seconds...
brentp@lorax:~/src/vcf2db$ echo ".schema variants" | sqlite3 x.db | grep canon
is_canonical BOOLEAN,
CHECK (is_canonical IN (0, 1)),
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variants;" | sqlite3 x.db | sort | uniq -c
1173 0
1884 1
brentp@lorax:~/src/vcf2db$ echo "select is_canonical from variant_impacts;" | sqlite3 x.db | sort | uniq -c
35630 0
8100 1
brentp@lorax:~/src/vcf2db$ python count-canon.py sample1_chr1.vcf.gz
[35630, 8100]
So count-canon.py
shows that what's in the database matches what's in the VCF. I'm probably missing something obvious, but I don't know what is the problem you're encountering.
here is count-canon.py:
from cyvcf2 import VCF
import sys
ch = " Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|REFSEQ_MATCH|SOURCE|GIVEN_REF|USED_REF|BAM_EDIT|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|SpliceRegion".split("|")
canons = [0, 0]
for v in VCF(sys.argv[1]):
csq = v.INFO.get("CSQ")
if csq is None:
canons[0] += 1
else:
for c in csq.split(","):
d = dict(zip(ch, c.split("|")))
canons[int(d.get("CANONICAL") == "YES")] += 1
print canons
Hi Brent,
I got the following error:
Traceback (most recent call last):
File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 920, in <module>
impacts_extras=a.impacts_field, aok=a.a_ok)
File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 232, in __init__
self.load()
File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 317, in load
i = self._load(self.cache, create=True, start=1)
File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 310, in _load
self.insert(variants, expanded, keys, i, create=create)
File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 372, in insert
vilengths, variant_impacts)
File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 400, in _insert
self.__insert(v_objs, self.metadata.tables['variants'].insert())
File "/Users/matdsmet/miniconda2/bin/vcf2db.py", line 434, in __insert
raise e
sqlalchemy.exc.StatementError: (exceptions.TypeError) Not a boolean value:
This error occurs when using the bioconda version of geneimpacts
. I noticed this version doesn't include @roryk's latest commits.
Would it work for you to update the latest release to include that commit?
When using a manual install of geneimpacts
, this error doesn't occur AND the canonical col is fixed.
I do notice that the output of the SQL query and count-canon.py
don't match..
[10:04:15] matdsmet:HSQ_136 $ echo "select is_canonical from variants;" | sqlite3 D1711586-joint-gatk-haplotype.db | sort | uniq -c
10974 0
21276 1
[10:04:30] matdsmet:HSQ_136 $ echo "select is_canonical from variant_impacts;" | sqlite3 D1711586-joint-gatk-haplotype.db | sort | uniq -c
401535 0
89323 1
[10:05:26] matdsmet:HSQ_136 $ python count-canon.py D1711586-joint-gatk-haplotype-joint-annotated-decomposed.vcf.gz
[490858, 0]
Thanks for the help M
For reference:
[10:10:13] matdsmet:HSQ_136 $ sqlite3 --version
3.23.1 2018-04-10 17:39:29 4bb2294022060e61de7da5c227a69ccd846ba330e31626ebcd59a94efd148b3b
[10:10:19] matdsmet:HSQ_136 $ python --version
Python 2.7.14 :: Anaconda, Inc.
[10:11:26] matdsmet:HSQ_136 $ python -c "import cyvcf2; print cyvcf2.__version__"
0.8.4
Matthias; Thanks for the catch. It looks like the latest release (0.3.5) that's in bioconda is missing the last two commits (it's tagged on afa1af3 https://github.com/brentp/geneimpacts/commits/master), which I guess is providing the fix. Brent, if that makes sense with you, could you be willing to tag a 0.3.6 including those changes and I'll update the bioconda recipe. Thanks again for the help debugging this.
I bumped and tagged v0.3.6. thanks guys.
Nice, thank you Brent. I bumped the bioconda package with this update. Matthias -- hope this gets everything working cleanly for you. Thanks again for the help debugging.
Hi Brent,
I got the following error using vcf2db:
It seems vcf2db has a problem with some data inside the CSQ tag from VEP. I also noticed not all data from the CSQ tag makes it into the db. Especially annoying for us is that for some reason, the
canonical
column is nowhere to be found, even though it's in the VCF.I've included a sample vcf, so you might be able to replicate the error. The vcf was created using bcbio v1.0.9
Cheers and thanks for the help M