quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License
55 stars 13 forks source link

vcf2db error for gatk-haplotype vcf files #53

Closed ahy1221 closed 5 years ago

ahy1221 commented 5 years ago

Hi, I am using bcbio cancer germline and somatic mutation pipeline for my own exome data. The bcbio pipeline work so well until the annotation step using gemini-vcf2db. I am getting an error caused by vcf2db command:

[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/hg19.CpG.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000192.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/genetic_map_HapMapII_GRCh37.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/hg19.dgv.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000192.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/stam.125cells.dnaseI.hg19.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/wgEncodeRegTfbsClusteredV2.cell_count.20130213.bed.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/genetic_map_HapMapII_GRCh37.gz
[2018-10-27T22:26Z] bix.go:221: chromosome GL000225.1 not found in /WPS1/heyao/tools/bio/bcbio/gemini_data/stam.125cells.dnaseI.hg19.bed.gz
[2018-10-27T22:26Z] vcfanno.go:241: annotated 228965 variants in 458.57 seconds (499.3 / second)
[2018-10-27T22:26Z] tabix index D20180108_normal-germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz
[2018-10-27T22:26Z] GEMINI: create database with vcf2db
[2018-10-27T22:27Z] not in VCF: D20171215_normal,D20171109_normal,D20180116_normal,D20180110_normal
[2018-10-27T22:27Z] /WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'D20180108_normal-germline'. (this warning may be suppressed after 10 occurrences)
[2018-10-27T22:27Z]   (util.ellipses_string(value),))
[2018-10-27T22:27Z] /WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
[2018-10-27T22:27Z]   (util.ellipses_string(value),))
[2018-10-27T22:27Z] Traceback (most recent call last):
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 916, in <module>
[2018-10-27T22:27Z]     impacts_extras=a.impacts_field, aok=a.a_ok)
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 232, in __init__
[2018-10-27T22:27Z]     self.load()
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 317, in load
[2018-10-27T22:27Z]     i = self._load(self.cache, create=True, start=1)
[2018-10-27T22:27Z]   File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 280, in _load
[2018-10-27T22:27Z]     self.genotype_counts[0][gt_types == self.vcf.HOM_REF] += 1
[2018-10-27T22:27Z] IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 2
[2018-10-27T22:27Z] Uncaught exception occurred

Traceback (most recent call last):
File "/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run _do_run(cmd, checks, log_stdout, env=env)
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 103, in _do_run raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command '/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/gemini/D20180108_normal-germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/gemini/D20180108_normal-germline-gatk-haplotype-nomultiallelic.ped /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/bcbiotx/tmpfe1BUB/D20180108_normal-germline-gatk-haplotype.db
not in VCF: D20171215_normal,D20171109_normal,D20180116_normal,D20180110_normal
/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value 'D20180108_normal-germline'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
/WPS1/heyao/tools/bio/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:226: SAWarning: Unicode type received non-unicode bind param value '-9'. (this warning may be suppressed after 10 occurrences)
  (util.ellipses_string(value),))
Traceback (most recent call last):
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 916, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 232, in __init__
    self.load()
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 317, in load
    i = self._load(self.cache, create=True, start=1)
  File "/WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py", line 280, in _load
    self.genotype_counts[0][gt_types == self.vcf.HOM_REF] += 1
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 2
' returned non-zero exit status 1

The gatk-haplotype vcf file and ped file are both generated by bcbio pipeline. Thus I am not sure what is wrong with my vcf file for vcf2db. But I am trying to grep related commands about generating this vcf files

[2018-10-27T22:19Z] /WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcfanno -p 16 -lua /WPS1/heyao/tools/bio/bcbio/genomes/Hsapiens/GRCh37/config/vcfanno/gemini.lua -base-path /WPS1/heyao/tools/bio/bcbio/gemini_data /WPS1/heyao/tools/bio/bcbio/genomes/Hsapiens/GRCh37/config/vcfanno/gemini.conf /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/gemini/D20180108_normal-germline-gatk-haplotype-nomultiallelic.vcf.gz | sed -e 's/Number=A/Number=1/g' | bgzip -c > /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/bcbiotx/tmpAccAR1/D20180108_normal-germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz
[2018-10-27T22:26Z] /WPS1/heyao/tools/bio/bcbio/anaconda/bin/tabix -f -p vcf /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/bcbiotx/tmp8CnG0D/D20180108_normal-germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz
[2018-10-27T22:26Z] /WPS1/heyao/tools/bio/bcbio/anaconda/bin/vcf2db.py /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/gemini/D20180108_normal-germline-gatk-haplotype-nomultiallelic-annotated-gemini.vcf.gz /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/gemini/D20180108_normal-germline-gatk-haplotype-nomultiallelic.ped /WPSnew/heyao/project/HCC/pipeline/bcbio_exome/exome_10X_donor/work/bcbiotx/tmpfe1BUB/D20180108_normal-germline-gatk-haplotype.db

The vcf and ped file for recovering this error can be download by this link: debug files

Thanks in advance!

brentp commented 5 years ago

D20180108_normal occurs twice in D20180108_normal-germline-gatk-haplotype-nomultiallelic.ped if you remove that then it runs.

ahy1221 commented 5 years ago

@brentp Thank you very much ! It works now. How this mistaken ped file be generated ? I have tried to grep commands related to ped file but find nothing .

brentp commented 5 years ago

I don't know about that. @chapmanb would know.

ahy1221 commented 5 years ago

Thank you very much. Since this issue is not related to vcf2db, I closed the issue.