quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License

issue with creating database #57

Open atimms opened 5 years ago

atimms commented 5 years ago

Hello...

I've used gemini and vcf2db previously with great success, but I'm having issues with a new set of VCFs I've just received.

I annotated with snpeff in my usual way but received the following error message:

Traceback (most recent call last):
  File "/home/atimms/programs/vcf2db/vcf2db.py", line 923, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/home/atimms/programs/vcf2db/vcf2db.py", line 233, in __init__
    self.load()
  File "/home/atimms/programs/vcf2db/vcf2db.py", line 318, in load
    i = self._load(self.cache, create=True, start=1)
  File "/home/atimms/programs/vcf2db/vcf2db.py", line 311, in _load
    self.insert(variants, expanded, keys, i, create=create)
  File "/home/atimms/programs/vcf2db/vcf2db.py", line 373, in insert
    vilengths, variant_impacts)
  File "/home/atimms/programs/vcf2db/vcf2db.py", line 401, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/home/atimms/programs/vcf2db/vcf2db.py", line 443, in __insert
    trans.execute(stmt, o)
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 980, in execute
    return meth(self, multiparams, params)
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 273, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1099, in _execute_clauseelement
    distilled_params,
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1240, in _execute_context
    e, statement, parameters, cursor, context
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1458, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 296, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
    cursor, statement, parameters, context
  File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 536, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 48 - probably unsupported type.
[SQL: u'INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, an, baseqranksum, clippingranksum, db, dp, ds, excesshet, fs, mq, mqranksum, negative_train_site, pg, positive_train_site, qd, raw_mq, readposranksum, sor, vqslod, culprit, loconfdenovo, old_multiallelic, old_variant, lof, consequence, symbol, feature_type, feature, intron, hgvsc, hgvsp, cdna_position, cds_position, protein_position, amino_acids, codons, existing_variation, distance, strand, flags, variant_class, symbol_source, hgnc_id, canonical, sift, hgvs_offset, hgvsg, amino_acid_change, transcript_biotype, gene_coding, transcript_id, exon_rank, genotype, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)']
[parameters: (1, u'chr1', 10143, 10150, None, u'TAACCCC', u'T', 120.08000183105469, None, 'indel', 'del', 1.0, 1, 2, 0, 0, 0.3333333333333333, u'DDX11L1', None, u'ENST00000456328', 0, 0, 0, 0, 0, u'', u'1724', u'', None, u'processed_transcript', 'upstream_gene_variant', 'upstream_gene_variant', 'LOW', None, None, None, None, 6, -0.550000011920929, -0.550000011920929, 0, 75, 0, 3.9793999195098877, 0.0, 22.270000457763672, 0.9369999766349792, 0, (0, 0, 0), 0, 17.149999618530273, 17356.0, 0.9369999766349792, 0.36800000071525574, 3.0899999141693115, u'FS', None, None, u'None', u'None', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', u'', u'processed_transcript', u'NON_CODING', u'ENST00000456328', u'', u'T', <read-only buffer for 0x7fffdfe884f8, size -1, offset 0 at 0x7fffdef27270>, <read-only buffer for 0x7fffdfeed7a0, size -1, offset 0 at 0x7fffdef272b0>, <read-only buffer for 0x7fffdfe91120, size -1, offset 0 at 0x7fffdef272f0>, <read-only buffer for 0x7fffdfeed7d8, size -1, offset 0 at 0x7fffdef27330>, <read-only buffer for 0x7fffdfeed810, size -1, offset 0 at 0x7fffdef27370>, <read-only buffer for 0x7fffdfeed848, size -1, offset 0 at 0x7fffdef273b0>, <read-only buffer for 0x7fffdfeed880, size -1, offset 0 at 0x7fffdef273f0>, <read-only buffer for 0x7fffdfef8b30, size -1, offset 0 at 0x7fffdef27430>)]
(Background on this error at: http://sqlalche.me/e/rvf5)

The VCF I received does have some strange fields in the genotypes (generated by GATK); here's an example line:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT F03-00008 F03-00006 F03-00007

chr1 10144 . TAACCCC T 120.08 PASS AC=2;AF=0.333;AN=6;BaseQRankSum=-0.55;ClippingRankSum=-0.55;DP=75;ExcessHet=3.9794;FS=0;MLEAC=1;MLEAF=0.25;MQ=22.27;MQRankSum=0.937;PG=0,0,0;QD=17.15;RAW_MQ=17356;ReadPosRankSum=0.937;SOR=0.368;VQSLOD=3.09;culprit=FS;EFF=MOTIFMA0341.1:Egr1,MOTIFMA0366.1:Egr1,UPSTREAM(MODIFIER||1724|||DDX11L1|processed_transcript|NON_CODING|ENST00000456328||T),UPSTREAM(MODIFIER||1865|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000450305||T),DOWNSTREAM(MODIFIER||4259|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147||T),INTERGENIC(MODIFIER||||||||||T) GT:AD:DP:FT:GQ:JL:JP:PL:PP 0/1:40,0:40:lowGQ:2:-1:-1:0,0,545:2,0,547 0/1:3,4:7:PASS:50:-1:-1:126,0,46:127,0,50 0/0:23,0:23:lowGQ:0:.:.:0,0,0:0,0,0

Any help would be greatly appreciated.

Andrew

brentp commented 5 years ago

looks like PG is the offending field with 0,0,0. what does the vcf header show for PG? you can also send that field to the --black-list so it does not try to load.
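In the tracebacks in this thread, the number in "Error binding parameter 48" appears to be a zero-based position in the column list of the INSERT statement: counting the columns of the statement above from zero puts pg at position 48, and the value at that position in the parameter list is (0, 0, 0). A minimal sketch of that lookup follows; the helper name `column_for_parameter` and the shortened statement are illustrative assumptions, not part of vcf2db.

```python
import re


def column_for_parameter(insert_sql, index):
    """Return the column name bound to the zero-based parameter `index`."""
    # Grab the parenthesised column list between the table name and VALUES.
    match = re.search(r"INSERT INTO \w+ \((.*?)\) VALUES", insert_sql, re.S)
    if match is None:
        raise ValueError("no column list found in statement")
    # Column names may be quoted (e.g. "end"), so strip the quotes.
    columns = [c.strip().strip('"') for c in match.group(1).split(",")]
    return columns[index]


# Shortened, hypothetical statement; in practice paste the full [SQL: ...] text.
sql = 'INSERT INTO variants (variant_id, chrom, start, "end", pg) VALUES (?, ?, ?, ?, ?)'
print(column_for_parameter(sql, 4))
```

Pasting the full statement from the error message and the reported parameter number should then name the field to inspect in the VCF header.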

atimms commented 5 years ago

in the vcf header it is described as:

INFO=

and when I passed -e PG to vcf2db, the gemini database was created successfully.

Thanks for getting back to me so quickly and resolving my issue.

Andrew
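For context on why a value such as 0,0,0 trips the loader: in the parameter list above it arrives as the Python tuple (0, 0, 0), and Python's sqlite3 module does not bind tuples by default. The sketch below only illustrates that behaviour with the standard library; `can_bind` is a hypothetical helper and does not reproduce vcf2db's own insert code.

```python
import sqlite3


def can_bind(value):
    """Return True if sqlite3 accepts `value` as a statement parameter."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (x)")
        conn.execute("INSERT INTO t (x) VALUES (?)", (value,))
        return True
    except sqlite3.Error:
        # The exact exception subclass varies between Python versions,
        # so catch the common sqlite3.Error base class.
        return False
    finally:
        conn.close()


# A scalar and a comma-joined string bind; a tuple is rejected.
for candidate in (0, "0,0,0", (0, 0, 0)):
    print(repr(candidate), can_bind(candidate))
```

This is consistent with excluding the field (as with -e PG) or reducing it to a single value making the load succeed.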

brentp commented 5 years ago

hmm. did you run vt decompose -s ? or bcftools norm? I wonder why that field did not get normalized.

atimms commented 5 years ago

I ran vt decompose -s on the VCF before loading.

The only difference with this VCF is that it had been put through the GATK genotype refinement workflow, i.e. https://gatkforums.broadinstitute.org/gatk/discussion/4723/genotype-refinement-workflow. I wonder if that affected something?

Andrew

mmoisse commented 4 years ago

I also had that issue; now I always do:

filter=""
# Only strip the fields when the header actually declares GC or PG
if [ `bcftools view --header input.vcf.gz | egrep '##INFO=<ID=GC,|##INFO=<ID=PG,' | wc -l` -gt 0 ]
then
   filter="-x INFO/GC,INFO/PG"
fi

vcf2db.py <(bcftools annotate $filter input.vcf.gz | bcftools +fixploidy) input.ped gemini.db

huangk3 commented 4 years ago

I ran into the same issue as well:

Traceback (most recent call last):
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 923, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 233, in __init__
    self.load()
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 321, in load
    self._load(self.vcf, create=False, start=i+1)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 305, in _load
    self.insert(variants, expanded, keys, i)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 373, in insert
    vilengths, variant_impacts)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 401, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 435, in __insert
    raise e
sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 170 - probably unsupported type.
[SQL: INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, ac, af, an, baseqranksum, clippingranksum, db, dp, ds, exome_chip, excesshet, fs, inbreedingcoeff, lcr, mleac, mleaf, mq, mqranksum, negative_train_site, old_multiallelic, old_variant, positive_train_site, qd, rvis, rvis_pct, rvis_pred, readposranksum, sor, vqslod, aaf_1kg_afr_float, aaf_1kg_all_float, aaf_1kg_amr_float, aaf_1kg_eas_float, aaf_1kg_eur_float, aaf_1kg_sas_float, aaf_esp_aa, aaf_esp_all, aaf_esp_ea, aaf_pid_711, ac_exac_afr, ac_exac_all, ac_exac_amr, ac_exac_eas, ac_exac_fin, ac_exac_nfe, ac_exac_oth, ac_exac_sas, acetyl_enh_33_cell_count, acetyl_enh_33_cell_list, acetyl_enh_all_127_tiss_count, active_enh_33_cell_count, active_enh_33_cell_list, active_enh_all_127_tiss_count, af_exac_afr, af_exac_all, af_exac_amr, af_exac_eas, af_exac_nfe, af_exac_oth, af_exac_sas, an_exac_afr, an_exac_all, an_exac_amr, an_exac_eas, an_exac_fin, an_exac_nfe, an_exac_oth, an_exac_sas, clinvar_disease_name, clinvar_pathogenic, common_pathogenic, cse_hiseq, culprit, dann_score, dbsnp_id, dpsi_max_tissue, dpsi_zscore, eigen_pc_phred, eigen_phred, fitcons, fuzzy_hgmd_class, fuzzy_hgmd_dna, fuzzy_hgmd_id, fuzzy_hgmd_orig_dna, fuzzy_hgmd_orig_prot, fuzzy_hgmd_pheno, fuzzy_hgmd_prot, gerp_elements, gno_exome_ac_afr, gno_exome_ac_all, gno_exome_ac_amr, gno_exome_ac_asj, gno_exome_ac_eas, gno_exome_ac_fin, gno_exome_ac_nfe, gno_exome_ac_oth, gno_exome_ac_sas, gno_exome_af_afr, gno_exome_af_all, gno_exome_af_amr, gno_exome_af_asj, gno_exome_af_eas, gno_exome_af_fin, gno_exome_af_nfe, gno_exome_af_oth, gno_exome_af_sas, gno_exome_an_afr, gno_exome_an_all, gno_exome_an_amr, 
gno_exome_an_asj, gno_exome_an_eas, gno_exome_an_fin, gno_exome_an_nfe, gno_exome_an_oth, gno_exome_an_sas, gno_exome_filter, gno_exome_id, gno_genome_ac_afr, gno_genome_ac_all, gno_genome_ac_amr, gno_genome_ac_asj, gno_genome_ac_eas, gno_genome_ac_fin, gno_genome_ac_nfe, gno_genome_ac_oth, gno_genome_af_afr, gno_genome_af_all, gno_genome_af_amr, gno_genome_af_asj, gno_genome_af_eas, gno_genome_af_fin, gno_genome_af_nfe, gno_genome_af_oth, gno_genome_af_sas, gno_genome_an_afr, gno_genome_an_all, gno_genome_an_amr, gno_genome_an_asj, gno_genome_an_eas, gno_genome_an_fin, gno_genome_an_nfe, gno_genome_an_oth, gno_genome_filter, gno_genome_id, gtex_gene_tissue_eqtl, hetaltab, hgmd_class, hgmd_dna, hgmd_indel_class, hgmd_indel_orig_dna, hgmd_indel_orig_prot, hgmd_indel_pheno, hgmd_overlap_indel_coords, hgmd_overlap_indel_id, hgmd_pheno, hgmd_prot, in_1kg, in_esp, in_exac, linsight_score, max_exac_aaf_all, max_gno_exome_aaf_all, max_gno_genome_aaf_all, mmind_cdna, mmind_id, mmind_prot, rap_score, rmsk, subgerp, subrvis, subrvis_pct, subrvis_pred, trap_cds_syn_splice_pred, trap_nc_splice_pred, weak_enh_33_cell_count, weak_enh_33_cell_list, weak_enh_all_127_tiss_count, allele, feature_type, intron, hgvsc, hgvsp, cdna_position, cds_position, existing_variation, distance, strand, flags, symbol_source, hgnc_id, ccds, hgvs_offset, appris, aloft_confidence, aloft_fraction_transcripts_affected, aloft_pred, aloft_prob_dominant, aloft_prob_recessive, aloft_prob_tolerant, ancestral_allele, cadd_phred, cadd_raw, deogen2_pred, deogen2_score, fathmm_pred, fathmm_score, genocanyon_score, interpro_domain, lrt_pred, lrt_score, m_cap_pred, m_cap_score, mpc_score, mvp_score, metalr_pred, metalr_score, metasvm_pred, metasvm_score, mutpred_aachange, mutpred_top5features, mutpred_protid, mutpred_score, mutationassessor_pred, mutationassessor_score, mutationtaster_pred, mutationtaster_score, provean_pred, provean_score, primateai_pred, primateai_score, revel_rankscore, revel_score, 
reliability_index, vest4_score, clinvar_clnsig, clinvar_review, clinvar_trait, lof, lof_filter, lof_flags, lof_info, mes_ncss_downstream_acceptor, mes_ncss_downstream_acceptor_seq, mes_ncss_downstream_donor, mes_ncss_downstream_donor_seq, mes_ncss_upstream_acceptor, mes_ncss_upstream_acceptor_seq, mes_ncss_upstream_donor, mes_ncss_upstream_donor_seq, mes_swa_acceptor_alt, mes_swa_acceptor_alt_context, mes_swa_acceptor_alt_frame, mes_swa_acceptor_alt_seq, mes_swa_acceptor_diff, mes_swa_acceptor_ref, mes_swa_acceptor_ref_comp, mes_swa_acceptor_ref_comp_seq, mes_swa_acceptor_ref_context, mes_swa_acceptor_ref_frame, mes_swa_acceptor_ref_seq, mes_swa_donor_alt, mes_swa_donor_alt_context, mes_swa_donor_alt_frame, mes_swa_donor_alt_seq, mes_swa_donor_diff, mes_swa_donor_ref, mes_swa_donor_ref_comp, mes_swa_donor_ref_comp_seq, mes_swa_donor_ref_context, mes_swa_donor_ref_frame, mes_swa_donor_ref_seq, maxentscan_alt, maxentscan_alt_seq, maxentscan_diff, maxentscan_ref, maxentscan_ref_seq, ada_score, rf_score, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 
?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]
[parameters: (314700, u'21', 41554522, 41554750, u'rs114481025;rs34163425', u'GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC', u'G', 220.89999389648438, u'VQSRTrancheINDEL99.70to99.80', 'indel', 'del', 1.0, 128, 4, 0, 0, 0.015151515151515152, u'DSCAM', u'ENSG00000171587', u'ENST00000400454', 0, 0, 0, 0, 1, '', '', '', u'', u'protein_coding', u'intron_variant', u'intron_variant', 'LOW', u'', None, u'', None, 4, 0.0, 264, 0.6340000033378601, -0.22699999809265137, 0, 6786, 0, 0, 5.850200176239014, 2.736999988555908, -0.07159999758005142, 0, 5, 0.01899999938905239, 51.58000183105469, 4.010000228881836, 1, None, 'None', 0, 1.3300000429153442, None, None, None, -0.07100000232458115, 0.8799999952316284, -2.7079999446868896, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, 0, 0, u'MQRankSum', None, None, None, None, None, None, 0.06549999862909317, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, None, None, None, None, None, None, None, None, None, 1, 4, 0, 0, 1, 0, 2, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, (5824, 1920), (25160, 9670), (708, 318), (262, 122), (1396, 554), (3008, 1536), (13152, 4898), (810, 322), None, u'rs114481025,rs34163425', 'None', 0.5601999759674072, None, None, None, None, None, None, None, None, None, None, 0, 0, 0, None, -1.0, -1.0, 0.0, None, None, None, None, u'trf,trf,trf', None, None, None, None, None, None, None, None, None, u'-', u'Transcript', u'14/32', u'ENST00000400454.1:c.2780-3729_2780-3503del', u'', u'', u'', u'', u'', u'-1', u'', 
u'HGNC', u'3039', u'CCDS42929.1', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'-4.728', u'TGAAAAACAAAACCCAAAAGACT', u'10.655', u'CAGGTACGT', u'5.047', u'TCTTTCTGTTGATGGCACAGAGC', u'10.858', u'CAGGTAAGT', u'2.672', u'CGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT', u'6', u'TGTCCACGGTTTCTCTGCTGGCC', u'3.220', u'5.892', u'5.892', u'GTCCACGGTTTCTCTGGTAGCCC', u'CGTTCTGTCCACGGTTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT', u'92', u'GTCCACGGTTTCTCTGGTAGCCC', u'-11.010', u'TTTCTCTGCTGGCCCC', u'6', u'CTGCTGGCC', u'15.518', u'4.508', u'4.508', u'CTGGTAACC', u'TTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCC', u'6', u'CTGGTAACC', u'', u'', u'', u'', u'', u'', u'', <read-only buffer for 0x55555a1f2790, size -1, offset 0 at 0x2aaaf86baa70>, <read-only buffer for 0x2aaae10013f0, size -1, offset 0 at 0x2aaaf86baab0>, <read-only buffer for 0x2aaaea50d5a8, size -1, offset 0 at 0x2aaaf86baaf0>, <read-only buffer for 0x2aaaea2a5ab0, size -1, offset 0 at 0x2aaaf86bab30>, <read-only buffer for 0x2aaae9bfdc70, size -1, offset 0 at 0x2aaaf86bab70>, <read-only buffer for 0x2aaaea503d98, size -1, offset 0 at 0x2aaaf86babb0>, <read-only buffer for 0x2aaaea1e17b0, size -1, offset 0 at 0x2aaaf86babf0>, <read-only buffer for 0x2aaaea447eb0, size -1, offset 0 at 0x2aaaf86bac30>)]
(Background on this error at: http://sqlalche.me/e/rvf5)

The VCF had been processed by 'vt decompose'

21      41554523        rs114481025;rs34163425  GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAA
CCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC    G       220.9   VQSRTrancheINDEL99.70to99.80    AC=4;AF=0;AN=264
;BaseQRankSum=0.634;ClippingRankSum=-0.227;DP=6786;ExcessHet=5.8502;FS=2.737;InbreedingCoeff=-0.0716;MLEAC=5;MLEAF=0.019;MQ=51.58;MQRankSum=4.01;NEGATIVE_TRAIN_SITE;QD=1.33
;ReadPosRankSum=-0.071;SOR=0.88;VQSLOD=-2.708;culprit=MQRankSum;hetAltAB=0.5602;CSQ=-|intron_variant|MODIFIER|DSCAM|ENSG00000171587|Transcript|ENST00000400454|protein_codin
g||14/32|ENST00000400454.1:c.2780-3729_2780-3503del|||||||||-1||HGNC|3039|YES|CCDS42929.1|||||||||||||||||||||||||||||||||||||||||||||||||||||-4.728|TGAAAAACAAAACCCAAAAGACT|10.655|CAGGTACGT|5.047|TCTTTCTGTTGATGGCACAGAGC|10.858|CAGGTAAGT|2.672|CGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|6|TGTCCACGGTTTCTCTGCTGGCC|3.220|5.892|5.892|GTCCACGGTTTCTCTGGTAGCCC|CGTTCTGTCCACGGTTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|92|GTCCACGGTTTCTCTGGTAGCCC|-11.010|TTTCTCTGCTGGCCCC|6|CTGCTGGCC|15.518|4.508|4.508|CTGGTAACC|TTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCC|6|CTGGTAACC|||||||,-|intron_variant|MODIFIER|DSCAM|ENSG00000171587|Transcript|ENST00000404019|protein_coding||10/28|ENST00000404019.2:c.2036-3729_2036-3503del|||||||||-1|cds_start_NF|HGNC|3039|||||||||||||||||||||||||||||||||||||||||||||||||||||||-4.728|TGAAAAACAAAACCCAAAAGACT|10.655|CAGGTACGT|5.047|TCTTTCTGTTGATGGCACAGAGC|10.858|CAGGTAAGT|2.672|CGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|6|TGTCCACGGTTTCTCTGCTGGCC|3.220|5.892|5.892|GTCCACGGTTTCTCTGGTAGCCC|CGTTCTGTCCACGGTTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|92|GTCCACGGTTTCTCTGGTAGCCC|-11.010|TTTCTCTGCTGGCCCC|6|CTGCTGGCC|15.518|4.508|4.508|CTGGTAACC|TTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGT
GGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCC|6|CTGGTAACC|||||||;fitcons=0.0655;rmsk=trf,trf,trf;gno_genome_ac_all=4;gno_genome_an_all=25160,9670;gno_genome_ac_afr=1;gno_genome_an_afr=5824,1920;gno_genome_ac_amr=0;gno_genome_an_amr=708,318;gno_genome_ac_asj=0;gno_genome_an_asj=262,122;gno_genome_ac_eas=1;gno_genome_an_eas=1396,554;gno_genome_ac_fin=0;gno_genome_an_fin=3008,1536;gno_genome_ac_nfe=2;gno_genome_an_nfe=13152,4898;gno_genome_ac_oth=0;gno_genome_an_oth=810,322;gno_genome_id=rs114481025,rs34163425;gno_genome_af_all=0;gno_genome_af_afr=0;gno_genome_af_amr=0;gno_genome_af_asj=0;gno_genome_af_eas=0;gno_genome_af_fin=0;gno_genome_af_nfe=0;gno_genome_af_oth=0;max_gno_genome_aaf_all=0  GT:AD:DP:GQ:PGT:PID:PL  0/0:56,0:56:99:.:.:0,99,1507    0/0:102,0:102:73:.:.:0,73,2644  0/0:44,0:44:0:.:.:0,0,1166      0/0:50,0:50:51:.:.:0,51,1297    0/0:37,0:37:99:.:.:0,99,1485    0/0:38,0:38:91:.:.:0,91,1176    0/0:78,0:78:44:.:.:0,44,2164    0/0:35,0:35:0:.:.:0,0,826       0/0:42,0:42:99:.:.:0,99,1131    0/1:18,21:39:17:0|1:41554523_GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC_G:17,0,767    0/0:43,0:43:99:.:.:0,105,1244   0/0:36,0:36:99:.:.:0,104,1149   0/0:35,0:35:40:.:.:0,40,940     0/0:54,0:54:99:.:.:0,119,1800   0/0:43,0:43:99:.:.:0,106,1415   0/0:109,0:109:0:.:.:0,0,2409    0/0:53,0:53:91:.:.:0,91,1565    0/0:49,0:49:26:.:.:0,26,1100    0/0:31,0:31:0:.:.:0,0,673       0/0:27,0:27:12:.:.:0,12,729     0/0:27,0:27:12:.:.:0,12,668     0/0:55,0:55:0:.:.:0,0,1345      0/0:29,0:29:52:.:.:0,52,937     0/0:41,0:41:90:.:.:0,90,1165    0/0:102,0:102:0:.:.:0,0,2518    0/0:71,0:71:0:.:.:0,0,1740      0/0:29,0:29:25:.:.:0,25,826     0/0:36,0:36:99:.:.:0,99,1485    0/0:43,0:43:99:.:.:0,105,1537   0/0:36,0:36:99:.:.:0,102,1530   0/0:39,0:39:69:.:.:0,69,1085    0/0:46,0:46:99:.:.:0,103,1302   
0/0:28,0:28:0:.:.:0,0,499       0/0:41,0:41:91:.:.:0,91,1194    0/0:38,0:38:79:.:.:0,79,1110    0/0:46,0:46:99:.:.:0,101,1273   0/0:52,0:52:0:.:.:0,0,857       0/0:39,0:39:0:.:.:0,0,1070      0/0:39,0:39:99:.:.:0,105,1575   0/0:79,0:79:0:.:.:0,0,1968      0/0:43,0:43:99:.:.:0,108,1151   0/0:62,0:62:49:.:.:0,49,1683    0/0:46,0:46:99:.:.:0,100,1324   0/0:63,0:63:99:.:.:0,120,1800   0/0:71,0:71:99:.:.:0,120,1800   0/0:55,0:55:99:.:.:0,105,1573   0/0:31,0:31:0:.:.:0,0,646       0/0:62,0:62:1:.:.:0,1,1578      0/0:26,0:26:0:.:.:0,0,614       0/0:61,0:61:99:.:.:0,120,1800   0/0:51,0:51:0:.:.:0,0,1177      0/0:52,0:52:0:.:.:0,0,1282      0/0:67,0:67:94:.:.:0,94,1900    0/0:69,0:69:99:.:.:0,100,1800   0/0:49,0:49:77:.:.:0,77,1396    0/0:64,0:64:0:.:.:0,0,1407      0/0:69,0:69:0:.:.:0,0,1612      0/0:80,0:80:80:.:.:0,80,2032    0/1:17,19:36:68:0|1:41554523_GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC_G:68,0,803    0/0:36,0:36:0:.:.:0,0,876       0/0:47,0:47:48:.:.:0,48,1287    0/0:56,0:56:99:.:.:0,101,1678   0/0:45,0:45:0:.:.:0,0,1075      0/0:113,0:113:99:.:.:0,120,1800 0/0:55,0:55:0:.:.:0,0,1316      0/1:15,17:32:99:.:.:105,0,470   0/0:82,0:82:0:.:.:0,0,2146
      0/0:39,0:39:48:.:.:0,48,1064    0/0:32,0:32:87:.:.:0,87,1305    0/0:48,0:48:0:.:.:0,0,1030      0/0:48,0:48:0:.:.:0,0,1129      0/0:51,0:51:0:.:.:0,0,1398      0/0:40,0:40:36:.:.:0,36,964     0/0:30,0:30:60:.:.:0,60,973     0/0:77,0:77:99:.:.:0,107,1800   0/0:43,0:43:93:.:.:0,93,1242    0/0:42,0:42:61:.:.:0,61,1104    0/0:37,0:37:9:.:.:0,9,1021      0/0:44,0:44:0:.:.:0,0,760       0/0:47,0:47:99:.:.:0,117,1755   0/0:72,0:72:48:.:.:0,48,2014    0/0:54,0:54:49:.:.:0,49,1406    0/0:48,0:48:99:.:.:0,120,1474
   0/0:45,0:45:69:.:.:0,69,1251    0/0:37,0:37:42:.:.:0,42,1046    0/0:41,0:41:72:.:.:0,72,1109    0/0:164,0:164:99:.:.:0,120,1800 0/0:28,0:28:72:.:.:0,72,1080    0/0:16,0:16:13:.:.:0,13,504     0/0:42,0:42:22:.:.:0,22,1057    0/0:76,0:76:99:.:.:0,100,1800   0/0:76,0:76:99:.:.:0,120,1800   0/0:54,0:54:99:.:.:0,104,1670   0/0:62,0:62:0:.:.:0,0,1316      0/0:29,0:29:52:.:.:0,52,875     0/0:38,0:38:0:.:.:0,0,893       0/0:86,0:86:0:.:.:0,0,1970      0/0:35,0:35:83:.:.:0,83,992     0/0:89,0:89:90:.:.:0,90,2590    0/0:24,0:24:35:.:.:0,35,780     0/0:34,0:34:79:.:.:0,79,951     0/0:58,0:58:75:.:.:0,75,1602    0/0:78,0:78:54:.:.:0,54,2048    0/0:40,0:40:24:.:.:0,24,1023    0/0:95,0:95:0:.:.:0,0,2543      0/0:44,0:44:11:.:.:0,11,1179    0/0:123,0:123:57:.:.:0,57,3377  0/0:43,0:43:99:.:.:0,113,1451   0/0:56,0:56:99:.:.:0,120,1800   0/0:28,0:28:48:.:.:0,48,841     0/0:65,0:65:0:.:.:0,0,1420      0/0:53,0:53:0:.:.:0,0,924       0/0:56,0:56:99:.:.:0,108,1499   0/1:23,36:59:99:0|1:41554523_GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC_G:125,0,636   0/0:27,0:27:81:.:.:0,81,851     0/0:45,0:45:56:.:.:0,56,1189    0/0:50,0:50:78:.:.:0,78,1451    0/0:37,0:37:75:.:.:0,75,1068    0/0:54,0:54:0:.:.:0,0,1340      0/0:37,0:37:24:.:.:0,24,915     0/0:74,0:74:85:.:.:0,85,2078    0/0:31,0:31:90:.:.:0,90,1350    0/0:48,0:48:94:.:.:0,94,1258    0/0:42,0:42:99:.:.:0,105,1194   0/0:35,0:35:99:.:.:0,99,1485    0/0:35,0:35:77:.:.:0,77,1033    0/0:37,0:37:80:.:.:0,80,1076    0/0:68,0:68:99:.:.:0,104,1800   0/0:42,0:42:99:.:.:0,107,1404   0/0:57,0:57:99:.:.:0,118,1588   0/0:33,0:33:99:.:.:0,99,970     0/0:51,0:51:99:.:.:0,103,1289

mmoisse commented 4 years ago

I believe the problem is the gno_genome_an_ fields: for example, gno_genome_an_afr has multiple values (5824, 1920), and vcf2db.py cannot handle that. You could solve it by removing the field or one of the values; see my previous post.
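A rough way to spot such fields before loading is to scan a record's INFO column for entries that hold several comma-separated numbers. The sketch below is a minimal illustration under that assumption; `multi_valued_numeric_fields` is a hypothetical helper, and whether a flagged field is actually a problem still depends on how it is declared (Number=...) in the VCF header.

```python
def multi_valued_numeric_fields(info):
    """Return {key: values} for INFO entries that hold several numbers."""
    flagged = {}
    for entry in info.split(";"):
        if "=" not in entry:
            continue  # flag-type entries such as NEGATIVE_TRAIN_SITE
        key, raw = entry.split("=", 1)
        parts = raw.split(",")
        if len(parts) < 2:
            continue  # single value, nothing to report
        try:
            flagged[key] = [float(p) for p in parts]
        except ValueError:
            pass  # comma-separated text, e.g. rmsk=trf,trf,trf
    return flagged


# Excerpt assembled from fields that appear in the record above.
info = "AC=4;NEGATIVE_TRAIN_SITE;rmsk=trf,trf,trf;gno_genome_an_afr=5824,1920"
print(sorted(multi_valued_numeric_fields(info)))
```

Text fields with commas are skipped on purpose: in the parameter list above, rmsk is passed as the single string 'trf,trf,trf', while the gno_genome_an_ fields arrive as tuples.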

huangk3 commented 4 years ago

Thanks @mmoisse. The error is fixed by removing the second value of these fields. The root of the problem is that there are duplicate records in the gnomAD genome VCF:

21     41554523      **rs114481025**     GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC       G
21     41554523      rs114481025   G      C
21     41554523      rs114481025   GCAGAGAAACCGTGGACAGAACGGGCCAC      G
21     41554523      rs114481025   G      GCAGAGAAACCGTGGACAGAACGGGCCAC
21     41554523      **rs34163425**       GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC       G

vcfanno concatenated the allele numbers (ANs) from rs114481025 and rs34163425 with ops=["self"]. The error went away after I set ops=["max"].
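For a VCF that has already been annotated, a similar effect can be approximated after the fact by collapsing the concatenated values to one number. This is only a sketch: `collapse_info_field` is a hypothetical helper, it assumes the values are integers, and it is not how vcfanno implements its max operation.

```python
def collapse_info_field(info, key, op=max):
    """Replace a multi-valued integer INFO entry with a single value."""
    out = []
    for entry in info.split(";"):
        name, sep, raw = entry.partition("=")
        if sep and name == key and "," in raw:
            values = [int(v) for v in raw.split(",")]
            # Keep one value chosen by `op` (max by default).
            entry = "%s=%d" % (name, op(values))
        out.append(entry)
    return ";".join(out)


# Excerpt assembled from fields that appear in the record above.
info = "gno_genome_ac_afr=1;gno_genome_an_afr=5824,1920"
print(collapse_info_field(info, "gno_genome_an_afr"))
```

Passing `op=min` or a function that picks the first element would mirror the other choices discussed here, such as dropping the second value.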

erinijapranckeviciene commented 3 years ago

Hello,

I am experiencing a similar issue. We have multiple exomes annotated with VEP, from which we create a multisample VCF using bcftools merge. After the merge, this multisample VCF is decomposed with vt decompose -s and passed to vcf2db.py to create a GEMINI db. Some sites that were previously multiallelic generate the error discussed in this issue.

I can't figure out what is wrong. Your help is very much appreciated. I attach here the vcf.gz, vcf.gz.tbi and ped of 4 samples with only the two lines that prevent loading, all in a zip file.

If it were possible to identify which field impairs the loading and how, we would take care of it before using vcf2db.py.

Many thanks in advance!

n.vcf.gz.zip.zip

Update: While asking for help, I figured it out myself :). In my case, for multiallelic variants from the merge, the INFO AC field gets two values, but it is defined as Number=1. Changing it to Number=. allows vcf2db.py to load my multisample VCF into the GEMINI db.
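The header change described in the update can be sketched as a one-line rewrite of the matching ##INFO declaration. The helper name `relax_info_number` and the example header line (including its Description text) are hypothetical; the actual declaration in the attached VCF is not shown in this thread.

```python
import re


def relax_info_number(header_line, field_id):
    """Set Number=. on the ##INFO declaration for `field_id`."""
    # Match only the declaration of the requested ID; other lines pass through.
    pattern = r"^(##INFO=<ID=%s,Number=)[^,>]+" % re.escape(field_id)
    return re.sub(pattern, r"\g<1>.", header_line)


# Hypothetical header line used for illustration only.
line = '##INFO=<ID=AC,Number=1,Type=Integer,Description="Allele count">'
print(relax_info_number(line, "AC"))
```

Applied to every header line while streaming the file, this leaves the records untouched and changes only how the field is declared.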