quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License
55 stars 13 forks source link

Constraints violation when loading a merged VCF #55

Open lbeltrame opened 5 years ago

lbeltrame commented 5 years ago

I'm hitting this repeatedly when trying to load VCFs that have been merged via bcftools merge then normalized and decomposed via vt and then annotated via vcfanno. Note that each of these had been annotated with VEP prior to merging.

It's absolutely unclear what causes it.


Traceback (most recent call last):
  File "/opt/bcbio/anaconda/bin/vcf2db.py", line 924, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/opt/bcbio/anaconda/bin/vcf2db.py", line 234, in __init__
    self.load()
  File "/opt/bcbio/anaconda/bin/vcf2db.py", line 319, in load
    i = self._load(self.cache, create=True, start=1)
  File "/opt/bcbio/anaconda/bin/vcf2db.py", line 312, in _load
    self.insert(variants, expanded, keys, i, create=create)
  File "/opt/bcbio/anaconda/bin/vcf2db.py", line 374, in insert
    vilengths, variant_impacts)
  File "/opt/bcbio/anaconda/bin/vcf2db.py", line 402, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/opt/bcbio/anaconda/bin/vcf2db.py", line 444, in __insert
    trans.execute(stmt, o)
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
    return meth(self, multiparams, params)
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
    exc_info
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/opt/bcbio/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 507, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) CHECK constraint failed: variants [SQL: u'INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, aa, af, altaineandertal, ancestral_allele, cadd_phred, cadd_raw, cadd_raw_rankscore, caf, cds, clnsig, cnt, dann_rankscore, dann_score, db, dp, denisova, eigen_pc_phred, eigen_pc_raw, eigen_pc_raw_rankscore, eigen_phred, eigen_raw, eigen_coding_or_noncoding, "ensembl?", ensembl_functional_consequence, ensembl_gene, ensembl_geneid, ensembl_id_c_change_p_change, ensembl_proteinid, ensembl_region, ensembl_transcriptid, fathmm_converted_rankscore, fathmm_pred, fathmm_score, geneinfo, "gerp++_nr", "gerp++_rs", "gerp++_rs_rankscore", gm12878_confidence_value, gm12878_fitcons_score, gm12878_fitcons_score_rankscore, gtex_v6p_gene, gtex_v6p_tissue, genocanyon_score, genocanyon_score_rankscore, h1_hesc_confidence_value, h1_hesc_fitcons_score, h1_hesc_fitcons_score_rankscore, huvec_confidence_value, huvec_fitcons_score, huvec_fitcons_score_rankscore, interpro_domain, lof, lrt_omega, lrt_converted_rankscore, lrt_pred, lrt_score, lseq, m_cap_pred, m_cap_rankscore, m_cap_score, msi, msilen, metalr_pred, metalr_rankscore, metalr_score, metasvm_pred, metasvm_rankscore, metasvm_score, mutpred_aachange, mutpred_top5features, mutpred_protid, mutpred_rankscore, mutpred_score, mutationassessor_uniprotid, mutationassessor_pred, mutationassessor_score, mutationassessor_score_rankscore, mutationassessor_variant, mutationtaster_aae, mutationtaster_converted_rankscore, mutationtaster_model, mutationtaster_pred, mutationtaster_score, nmd, provean_converted_rankscore, provean_pred, provean_score, revel_rankscore, revel_score, rseq, "refseq?", refseq_functional_consequence, refseq_gene, refseq_id_c_change_p_change, refseq_region, reliability_index, sample, shift3, somatic, sor, ssf, status, siphy_29way_logodds, siphy_29way_logodds_rankscore, siphy_29way_pi, transcript_id_vest3, transcript_var_vest3, vd, vest3_rankscore, vest3_score, ac_adj_exac_afr, ac_adj_exac_amr, ac_adj_exac_eas, ac_adj_exac_fin, ac_adj_exac_nfe, ac_adj_exac_oth, ac_adj_exac_sas, ac_exac_all, ada_score, af_1kg_afr, af_1kg_all, af_1kg_amr, af_1kg_eas, af_1kg_eur, af_1kg_sas, af_adj_exac_afr, af_adj_exac_amr, af_adj_exac_eas, af_adj_exac_fin, af_adj_exac_nfe, af_adj_exac_oth, af_adj_exac_sas, af_esp_aa, af_esp_all, af_esp_ea, af_exac_all, an_adj_exac_afr, an_adj_exac_amr, an_adj_exac_eas, an_adj_exac_fin, an_adj_exac_nfe, an_adj_exac_oth, an_adj_exac_sas, an_exac_all, cds_strand, clinvar_disease_name, clinvar_pathogenic, clinvar_sig, codon_degeneracy, codonpos, common_pathogenic, cosmic_ids, cpg_island, cse_hiseq, dgv, encode_consensus_gm12878, encode_consensus_h1hesc, encode_consensus_helas3, encode_consensus_hepg2, encode_consensus_huvec, encode_consensus_k562, fathmm_mkl_coding_group, fathmm_mkl_coding_pred, fathmm_mkl_coding_rankscore, fathmm_mkl_coding_score, fitcons, gerp_elements, gnomad_ac, gnomad_ac_amr, gnomad_ac_asj, gnomad_ac_eas, gnomad_ac_fin, gnomad_ac_nfe, gnomad_ac_oth, gnomad_ac_popmax, gnomad_ac_sas, gnomad_af, gnomad_af_afr, gnomad_af_amr, gnomad_af_asj, gnomad_af_eas, gnomad_af_fin, gnomad_af_nfe, gnomad_af_oth, gnomad_af_popmax, gnomad_af_sas, gnomad_an, gnomad_an_amr, gnomad_an_asj, gnomad_an_eas, gnomad_an_fin, gnomad_an_nfe, gnomad_an_oth, gnomad_an_popmax, gnomad_an_sas, gnomad_gc, gnomad_gc_female, gnomad_gc_male, gnomad_hom, gnomad_hom_female, gnomad_hom_male, gnomad_popmax, gnomad_exomes_ac, gnomad_exomes_af, gnomad_exomes_an, gnomad_genomes_ac, gnomad_genomes_af, gnomad_genomes_an, hapmap1, hapmap2, integrated_confidence_value, integrated_fitcons_score, integrated_fitcons_score_rankscore, max_aaf_all, num_exac_het, num_exac_hom, phastcons100way_vertebrate, phastcons100way_vertebrate_rankscore, phastcons20way_mammalian, phastcons20way_mammalian_rankscore, phylop100way_vertebrate, phylop100way_vertebrate_rankscore, phylop20way_mammalian, phylop20way_mammalian_rankscore, refcodon, rf_score, rmsk, rs_ids, stam_mean, stam_names, tfbs, allele, feature_type, intron, hgvsc, hgvsp, cdna_position, cds_position, existing_variation, allele_num, distance, strand, flags, variant_class, symbol_source, hgnc_id, tsl, appris, ccds, ensp, swissprot, trembl, uniparc, refseq_match, source, given_ref, used_ref, bam_edit, gene_pheno, domains, hgvs_offset, afr_af, amr_af, eas_af, eur_af, sas_af, aa_af, ea_af, gnomad_afr_af, gnomad_amr_af, gnomad_asj_af, gnomad_eas_af, gnomad_fin_af, gnomad_nfe_af, gnomad_oth_af, gnomad_sas_af, max_af, max_af_pops, clin_sig, pheno, pubmed, motif_name, motif_pos, high_inf_pos, motif_score_change, lof_filter, lof_flags, lof_info, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (1, u'chr1', 11166712, 11166713, u'rs2536', u'T', u'C', 266.0, u'REJECT', 'snp', 'ts', 0.07317073170731707, 1, 11, 0, 152, 0.4583333333333333, u'MTOR', None, u'ENST00000361445', 1, 0, 0, 0, 1, u'58/58', '', '', u'', u'protein_coding', u'3_prime_UTR_variant', u'3_prime_UTR_variant', 'LOW', u'', None, u'', None, None, u'0.0970', None, None, None, None, None, u'0.903,0.09704', None, 'None', None, None, None, 1, 11178, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, u'', None, None, None, None, u'TATTTGTTCTGCTCATAATT', None, None, None, 2.0, 1.0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'None', None, None, None, None, None, u'CCAATATGTACCAGACCTTC', None, None, None, None, None, None, u'mito_mango_63', 0, u'', inf, 0.0, u'StrongSomatic', None, None, None, None, None, 326, None, None, None, None, None, None, None, None, None, None, None, 0.1316000074148178, 0.09700000286102295, 0.07199999690055847, 0.08529999852180481, 0.024900000542402267, 0.15440000593662262, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, None, None, None, None, 0, None, 0, 0, None, u'T', u'T', u'T', u'T', u'T', u'T', None, None, None, None, 0.29510000348091125, None, 1797, 63, 20, 151, 213, 359, 54, 937, None, -1.0, 0.10740000009536743, 0.07519999891519547, 0.06620000302791595, 0.093299999833107, 0.061000000685453415, 0.023900000378489494, 0.054999999701976776, 0.10740000009536743, -1.0, 30974, 838, 302, 1618, 3494, 15018, 982, 8722, None, 15487.0, 6928.0, 8559.0, None, None, None, None, None, -1.0, None, None, -1.0, None, 0.07020000368356705, 23.648700714111328, None, None, None, 0.15440000593662262, None, None, None, None, None, None, None, None, None, None, None, None, None, u'rs2536', None, None, None, u'C', u'Transcript', u'', u'ENST00000361445.4:c.*829A>G', u'', u'8556/8677', u'', u'rs2536', u'1', u'', u'-1', u'', u'SNV', u'HGNC', u'3942', u'', u'', u'CCDS127.1', u'ENSP00000354558', u'P42345', u'Q96QW8&B1AKQ2&B1AKP8', u'UPI000012ABD3', u'', u'Ensembl', u'T', u'T', u'', u'1', u'', u'', u'0.1316', u'0.072', u'0.0853', u'0.0249', u'0.1544', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'0.1544', u'SAS', u'', u'', u'21973240&22815832&23209702&23423739&23524405&24816861&27462867&27533457&28280736&15720714&18545701', u'', u'', u'', u'', u'', u'', u'', <read-only buffer for 0x7fa9a86b2b28, size -1, offset 0 at 0x7fa99f6445f0>, <read-only buffer for 0x7fa9a86a3738, size -1, offset 0 at 0x7fa99f644630>, <read-only buffer for 0x7fa9a86c4ed8, size -1, offset 0 at 0x7fa99f644670>, <read-only buffer for 0x7fa9a86cb330, size -1, offset 0 at 0x7fa99f6446b0>, <read-only buffer for 0x7fa9a86a37b0, size -1, offset 0 at 0x7fa99f6446f0>, <read-only buffer for 0x7fa9a86a3828, size -1, offset 0 at 0x7fa99f644730>, <read-only buffer for 0x7fa9aebf42b0, size -1, offset 0 at 0x7fa99f644770>, <read-only buffer for 0x7fa9a98afb28, size -1, offset 0 at 0x7fa99f6447b0>)] (Background on this error at: http://sqlalche.me/e/gkpj)

The only other thing of note was that the PED was "generated" (I don't need these information) like this:

bcftools query -l data.annot.vcf.gz | awk 'BEGIN{FS="\t"}{print NR,$1,0,0,2,-9 }' > fake.ped
arq5x commented 5 years ago

Check constraints typically arise when the width of the data being inserted exceeds the maximum width of the column defined for the database. My guess would be you have a chromosome label that is longer than 10 characters, but could you check against the max widths of the columns, as defined here: https://github.com/quinlan-lab/vcf2db/blob/master/vcf2db.py#L628-L665?

brentp commented 5 years ago

vcf2db insert many records at a time. when there's an error, it tries to insert 1 at a time so the user gets some context on the exact record that casued a problem. It seems that something about the transaction is not working and so when it tries to insert 1 at a time (after supposedly failing the batch insert) it violates the unique constraint. Can you share a small vcf that recreates this problem?

lbeltrame commented 5 years ago

I don't know what is the exact cause that generates it, but I might be able to generate (and share privately) a VCF that at least here exhibits the issue. Would that be possible? Where should I send it to?

@arq5x I don't think so: it's a regular hg19-based VCF, so I doubt I have that large chromosome labels.

brentp commented 5 years ago

send to bpederse@gmail.com

lbeltrame commented 5 years ago

Sent. Make sure you double check the Spam folder, unfortunately my institution seems to have landed on a few blacklists.