samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

split-vep incorrectly outputs cDNA, CDS and protein positions #2079

Closed bartcharbon closed 8 months ago

bartcharbon commented 8 months ago

I have an annotated VCF that I split using bcftools split-vep.

The cDNA, CDS and protein positions in the CSQ are: |8586-8599/9231|8523-8536/8835|2841-2846/2944|

The split-vep output for those fields in bcftools 1.19 is: 8586 8523 2841

In bcftools 1.17 the output was correct: 8586-8599/9231 8523-8536/8835 2841-2846/2944

pd3 commented 8 months ago

Can you show the VEP header line and the data line please?

bartcharbon commented 8 months ago

Header line: ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|PICK|SYMBOL_SOURCE|HGNC_ID|REFSEQ_MATCH|REFSEQ_OFFSET|SOURCE|SIFT|PolyPhen|HGVS_OFFSET|CLIN_SIG|SOMATIC|PHENO|PUBMED|CHECK_REF|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|Grantham|SpliceAI_pred_DP_AG|SpliceAI_pred_DP_AL|SpliceAI_pred_DP_DG|SpliceAI_pred_DP_DL|SpliceAI_pred_DS_AG|SpliceAI_pred_DS_AL|SpliceAI_pred_DS_DG|SpliceAI_pred_DS_DL|SpliceAI_pred_SYMBOL|CAPICE_CL|CAPICE_SC|existing_InFrame_oORFs|existing_OutOfFrame_oORFs|existing_uORFs|five_prime_UTR_variant_annotation|five_prime_UTR_variant_consequence|IncompletePenetrance|InheritanceModesGene|VKGL|VKGL_CL|gnomAD_AF|gnomAD_COV|gnomAD_FAF95|gnomAD_FAF99|gnomAD_HN|gnomAD_QC|gnomAD_SRC|clinVar_CLNID|clinVar_CLNREVSTAT|clinVar_CLNSIG|clinVar_CLNSIGINCL|ASV_ACMG_class|ASV_AnnotSV_ranking_criteria|ASV_AnnotSV_ranking_score|ALPHSCORE|ncER|phyloP">

Data line: chr3 48565192 . GGTACCCGCTCTGCAGGTAGGGCAGGGTGTGCTGGGAGCAGTGGCTGCTGGCCCCGGGGCAAGGTGGGCAGCACTGATTTCCACTGTGTGCACACAGTGCCCATGCGTGTGCCCTGCATGCAGACCCTACGTGCTTGGCGTGTGCCCTGCATTCATGGACACCCATGTGCGTGTCTCGGCCCCACCCATAGCTGCCCCACGGGTTCAGCTGTCCTCACCTTCC G . PASS CSQ=-|splice_acceptor_variant&splice_donor_variant&frameshift_variant&stop_lost&splice_donor_5th_base_variant&intron_variant|HIGH|COL7A1|1294|Transcript|NM_000094.4|protein_coding|116-117/119|116/118|NM_000094.4:c.8523_8536del|NP_000085.1:p.Glu2841AspfsTer3|8586-8599/9231|8523-8536/8835|2841-2846/2944|EEGEDS*TRGAAMGGAETRTWVSMNAGHTPST*GLHAGHTHGHCVHTVEISAAHLAPGPAATAPSTPCPTCRAGTX/DX|gaGGAAGGTGAGGACAGCTGAACCCGTGGGGCAGCTATGGGTGGGGCCGAGACACGCACATGGGTGTCCATGAATGCAGGGCACACGCCAAGCACGTAGGGTCTGCATGCAGGGCACACGCATGGGCACTGTGTGCACACAGTGGAAATCAGTGCTGCCCACCTTGCCCCGGGGCCAGCAGCCACTGCTCCCAGCACACCCTGCCCTACCTGCAGAGCGGGTACcc/gacc||1||-1||1|EntrezGene||||||||||||||||||||||||||||VUS|0.5681088|||||||AD&AR||||||||||||||4|1A_(cf_Gene_count%2C_RE_gene%2C_+0.00)%3B2E-1_(COL7A1%2C_+0.90)%3B3A_(1_gene%2C_+0.00)%3B5F_(+0.00)|0.9||99.7739|,-|upstream_gene_variant|MODIFIER|PFKFB4|5210|Transcript|NM_001317136.2|protein_coding|||||||||||1|4064|-1|||EntrezGene||||||||||||||||||||||||||||VUS|0.5681742|||||||||||||||||||||||||99.7739|,-|upstream_gene_variant|MODIFIER|UCN2|90226|Transcript|NM_033199.4|protein_coding|||||||||||1|1412|-1|||EntrezGene||||||||||||||||||||||||||||VUS|0.5681742|||||||||||||||||||||||||99.7739| GT 1/1
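To make the mapping between the header and the data line easier to follow, here is a minimal Python sketch of how the `Format:` field list lines up with the pipe-separated CSQ entries. It is not how split-vep is implemented. The field list and CSQ value are shortened, illustrative subsets of the header and data line above, and `split_csq` is a hypothetical helper name.

```python
# Shortened, illustrative subset of the CSQ "Format:" field list from the header above.
csq_format = "Allele|Consequence|IMPACT|SYMBOL|cDNA_position|CDS_position|Protein_position"

# Shortened CSQ value with two transcript entries separated by a comma.
# The values are copied from the data line above; most fields are omitted.
csq_value = (
    "-|frameshift_variant|HIGH|COL7A1|8586-8599/9231|8523-8536/8835|2841-2846/2944,"
    "-|upstream_gene_variant|MODIFIER|PFKFB4|||"
)


def split_csq(fmt, value):
    """Return one dict per transcript entry, keyed by the field names in fmt."""
    names = fmt.split("|")
    records = []
    for entry in value.split(","):
        fields = entry.split("|")
        records.append(dict(zip(names, fields)))
    return records


records = split_csq(csq_format, csq_value)
for rec in records:
    # The position fields are kept as raw strings, so ranges such as
    # "8586-8599/9231" survive untouched; empty fields stay empty.
    print(rec["SYMBOL"], rec["cDNA_position"], rec["CDS_position"], rec["Protein_position"])
```

Keeping the values as strings is what preserves the `start-end/length` form reported as correct for bcftools 1.17.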

pd3 commented 8 months ago

This is tied to the automatic type parsing introduced in https://github.com/samtools/bcftools/commit/2191405e8afd9b123d18dc7084459d409afc4ea4, where fields like cDNA_position are assumed to be integers. Your example shows that this assumption is incorrect, so we will set the automatic type to String.

In both versions one can enforce the desired type with -c cDNA_position:int or -c cDNA_position:string.

A warning is now printed when a numeric type cannot be parsed fully.
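As a rough illustration of the behaviour described above, the following Python sketch shows how reading a value such as `8586-8599/9231` as an integer can silently keep only the leading digits, and how an incomplete parse can be detected and reported. This is not the bcftools code: `parse_leading_int` is a hypothetical helper, and the assumption that the truncation comes from prefix-style integer parsing is inferred only from the output reported in this issue.

```python
import re


def parse_leading_int(text):
    """Read an optional sign and leading digits, ignoring whatever follows.

    Returns (value, fully_parsed). value is None when text has no leading integer.
    """
    match = re.match(r"[+-]?\d+", text)
    if match is None:
        return None, False
    return int(match.group()), match.end() == len(text)


# Values taken from the CSQ fields discussed in this issue.
for raw in ["8586-8599/9231", "8523-8536/8835", "2841-2846/2944", "4064"]:
    value, fully_parsed = parse_leading_int(raw)
    if not fully_parsed:
        # Illustrative warning text, not the message bcftools prints.
        print(f"warning: could not fully parse {raw!r} as an integer, got {value}")
    else:
        print(f"{raw!r} parsed as {value}")
```

Treating such fields as strings, whether by default or by an explicit type on `-c`, avoids the problem because no numeric conversion is attempted.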

Thank you for the bug report.