samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
643 stars 241 forks source link

bcftools-1.10.2 +split-vep -f "[%AF]" outputs concatenated float values for each sampe, hence with multiple decimal dots #1358

Closed mmokrejs closed 3 years ago

mmokrejs commented 3 years ago

Hi, I am maybe twisting use of bcftools as I served it with VCF file output by GATK-4.1.0.9 Mutect2, containing normal and a tumor sample pair. I guess bcftools is parsing INFO fields for both samples and somehow mixes the results together. My aim was to annotate the CSQ values but admittedly I forgot that there are two samples in the input. Should bcftools +split-vep complain loudly if two samples are in the input? I do not see in the manpage or runtime help text how to parse only values for a certain sample in the context of +split-vep plugin, actually, would it make sense? If I am not mistaken it would be helpful to install a check into the software and also, improve the documentation.

The example below is a bit complicated as bcftools-1.10.2 +split-vep requires me to enumerate all the fields appearing in the input to be parsed, while spitting out a confusing message IMO

[filter.c:2491 filters_init1] Error: the tag "IMPACT" is not defined in the VCF header

The message should be rephrased, it complains about IMPACT missing from the -f "%CHROM:%POS %ID %REF %ALT [%AD] [%AF]\n" argument, do not know why. The field IMPACT is present in the VCF input file: gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt

Due to that, I keep the long list of columns I wanted to have printed out initially, although it could be shortened for demonstration purposes I assume.

In the below output please note 0.040.048 in th 6th column, originating from [%AF]. That originates from 0.04 of the first sample and from 0.048 of the second sample.

$ bcftools +split-vep -f "$columns_mutect2\n" -d -s worst:missense+ -i 'IMPACT~"HIGH"' /tmp/gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt
Warning: duplicate INFO/CSQ key "APPRIS"
Warning: duplicate INFO/CSQ key "TSL"
chr17:47169996 . G A 169,6145,6 0.040.048 169,6145,6 0/00/1 339 0.1367 stop_gained&nmd_transcript_variant HIGH CDC27 ENSG00000004897 Transcript ENST00000526866 nonsense_mediated_decay 4/12 . ENST00000526866.5:c.298C>T ENSP00000432105.1:p.Gln100Ter 390 298 100 Q/* Cag/Tag rs73319503 . -1 . SNV HGNC HGNC:1728 . . . ENSP00000432105 . F6QPS0 UPI0001AE6709 . . . . . . . . 0.1367 . . . . . . . . . . . . 37 . . . ENSG00000004897&ENSG00000004897&ENSG00000004897&ENSG00000004897 ENSP00000066544&ENSP00000434614&ENSP00000437339&ENSP00000460454 ENST00000066544&ENST00000531206&ENST00000527547&ENST00000575483 24684 2.033e-01 11353 1.922e-01 22812 2.148e-01 20708 2.209e-01 9988 2.099e-01 . .&.&.&. .&.&.&. Y&Y&Y&. . . 0.74766 . c.298C>T&c.298C>T&c.298C>T&. c.298C>T&c.298C>T&c.298C>T&c.298C>T c.298C>T&c.298C>T&c.298C>T&c.298C>T p.Q100X&p.Q100X&p.Q100X&. p.Gln100Ter&p.Gln100Ter&p.Gln100Ter&p.Gln100Ter p.Gln100*&p.Gln100*&p.Gln100*&p.Gln100* .&.&.&. . . . . . . . .&.&.&. . .&.&.&. Q100*&Q100*&Q100*&Q39* 0.81001 complex_aae&complex_aae&complex_aae&complex_aae A&A&A&A 1&1&1&1 . .&.&.&. .&.&.&. .&.&.&. . .&.&.&. .&.&.&. . .&.&.&. . . . . .&.&.&. .&.&.&. . .&.&.&. .&.&.&. P30260&P30260-2&G5EA36&I3L3H6 CDC27_HUMAN&CDC27_HUMAN&G5EA36_HUMAN&I3L3H6_HUMAN .&YES&.&. X 100&100&100&100 Q A -&-&-&- . . . . . . . . . 0&0&0&0 1&1&1&1 CDC27&CDC27&CDC27&CDC27 . . . . . . . . . . . . . . . . . . 0.55065 0.675202 1.000000 0.71638 0.998000 0.85391 1.000000 0.86279 9.396000 0.96625 0.672000 0.70159 1.172000 0.72986 G CAG&CAG&CAG&CAG rs73319503
$

$ echo $columns_mutect2
%CHROM:%POS %ID %REF %ALT [%AD] [%AF] [%AD] [%GT] %DP %gnomAD_NFE_AF %Consequence %IMPACT %SYMBOL %Gene %Feature_type %Feature %BIOTYPE %EXON %INTRON %HGVSc %HGVSp %cDNA_position %CDS_position %Protein_position %Amino_acids %Codons %Existing_variation %DISTANCE %STRAND %FLAGS %VARIANT_CLASS %SYMBOL_SOURCE %HGNC_ID %CANONICAL %MANE %CCDS %ENSP %SWISSPROT %TREMBL %UNIPARC %GENE_PHENO %SIFT %PolyPhen %DOMAINS %miRNA %HGVS_OFFSET %AF %EUR_AF %gnomAD_NFE_AF %CLIN_SIG %SOMATIC %PHENO %PUBMED %VAR_SYNONYMS %MOTIF_NAME %MOTIF_POS %HIGH_INF_POS %MOTIF_SCORE_CHANGE %TRANSCRIPTION_FACTORS %1000Gp3_EUR_AC %1000Gp3_EUR_AF %CADD_phred %ClinPred_pred %ClinPred_rankscore %ClinPred_score %Ensembl_geneid %Ensembl_proteinid %Ensembl_transcriptid %ExAC_AC %ExAC_AF %ExAC_NFE_AC %ExAC_NFE_AF %ExAC_nonTCGA_AC %ExAC_nonTCGA_AF %ExAC_nonTCGA_Adj_AC %ExAC_nonTCGA_Adj_AF %ExAC_nonTCGA_NFE_AC %ExAC_nonTCGA_NFE_AF %FATHMM_converted_rankscore %FATHMM_pred %FATHMM_score %GENCODE_basic %GTEx_V8_gene %GTEx_V8_tissue %GenoCanyon_rankscore %Geuvadis_eQTL_target_gene %HGVSc_ANNOVAR %HGVSc_VEP %HGVSc_snpEff %HGVSp_ANNOVAR %HGVSp_VEP %HGVSp_snpEff %Interpro_domain %LINSIGHT %LINSIGHT_rankscore %MutPred_AAchange %MutPred_Top5features %MutPred_protID %MutPred_rankscore %MutPred_score %MutationAssessor_pred %MutationAssessor_rankscore %MutationAssessor_score %MutationTaster_AAE %MutationTaster_converted_rankscore %MutationTaster_model %MutationTaster_pred %MutationTaster_score %PROVEAN_converted_rankscore %PROVEAN_pred %PROVEAN_score %Polyphen2_HDIV_pred %Polyphen2_HDIV_rankscore %Polyphen2_HDIV_score %Polyphen2_HVAR_pred %Polyphen2_HVAR_rankscore %Polyphen2_HVAR_score %PrimateAI_pred %PrimateAI_rankscore %PrimateAI_score %SIFT4G_converted_rankscore %SIFT4G_pred %SIFT4G_score %SIFT_converted_rankscore %SIFT_pred %SIFT_score %Uniprot_acc %Uniprot_entry %VEP_canonical %aaalt %aapos %aaref %alt %cds_strand %clinvar_MedGen_id %clinvar_OMIM_id %clinvar_Orphanet_id %clinvar_clnsig %clinvar_hgvs %clinvar_id %clinvar_review %clinvar_trait %clinvar_var_source %codon_degeneracy %codonpos %genename %gnomAD_exomes_AC %gnomAD_exomes_AF %gnomAD_exomes_NFE_AC %gnomAD_exomes_NFE_AF %gnomAD_exomes_NFE_AN %gnomAD_exomes_NFE_nhomalt %gnomAD_exomes_controls_AC %gnomAD_exomes_controls_AF %gnomAD_exomes_controls_NFE_AC %gnomAD_exomes_controls_NFE_AF %gnomAD_exomes_controls_NFE_AN %gnomAD_exomes_controls_NFE_nhomalt %gnomAD_genomes_AC %gnomAD_genomes_AF %gnomAD_genomes_NFE_AC %gnomAD_genomes_NFE_AF %gnomAD_genomes_NFE_AN %gnomAD_genomes_NFE_nhomalt %integrated_fitCons_rankscore %integrated_fitCons_score %phastCons100way_vertebrate %phastCons100way_vertebrate_rankscore %phastCons17way_primate %phastCons17way_primate_rankscore %phastCons30way_mammalian %phastCons30way_mammalian_rankscore %phyloP100way_vertebrate %phyloP100way_vertebrate_rankscore %phyloP17way_primate %phyloP17way_primate_rankscore %phyloP30way_mammalian %phyloP30way_mammalian_rankscore %ref %refcodon %rs_dbSNP151
$

$ grep 47169996 /tmp/gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt | awk '{print $10,$11}'
0/0:169,6:0.04:175:88,3:81,3:76,93,4,2 0/1:145,6:0.048:151:58,3:86,3:69,76,4,2
$

As I said above, the below error message does not actually complain about the input VCF file but about the fields I have omitted from the -f argument, don't know why.

$ bcftools +split-vep -f "%CHROM:%POS %ID %REF %ALT [%AD] [%AF]\n" -d -s worst:missense+ -i 'IMPACT~"HIGH"' /tmp/gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt
Warning: duplicate INFO/CSQ key "APPRIS"
Warning: duplicate INFO/CSQ key "TSL"
[filter.c:2491 filters_init1] Error: the tag "IMPACT" is not defined in the VCF header
$

gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt

dlaehnemann commented 3 years ago

I think you simply need to add a whitespace or tab in there. So this should work as expected (note that I moved the whitespace from before the [] brackets inside them):

bcftools +split-vep -f "%CHROM:%POS %ID %REF %ALT[ %AD][ %AF]\n" -d -s worst:missense+ -i 'IMPACT~"HIGH"' /tmp/gatk.Mutect2.FilterMutectCalls.vep101.bcftools_filter_by_depth.txt
mmokrejs commented 3 years ago

Thank you, this [ %AF] [ %GT] helps a bit. But I get two spaces as a field separator in the output. Seems this is also a problem for [%GT] field. Should probably comma used to separate the values for each sample?

mmokrejs commented 3 years ago

Hmm, [,%AF] yields ,0.04,0.048, with a leading comma. Same for semicolon.

dlaehnemann commented 3 years ago

Yeah, for the FORMAT fields you have to always include the separator inside the [] brackets as a leading character and then ensure that the same separator is not included before the brackets, so for your example two comments above this would be:

[ %AF][ %GT]

The expression simply does exactly what you specify and with the extra whitespace between the brackets (i.e. ] [) will generate the double whitespace in front of the first GT entry in the output.

And which separator you want to use, depends on what you want to do downstream.

mmokrejs commented 3 years ago

Could I prevent the leading separator instance from appearing? My aim is to get a tabular listing and then to import it into say, Excel. So, first to split by a space and then sub-split by comma or semicolon. The leading e.g. comma could be removed by '/ ,/ /' regexp replacement but is that really necessary?

dlaehnemann commented 3 years ago

Depending on what you want to do downstream, you might also consider having one line per sample and site, which would be a tidy data format -- this would circumvent the need to have several levels per line to deparse.

In any case, I think the examples over at the bcftools query docs might help you further.

And I think the behaviour you see is not actually a bug, but the intended behaviour. So unless I am not seeing something here or you want to indeed suggest some particular change in behaviour, I'd suggest closing this issue if you can solve your problem with the docs and the suggestions here.

mmokrejs commented 3 years ago

So why does the leading sepatrator char gets printed at all?

Also, the [filter.c:2491 filters_init1] Error: the tag "IMPACT" is not defined in the VCF header is bogus IMO.

pd3 commented 3 years ago

There seem to be two issues here. Regarding the main error, please try with the latest version of bcftools, there were fixes and improvements.

The other issue is about query formatting; admittedly this could have been done better with the leading character not printed for the first sample. Unfortunately this is too late to change now as it would break backward compatibility. It is easy to work around this by following @dlaehnemann's suggestion: bcftools query -f '%CHROM\t%POS[\t%GT]\n' yields columns separated by a single tab. If there is no other output before the square brackets, you can always stream through sed bcftools query -f '[\t%GT]\n' | sed 's,^\s\s*,,'

pd3 commented 3 years ago

This seems to be resolved, closing now. Please open a new ticket if not.