Closed MWendorff closed 3 years ago
Consequences for each haplotype are recorded in a per-sample FORMAT tag as a bitmask of indexes into the list of consequences recorded in the INFO tag. The bitmask interleaves each haplotype so that when stored in BCF (binary VCF) format, only 8 bits per sample are required for most sites. The bitmask can be translated into a human readable form using the BCFtools/query command.
Please see the TBCSQ key to the query
command and more documentation here:
http://samtools.github.io/bcftools/howtos/csq-calling.html
http://samtools.github.io/bcftools/bcftools.html#csq
https://academic.oup.com/bioinformatics/article/33/13/2037/3000373
If anything is unclear please let us know.
Hi Petr Thanks for your response. But unfortunately it does not help me. I was aware of the TBCSQ. But I am still confused by the ncsq field containing a ".".
We already digged deaper into the BCSQ files and in normal cases (all fields contain integers) the parsing of the single ncsq fields is clear.
But I do not understand at all why the "." in a csq file might be correct. For me it looks like a bug. Why is there a mixture of numbers with "." and how to translate '.' into an integer?
Sorry, I misunderstood your question. This can be an error in assigning a bitmask (an integer value) equal to internal representation for the missing value (also an integer value).
Any chance you could provide a small test case to reproduce the problem?
This could be related to this bug https://github.com/samtools/bcftools/issues/1429 which was just fixed. Can you please that the commit d1f7494 fixes your problem as well?
Thank you for the test case. The problem was caused by incorrect handling of the --ncsq parameter, reserved BCF values were not considered. The problem should be fixed now.
Sadly, note that in files produced prior to this fix, any FMT/BCSQ fields involving multiple values might have been compromised by this bug.
Hi Petr, Thanks, a lot for your support! With the new version the test case seems to run without any problems. I am now looking forward to analyse the results further ;-)
Best Mareike
Hi, I used csq on a big dataset and in general it is working great. But there are some cases were I am confused by the FORMAT/BCSQ field.
Here is one example:
13 102732665 13_102732665_A_G A G 3000 PASS AF=0.999494;AQ=3000;ExcessHet=3.011;QD=0;AC=2;AN=2;BCSQ=*missense|CCDC168|ENST00000322527|protein_coding|-|6011L>6011P|102732665A>G,@102737762,@102739441,@102733857,@102741191,@102742699,@102734366,@102734901,@102746627,missense|CCDC168|ENST00000322527|protein_coding|-|6011L>6010P|102732665A>G,@102742972,@102750067,@102733873,@102734279,@102734620,@102736789,@102737126,@102737852,@102750334,@102741579,@102747336,@102744330,@102742686,@102745135,@102749788,@102732961,@102733083,@102734213,@102734676,@102736249,@102737649,@102738443,@102742472,@102736445,@102737048,@102748934,@102738554,@102739005,@102739198,@102739283,@102742519,@102743029,@102743256,@102745938,@102750179,@102747790 GT:AB:AD:DP:GQ:PL:RNC:SBPV:BCSQ 1|1:.:0,55:55:99:1974,160,0:..:0.5303:1,.,0
I also extracted the per samples consequence:
13 102732665 KIEL_PSO00156_0212925429 *missense|CCDC168|ENST00000322527|protein_coding|-|6011L>6011P|102732665A>G @102738443
Therefore, it looks like 1,.,0 is interpreted as consequence 32 and consequence 1, but I do not understand why there is the "." at all and why it is interpreted as a number.
Best Mareike