samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
672 stars 240 forks source link

bcftools csq results in some cases in ncsq "1,.,0" #1428

Closed MWendorff closed 3 years ago

MWendorff commented 3 years ago

Hi, I used csq on a big dataset and in general it is working great. But there are some cases were I am confused by the FORMAT/BCSQ field.

Here is one example:

13 102732665 13_102732665_A_G A G 3000 PASS AF=0.999494;AQ=3000;ExcessHet=3.011;QD=0;AC=2;AN=2;BCSQ=*missense|CCDC168|ENST00000322527|protein_coding|-|6011L>6011P|102732665A>G,@102737762,@102739441,@102733857,@102741191,@102742699,@102734366,@102734901,@102746627,missense|CCDC168|ENST00000322527|protein_coding|-|6011L>6010P|102732665A>G,@102742972,@102750067,@102733873,@102734279,@102734620,@102736789,@102737126,@102737852,@102750334,@102741579,@102747336,@102744330,@102742686,@102745135,@102749788,@102732961,@102733083,@102734213,@102734676,@102736249,@102737649,@102738443,@102742472,@102736445,@102737048,@102748934,@102738554,@102739005,@102739198,@102739283,@102742519,@102743029,@102743256,@102745938,@102750179,@102747790 GT:AB:AD:DP:GQ:PL:RNC:SBPV:BCSQ 1|1:.:0,55:55:99:1974,160,0:..:0.5303:1,.,0

I also extracted the per samples consequence:

13 102732665 KIEL_PSO00156_0212925429 *missense|CCDC168|ENST00000322527|protein_coding|-|6011L>6011P|102732665A>G @102738443

Therefore, it looks like 1,.,0 is interpreted as consequence 32 and consequence 1, but I do not understand why there is the "." at all and why it is interpreted as a number.

Best Mareike

pd3 commented 3 years ago

Consequences for each haplotype are recorded in a per-sample FORMAT tag as a bitmask of indexes into the list of consequences recorded in the INFO tag. The bitmask interleaves each haplotype so that when stored in BCF (binary VCF) format, only 8 bits per sample are required for most sites. The bitmask can be translated into a human readable form using the BCFtools/query command.

Please see the TBCSQ key to the query command and more documentation here: http://samtools.github.io/bcftools/howtos/csq-calling.html http://samtools.github.io/bcftools/bcftools.html#csq https://academic.oup.com/bioinformatics/article/33/13/2037/3000373

If anything is unclear please let us know.

MWendorff commented 3 years ago

Hi Petr Thanks for your response. But unfortunately it does not help me. I was aware of the TBCSQ. But I am still confused by the ncsq field containing a ".".

We already digged deaper into the BCSQ files and in normal cases (all fields contain integers) the parsing of the single ncsq fields is clear.

But I do not understand at all why the "." in a csq file might be correct. For me it looks like a bug. Why is there a mixture of numbers with "." and how to translate '.' into an integer?

pd3 commented 3 years ago

Sorry, I misunderstood your question. This can be an error in assigning a bitmask (an integer value) equal to internal representation for the missing value (also an integer value).

Any chance you could provide a small test case to reproduce the problem?

pd3 commented 3 years ago

This could be related to this bug https://github.com/samtools/bcftools/issues/1429 which was just fixed. Can you please that the commit d1f7494 fixes your problem as well?

pd3 commented 3 years ago

Thank you for the test case. The problem was caused by incorrect handling of the --ncsq parameter, reserved BCF values were not considered. The problem should be fixed now.

Sadly, note that in files produced prior to this fix, any FMT/BCSQ fields involving multiple values might have been compromised by this bug.

MWendorff commented 3 years ago

Hi Petr, Thanks, a lot for your support! With the new version the test case seems to run without any problems. I am now looking forward to analyse the results further ;-)

Best Mareike