Closed anderspitman closed 1 week ago
The mpileup
program is complex and does lot of things. What specifically are you interested in? For example, the calculation of genotype likelihoods and BAQ calculation is described in http://samtools.github.io/bcftools/samtools.pdf
From a high level, a solid step for us would be to be able to follow the basic calling instructions here: https://samtools.github.io/bcftools/howtos/variant-calling.html
But using the output of our GPU mpileup engine instead of bcftools mpileup
.
In order to accomplish that, I need to understand the output of mpileup. Most of the heavy lifting seems to be done by the I16 field, so I need to understand that at least. But there's probably more.
bam2bcf.h shows:
// The fields are:
// depth fwd .. ref (0) and non-ref (2)
// depth rev .. ref (1) and non-ref (3)
// baseQ .. ref (4) and non-ref (6)
// baseQ^2 .. ref (5) and non-ref (7)
// mapQ .. ref (8) and non-ref (10)
// mapQ^2 .. ref (9) and non-ref (11)
// minDist .. ref (12) and non-ref (14)
// minDist^2 .. ref (13) and non-ref (15)
Is baseQ the sequencer quality, and mapQ the BAQ as described in your link? I don't have a guess for minDist...
It's possible I'm on a fool's errand. Maybe mpileup is too tightly coupled to bcftools for it to make sense to try and extract it for an independent implementation?
All the answers can be found in the code. For example, minDist is used here https://github.com/samtools/bcftools/blob/5d57c7b32a089fa4f3676f032198351a3ed7dbca/bam2bcf.c#L296
I don't know enough about your project to help you decide if it's worthwhile or not. Certainly the code is complex and reimplementing it will not be an easy task.
Thank you!
We're working on a GPU-accelerated mpileup implementation. We have a generic algorithm working, and now we want to do actual variant calling on the results.
As an intermediate step, we would like to generate a VCF in the same format as
bcftools mpileup
, that we can pipe tobcftools call
. This will let us leverage bcftools to verify our pileup engine.I'm having trouble finding detailed information about the algorithms and data format output by
bcftools mpileup
. The output VCF header pointed me to the source code, but after spending a couple hours studying it, I think it would take quite a bit of time before I'm confident I understand it. I also spent some time searching and skimming papers and no luck so far.Are there any other resources you could point me to?