samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
680 stars 240 forks source link

Trying to understand mpileup output format #2308

Closed anderspitman closed 1 week ago

anderspitman commented 1 month ago

We're working on a GPU-accelerated mpileup implementation. We have a generic algorithm working, and now we want to do actual variant calling on the results.

As an intermediate step, we would like to generate a VCF in the same format as bcftools mpileup, that we can pipe to bcftools call. This will let us leverage bcftools to verify our pileup engine.

I'm having trouble finding detailed information about the algorithms and data format output by bcftools mpileup. The output VCF header pointed me to the source code, but after spending a couple hours studying it, I think it would take quite a bit of time before I'm confident I understand it. I also spent some time searching and skimming papers and no luck so far.

Are there any other resources you could point me to?

pd3 commented 3 weeks ago

The mpileup program is complex and does lot of things. What specifically are you interested in? For example, the calculation of genotype likelihoods and BAQ calculation is described in http://samtools.github.io/bcftools/samtools.pdf

anderspitman commented 3 weeks ago

From a high level, a solid step for us would be to be able to follow the basic calling instructions here: https://samtools.github.io/bcftools/howtos/variant-calling.html

But using the output of our GPU mpileup engine instead of bcftools mpileup.

In order to accomplish that, I need to understand the output of mpileup. Most of the heavy lifting seems to be done by the I16 field, so I need to understand that at least. But there's probably more.

bam2bcf.h shows:

// The fields are:
//      depth fwd   .. ref (0) and non-ref (2)
//      depth rev   .. ref (1) and non-ref (3)
//      baseQ       .. ref (4) and non-ref (6)
//      baseQ^2     .. ref (5) and non-ref (7)
//      mapQ        .. ref (8) and non-ref (10)
//      mapQ^2      .. ref (9) and non-ref (11)
//      minDist     .. ref (12) and non-ref (14)
//      minDist^2   .. ref (13) and non-ref (15)

Is baseQ the sequencer quality, and mapQ the BAQ as described in your link? I don't have a guess for minDist...

It's possible I'm on a fool's errand. Maybe mpileup is too tightly coupled to bcftools for it to make sense to try and extract it for an independent implementation?

pd3 commented 2 weeks ago

All the answers can be found in the code. For example, minDist is used here https://github.com/samtools/bcftools/blob/5d57c7b32a089fa4f3676f032198351a3ed7dbca/bam2bcf.c#L296

I don't know enough about your project to help you decide if it's worthwhile or not. Certainly the code is complex and reimplementing it will not be an easy task.

anderspitman commented 1 week ago

Thank you!