samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
656 stars 240 forks source link

Bcf info field documentation is confusing #462

Closed jkbonfield closed 8 years ago

jkbonfield commented 8 years ago

Generating a bcf file from bcftools inserts I16 field. The header states:

##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">

The code it refers to is:

typedef struct {
    uint32_t ori_depth;
    unsigned int mq0;
    int32_t *ADF, *ADR;
    float qsum[4];
    // The fields are:                                                                            
    //      depth fwd   .. ref (0) and non-ref (2)                                                
    //      depth rev   .. ref (1) and non-ref (3)                                                
    //      baseQ       .. ref (4) and non-ref (6)                                                
    //      baseQ^2     .. ref (5) and non-ref (7)                                                
    //      mapQ        .. ref (8) and non-ref (10)                                               
    //      mapQ^2      .. ref (9) and non-ref (11)                                               
    //      minDist     .. ref (12) and non-ref (14)                                              
    //      minDist^2   .. ref (13) and non-ref (15)                                              
    // Note that this probably needs a more thorough fix: int types in                            
    // bcf_call_t do overflow with high-coverage data, such as exomes, and                        
    // BCFv2 supports only floats which may not suffice.                                          
    double anno[16];
    float p[25];        // phred-scaled likelihood of each genotype                               
} bcf_callret1_t;

These appear to be filled out here:

        r->anno[1<<2|is_diff<<1|0] += baseQ;
        r->anno[1<<2|is_diff<<1|1] += baseQ * baseQ;
        r->anno[2<<2|is_diff<<1|0] += mapQ;
        r->anno[2<<2|is_diff<<1|1] += mapQ * mapQ;
        r->anno[3<<2|is_diff<<1|0] += min_dist;
        r->anno[3<<2|is_diff<<1|1] += min_dist * min_dist;

Yet looking at the lines, this just doesn't make sense.

ref(5) should be ref(4)^2 and ref(9) should be ref(8)^2. Yet an example VCF entry is:

1       10304   .       A       <*>     0       .       DP=2;I16=0,2,0,0,72,2592,0,0,25,325,0,0,50,1250,0,0;QS=1,0;MQ0F=0       PL      0,6,23

Here baseQ is 72, yet 72^2 is not 2592. mapQ is 25, but 25^2 is not 325. How are these computed? It looks like they should be correct based on the code, but I assume something else is subsequently modifying these values.

(I'm only interested because one of my speedups has modified one of these by 1, possibly just a minor rounding difference, but I'm trying to understand if it matters.)

Ideally these should be in the bcftools man page too. Asking people to read the code is rather unfriendly. Does anything downstream read these fields, or are they simply debugging data?

Edit: I spotted code in ccall.c and mcall.c that uses I16 (anno16[]), but only the first 4 elements.

pd3 commented 8 years ago

Hi James, these fields are internal to bcftools calling and may change, so I guess that's the reason why they weren't properly documented. All the fields are sums, including Q^2, so the values do make sense: in your example there are two reads. If both have base quality of 36, then Qsum=72 and Q^2sum=2592.