samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Counting ALT allele instances #1584

Closed 7PintsOfCherryGarcia closed 1 year ago

7PintsOfCherryGarcia commented 1 year ago

For any given record, I want to obtain the ALT allele count per sample. So for a record such as:

GT:PL 0/0:0,9,42 0/1:20,0,20 1/1:50,12,0

I need to extract the following counts:

0 1 2

In essence I want to extract the genotype matrix.

Based on the vcf.h header example I know I can extract the genotypes from the FORMAT field with:

ngt = bcf_get_genotypes(hdr, rec, &gt_arr, &ngt_arr);

My problem is that I am not sure exactly with what information the gt_arr array is being filled with. I know I can get the index of the corresponding base with bcf_gt_allele(gt_arr[]) and that and index of 0 corresponds to REF. So the following code will count the number of ALT instances in nalt without distinguishing between ALT bases in ploidies > 2 and assuming missing data as REF:

        for ( i = 0; i < nsamples; i++) {
            ptr = gt_arr + i*max_ploidy;
            nalt = 0;
            for (j = 0; j < max_ploidy; j++) {
                if ( ptr[j]==bcf_int32_vector_end ) break;
                if ( bcf_gt_is_missing(ptr[j]) ) continue;
                nalt += !!bcf_gt_allele(ptr[j]);
            }
        }

Is there any other way of extracting ALT counts I am not aware of? Thanks for any help.

pd3 commented 1 year ago

If you want to parse the FORMAT/GT field and extract the genotypes, that's exactly how it should be done.

The meaning of the gt_arr values can be understood by looking at these macros https://github.com/samtools/htslib/blob/226c1a813bc5d0582f7e0b0bdb4b3ea9e3ee4ce4/htslib/vcf.h#L892-L900 and the BCF part of the VCF specification: it's just the allele indices shifted by one to be able to mark phased genotypes.