Closed 7PintsOfCherryGarcia closed 1 year ago
If you want to parse the FORMAT/GT field and extract the genotypes, that's exactly how it should be done.
The meaning of the gt_arr
values can be understood by looking at these macros https://github.com/samtools/htslib/blob/226c1a813bc5d0582f7e0b0bdb4b3ea9e3ee4ce4/htslib/vcf.h#L892-L900 and the BCF part of the VCF specification: it's just the allele indices shifted by one to be able to mark phased genotypes.
For any given record, I want to obtain the ALT allele count per sample. So for a record such as:
GT:PL 0/0:0,9,42 0/1:20,0,20 1/1:50,12,0
I need to extract the following counts:
0 1 2
In essence I want to extract the genotype matrix.
Based on the vcf.h header example I know I can extract the genotypes from the FORMAT field with:
ngt = bcf_get_genotypes(hdr, rec, >_arr, &ngt_arr);
My problem is that I am not sure exactly with what information the
gt_arr
array is being filled with. I know I can get the index of the corresponding base withbcf_gt_allele(gt_arr[])
and that and index of 0 corresponds to REF. So the following code will count the number of ALT instances innalt
without distinguishing between ALT bases in ploidies > 2 and assuming missing data as REF:Is there any other way of extracting ALT counts I am not aware of? Thanks for any help.