At the end of the regenotyping pipeline, it would be very easy to dump the following:
Some kind of summary/signature of all the SNPs/indels in the VCF (might just be an md5).
For each sample, a JSON with two entries: a bitfield and an integer array, each as long as the VCF has records (i.e. one bit/integer per record). In these we put:
For each record, set the bit to 1 if the genotype is either ./. or het.
For each record, set the integer to the (haploid) genotype.
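A rough sketch of what that dump could look like, in Python. Everything named here is made up for illustration: `site_signature`, `encode_sample`, the JSON keys `mask` and `genotype`, storing the bitfield as a string of 0/1 characters, and using -1 as the integer for records that are ./. or het (the note above does not say what to store there). It also assumes the genotypes arrive as plain GT strings such as `0/0` or `./.`.

```python
import hashlib
import json


def site_signature(sites):
    """MD5 over (chrom, pos, ref, alt) of every record, in VCF order.

    `sites` is a hypothetical iterable of 4-tuples; any stable summary
    of the records would do just as well.
    """
    h = hashlib.md5()
    for chrom, pos, ref, alt in sites:
        h.update(f"{chrom}\t{pos}\t{ref}\t{alt}\n".encode())
    return h.hexdigest()


def encode_sample(gts):
    """Turn one sample's GT strings into the two per-record entries."""
    mask_bits = []
    calls = []
    for gt in gts:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or len(set(alleles)) > 1:
            # Missing (./.) or het: flag the record and store a placeholder.
            mask_bits.append("1")
            calls.append(-1)  # assumption: -1 marks "no usable call"
        else:
            # Homozygous (or haploid) call: store the single allele index.
            mask_bits.append("0")
            calls.append(int(alleles[0]))
    return {"mask": "".join(mask_bits), "genotype": calls}


# One JSON document per sample, written at the end of regenotyping.
sample_json = json.dumps(encode_sample(["0/0", "1/1", "0/1", "./.", "2/2"]))
```

Storing the signature next to each sample's JSON would let a later step check that all samples were genotyped against the same list of records before comparing them.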
Once these are stored at the end of regenotyping, we can just "cat" all the bitfields (for ./. or het) and all the integer vectors, and then the distance measurement is trivial:
for i = 0 to number of samples-1
  for j = i+1 to number of samples-1
    dist[i][j] = 0
    for r = 0 to number of records-1
      if bitfield[i][r] == 0 and bitfield[j][r] == 0 (meaning neither sample is missing or het at record r)
        if intvector[i][r] != intvector[j][r]
          dist[i][j]++
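The same loop as a minimal Python sketch, reading the distance as a count of differing records between each pair of samples. The function name `pairwise_distances` and the input layout (one list of 0/1 flags and one list of integers per sample, all the same length) are assumptions for illustration, not an existing interface.

```python
from itertools import combinations


def pairwise_distances(masks, genotypes):
    """Count differing records for every pair of samples.

    masks[s][r] is 1 if sample s is ./. or het at record r, else 0.
    genotypes[s][r] is the haploid genotype of sample s at record r.
    Records flagged in either sample are skipped.
    """
    dist = {}
    for i, j in combinations(range(len(masks)), 2):
        d = 0
        for mi, mj, gi, gj in zip(masks[i], masks[j], genotypes[i], genotypes[j]):
            if mi == 0 and mj == 0 and gi != gj:
                d += 1
        dist[(i, j)] = d
    return dist


# Three samples, four records.
masks = [[0, 0, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0]]
genotypes = [[0, 1, 0, 2], [0, 0, 1, 2], [1, 1, 1, 0]]
print(pairwise_distances(masks, genotypes))
# {(0, 1): 1, (0, 2): 2, (1, 2): 2}
```

The work is one pass over the records per pair of samples, so it grows with (number of samples squared) times (number of records).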
Ought to be very fast.