iqbal-lab-org / minos

Variant call adjudication
MIT License

Dump binary encoded genotypes after regenotyping #98

Open iqbal-lab opened 4 years ago

iqbal-lab commented 4 years ago

At the end of the regenotyping pipeline, it would be very easy to dump the following:

  1. Some kind of summary/signature of all the SNPs/indels in the VCF (might just be an MD5)
  2. For each sample, a JSON with two entries. One is a bitfield and one an integer array, each as long as the VCF has records (i.e. one bit/integer per record). In these we put:
    • for each record, set bit to 1 if genotype is either ./. or het
    • for each record, set integer to the (haploid) genotype. Once stored at the end of regenotyping, this will make distance measuring trivial
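
A minimal Python sketch of what such a dump could look like. None of this is existing minos code: the function names, the JSON keys `mask` and `genotypes`, storing the bitfield as a string of `0`/`1` characters, and using `-1` as a placeholder genotype for masked records are all assumptions made for illustration.

```python
import hashlib
import json


def vcf_signature(records):
    """Return an MD5 hex digest over (chrom, pos, ref, alt) of every record.

    `records` is an iterable of 4-tuples in VCF order (hypothetical input
    format; a real implementation would read them from the VCF itself).
    """
    md5 = hashlib.md5()
    for chrom, pos, ref, alt in records:
        md5.update(f"{chrom}\t{pos}\t{ref}\t{alt}\n".encode())
    return md5.hexdigest()


def encode_sample(genotypes):
    """Encode one sample's GT strings, one per VCF record.

    Returns a dict with:
      mask      - string of '0'/'1'; '1' means the call is ./. or het
      genotypes - list of ints; the haploid allele index, or -1 if masked
    """
    mask = []
    calls = []
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or len(set(alleles)) != 1:
            # Missing (./.) or heterozygous: flag it and store a placeholder.
            mask.append("1")
            calls.append(-1)
        else:
            mask.append("0")
            calls.append(int(alleles[0]))
    return {"mask": "".join(mask), "genotypes": calls}


records = [("ref", 10, "A", "G"), ("ref", 25, "CT", "C"), ("ref", 40, "G", "T")]
signature = vcf_signature(records)
sample = encode_sample(["1/1", "./.", "0/1"])
print(json.dumps(sample))  # {"mask": "011", "genotypes": [1, -1, -1]}
```

Storing the signature alongside each sample's JSON would let a later step check that two samples were encoded against the same set of records before comparing them.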

Then at the end we can just "cat" all the bitarrays for ./. or het, and cat all the intvectors, and then the distance measurement is trivial:

    dist = 0
    for i = 0 to number of records - 1
      for j = i to number of records - 1
        if bitfield[i] == bitfield[j] == 0 (meaning it is neither missing nor het)
          if intvector[i] != intvector[j]
            dist++
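
One way to read the pseudocode is as a comparison between two samples at each record: a record counts towards the distance only if neither sample is masked there and their genotypes differ. The Python sketch below follows that reading. It is an interpretation, not code from minos, and it assumes a hypothetical per-sample layout: a dict with a `mask` string of `0`/`1` characters (`1` meaning ./. or het) and a `genotypes` list of integers.

```python
from itertools import combinations


def distance(a, b):
    """Count records where both samples are unmasked and genotypes differ."""
    if len(a["mask"]) != len(b["mask"]):
        raise ValueError("samples were encoded against different record sets")
    dist = 0
    for mask_a, mask_b, gt_a, gt_b in zip(
        a["mask"], b["mask"], a["genotypes"], b["genotypes"]
    ):
        # Skip the record if either sample is missing or het there.
        if mask_a == "0" and mask_b == "0" and gt_a != gt_b:
            dist += 1
    return dist


def pairwise_distances(samples):
    """Return {(name1, name2): distance} for every unordered pair of samples."""
    return {
        (n1, n2): distance(samples[n1], samples[n2])
        for n1, n2 in combinations(sorted(samples), 2)
    }


# Hypothetical encoded samples over four records; -1 marks a masked call.
samples = {
    "s1": {"mask": "0010", "genotypes": [1, 0, -1, 2]},
    "s2": {"mask": "0100", "genotypes": [0, -1, 1, 2]},
    "s3": {"mask": "0000", "genotypes": [1, 0, 1, 0]},
}
distances = pairwise_distances(samples)
print(distances)
```

With the arrays held as packed bits and fixed-width integers rather than Python strings and lists, the same per-record test could be vectorised, which is presumably where the speed would come from.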

Ought to be very fast.

iqbal-lab commented 4 years ago

Note these could be merged as the VCFs are merged.

iqbal-lab commented 4 years ago

This is just an idea for the future.

iqbal-lab commented 4 years ago

Maybe this is now redundant?