harrispopgen / mutyper

Ancestral k-mer mutation types for SNP data
https://harrispopgen.github.io/mutyper/
MIT License
7 stars 3 forks source link

haploid counting in mutyper spectra #21

Closed ab08028 closed 3 years ago

ab08028 commented 3 years ago

Minor bug (?) -- when I run mutyper spectra on a mutyper variants file that has haploid calls (single 0s or 1s instead of 0/0) the spectra counts are 2x what they should be. I think it's maybe still counting them like diploids? Not a huge deal (could just divide by 2!) but probably easy to fix?

wsdewitt commented 3 years ago

My suspicion is that cyvcf2 encodes haploid gt_types in a way we didn't expect. I think the mutyper spectra command should work on haploids if gt_types is encoded as 0-->REF, 1-->ALT (but maybe it's not).

Can you print a couple variants' gt_types field with something like this?

from cyvcf2 import VCF

for variant in VCF('haploids.vcf.gz'):
    print(variant.gt_types)
    break
ab08028 commented 3 years ago

oh interesting … so I did the above code and the results are coded in 0s and 3s for some reason… example: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0]

But in the original VCF and in the mutyper variants VCF the sites are coded with 0s and 1s. But the 1s are 3s now when cyvcf reads it in. From this page https://brentp.github.io/cyvcf2/ it looks like cyvcf uses 0,1,2,3 to refer to hom-ref, het, unknown, hom-alt? so if it converts the haploids 0 and 1s to hom-ref and hom-alt codes, then that would make sense why it’s counting them 2x?

It looks like this may cause issues with --randomize as well (randomize isn’t working with the haploid data, though mutyper spectra works fine without it)

ab08028 commented 3 years ago

I threw an example haploid mutyper variants file into the harris data dir (forWillFromAnnabel_exampleHaploidMutyperVariantsFile/testFileForWill.vcf.gz)

wsdewitt commented 3 years ago

Hmm that looks like probably it! But shouldn't this give 3x inflation of alt allele counts, rather than 2x?

I would have guessed --randomize would behave normally even with this 0/3 coding.

Thanks for the test file, I'll take a look.

ab08028 commented 3 years ago

My hunch is that internally it codes 3 (hom alt) as equaling 2 counts of the alt allele. Yeah not sure why randomize is affected by it either ... will copy the error message I get for randomize momentarily

ab08028 commented 3 years ago

hmm okay --randomize is working on my test file actually, so scratch that part of the issue!