Closed ab08028 closed 3 years ago
My suspicion is that cyvcf2 encodes haploid gt_types
in a way we didn't expect. I think the mutyper spectra
command should work on haploids if gt_types
is encoded as 0
-->REF
, 1
-->ALT
(but maybe it's not).
Can you print a couple variants' gt_types
field with something like this?
from cyvcf2 import VCF
for variant in VCF('haploids.vcf.gz'):
print(variant.gt_types)
break
oh interesting … so I did the above code and the results are coded in 0s and 3s for some reason… example: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0]
But in the original VCF and in the mutyper variants VCF the sites are coded with 0s and 1s. But the 1s are 3s now when cyvcf reads it in. From this page https://brentp.github.io/cyvcf2/ it looks like cyvcf uses 0,1,2,3 to refer to hom-ref, het, unknown, hom-alt? so if it converts the haploids 0 and 1s to hom-ref and hom-alt codes, then that would make sense why it’s counting them 2x?
It looks like this may cause issues with --randomize as well (randomize isn’t working with the haploid data, though mutyper spectra works fine without it)
I threw an example haploid mutyper variants file into the harris data dir (forWillFromAnnabel_exampleHaploidMutyperVariantsFile/testFileForWill.vcf.gz)
Hmm that looks like probably it! But shouldn't this give 3x inflation of alt allele counts, rather than 2x?
I would have guessed --randomize
would behave normally even with this 0/3 coding.
Thanks for the test file, I'll take a look.
My hunch is that internally it codes 3 (hom alt) as equaling 2 counts of the alt allele. Yeah not sure why randomize is affected by it either ... will copy the error message I get for randomize momentarily
hmm okay --randomize is working on my test file actually, so scratch that part of the issue!
Minor bug (?) -- when I run mutyper spectra on a mutyper variants file that has haploid calls (single 0s or 1s instead of 0/0) the spectra counts are 2x what they should be. I think it's maybe still counting them like diploids? Not a huge deal (could just divide by 2!) but probably easy to fix?