harrispopgen / mutyper

Ancestral k-mer mutation types for SNP data
https://harrispopgen.github.io/mutyper/
MIT License
7 stars 3 forks source link

mutyper variants converts missing genotypes to 0 #36

Closed Lukez-pi closed 1 year ago

Lukez-pi commented 1 year ago

Hi,

I was using the variants command to annotate SNPs with their mutation types. In the case where the ancestral allele maps to the ALT allele, mutyper is supposed to flip the genotypes (0->1, 1->0) for that SNP in the VCF file. However I noticed that missing genotypes (.) will always get converted to 0 during this process. To illustrate the problem, I have the following fasta file:

>foo:chr1
AAACCCgggTTT

And the following VCF file

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248387328>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency in genotypes">
##INFO=<ID=mutation_type,Number=1,Type=Character,Description="ancestral 3-mer mutation type">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1 sample2
chr1    2       .       A       T       30      PASS    AN=3;AC=2;AF=0.67;mutation_type=AAG>ATG GT      1|.     1/0
chr1    3       .       C       A       30      PASS    AN=3;AC=1;AF=0.33;mutation_type=ACT>AAT GT      .|1     0/0
chr1    4       .       T       C       30      PASS    AN=2;AC=1;AF=0.50;mutation_type=ATT>ACT GT      .|.     1/0

This is the output (with header removed) after processing the VCF with mutyper variants --chrom_pos 1 $fasta_file $vcf_file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1 sample2
chr1    2       .       A       T       30      PASS    AN=3;AC=2;AF=0.67;mutation_type=AAA>ATA GT      1|.     1/0
chr1    3       .       A       C       30      PASS    AN=3;AC=2;AF=0.666667;mutation_type=AAC>ACC     GT      0|0     1/1
chr1    4       .       C       T       30      PASS    AN=2;AC=1;AF=0.5;mutation_type=ACC>ATC  GT      0|0     0/1

As you can see, both variants at position 3 and 4 had their REF and ALT allele flipped, and the missing genotype . was converted to a 0. I suspect the culprit is this line:

variant.genotypes = [
                    [int(not gt[0]), int(not gt[1]), gt[2]] for gt in variant.genotypes
                ]

cyvcf2 represents the missing genotype with -1, and as such, int(not -1)=0 .