harrispopgen / mutyper

Ancestral k-mer mutation types for SNP data
https://harrispopgen.github.io/mutyper/
MIT License
7 stars 3 forks source link

more mutyper counts in spectra than expected #30

Closed mrvollger closed 2 years ago

mrvollger commented 2 years ago

Hi Will,

I have a test vcf that when run generates more counts than I would expect. And I have hosted the vcf here so you can look at it: https://eichlerlab.gs.washington.edu/help/mvollger/share/mutyper-example/test.vcf

Looking at this vcf we can see that there are ~2000 variants (and it is a single sample VCF):

grep -v "^#" ~/public_html/share/mutyper-example/test.vcf | grep "1|.\|.|1" | cut -f 10 | sort | uniq -c

   29 0|1:0,1
    783 0|1:1,1
     28 1|0:0,1
      1 .|1:0,1
    769 1|0:1,1
      1 1|1:0,.
    258 1|1:0,2

However if I then count the number of variants in the spectra matrix I get > 20,000, but I would expect it to be at most ~2,000.

mutyper spectra ~/public_html/share/mutyper-example/test.vcf | tail -n +2 | sed 's/\t/\n/g' | grep -v "^HG\|^$" | datamash sum 1

20444

Can you tell if there is anything wrong with the VCF I have provided?

Thanks! Mitchell

mrvollger commented 2 years ago

Hi Will,

I think I am have some idea of what is happening. In this line the count is appended to using the gt_types object. https://github.com/harrispopgen/mutyper/blob/49ee4cce3a5c2b21155103fe063ec7ad4dad2203/mutyper/cli.py#L265 Which made sense to me until I looked at the cyvcf2 docs where they have this statment:

# gt_types is array of 0,1,2,3==HOM_REF, HET, UNKNOWN, HOM_ALT

So basically I think every unknown genotype is adding two to the mutation type and every homozygous alt is adding three instead of two.

I think I will work on a PR, unless I am missing some logic and this is the intended affect?

mrvollger commented 2 years ago

Oh I see you have gts012=True which according to the docs:

gts012 (bool) – if True, then gt_types will be 0=HOM_REF, 1=HET, 2=HOM_ALT, 3=UNKNOWN. If False, 3, 2 are flipped.

But I still think this will add 3 instead of zero if it is unknown which doesn't seem right.

mrvollger commented 2 years ago

Hi @WSDeWitt,

So I made a PR that I think fixes this if you get a chance to take a look. https://github.com/harrispopgen/mutyper/pull/31

Thanks, Mitchell

wsdewitt commented 2 years ago

Hi @mrvollger, nice catch and thanks for the PR. I agree that counting missing genotypes as 3 isn't a useful/expected behavior! However, I'm also a little uncertain that we want to count them as 0. I could imagine cases where samples differ drastically in callability patterns, so if we include variants that are uncalled in some samples, that could drive spurious mutation spectrum variation. For the mutyper ksfs command, missing genotypes raise an exception, since the allele frequency becomes undetermined (but see #12 for a possible alternative). I think the most conservative thing to do here would be to insist that the input VCF data has no missing genotypes, or at least log a warning if they are detected. What do you think?

BTW for filtering missing genotypes from VCF input to mutyper I've been doing:

bcftools view -g ^miss test.vcf | mutyper ...
mrvollger commented 2 years ago

I think a single warning on the first instance of missing data makes sense. Being more strict than that doesn't make too much sense to me since for any VCF with 1,000s of samples you might lose the vast majority of your data if you required every sample for every call to be present. Just looking at some data I have on hand based on 50 samples, if I require 100% of the samples to be genotyped then I eliminate 20% of the data compared to when I require 90% of the samples to be genotyped.

mrvollger commented 2 years ago

I updated the PR to spit out a warning the first time.

mrvollger commented 2 years ago

Would you mind cutting a new minor release for this? I'd like to update a pipeline. Thanks!

wsdewitt commented 2 years ago

Definitely! I'm going to add a unit tests for counts first

mrvollger commented 2 years ago

Hi Will,

I just compared the version on 0.6.1 of mutyper and it is still giving higher counts than my originally proposed PR: (https://github.com/mrvollger/mutyper/blob/72b00c05063b59d9d98278cee1781120813922bf/mutyper/cli.py#L262-L279)

The version on master/0.6.1:

mutyper spectra ../../test.vcf | tail -n +2 | sed 's/\t/\n/g' | grep -v "^HG\|^$" | datamash sum 1
[WARNING][Time elapsed (ms) 4]: Ambiguous genotypes found! Continuing by assuming reference genotypes for these variants.
4490

Version of the code with my PR

pip install git+https://github.com/mrvollger/mutyper.git && mutyper spectra test.vcf | tail -n +2 | sed 's/\t/\n/g' | grep -v "^HG\|^$" | datamash sum 1   
[WARNING][Time elapsed (ms) 4]: Ambiguous genotypes found! Continuing by assuming reference genotypes for these variants.
2135

Same test vcf as above, which I think we agreed should have about 2000 variants.

I am not sure what this issue is, but I wonder if there is some other cyvcf issue.

mrvollger commented 2 years ago

I am running a few more tests to make sure it is not my version that is buggy. Not sure yet.

wsdewitt commented 2 years ago

Hi Mitchell,

I just tried replacing the unphased genotypes (/) in our unit test input file with phased genotypes (|). I now get errors from pytest!

Now if we look into the genotype representations, the phase state is indicated with a boolean. So for the current unphased test file we have:

>>> vcf = cyvcf2.VCF("tests/test_data/snps.missing_gts.vcf")
>>> for variant in vcf:
...     print(variant.genotypes)
... 
[[0, 0, False], [1, 0, False]]
[[0, 1, False], [0, 0, False]]
[[1, -1, False], [1, 0, False]]
[[-1, 1, False], [0, 0, False]]
[[-1, -1, False], [1, 0, False]]

Whereas, if I replace with phased calls we get:

>>> vcf = cyvcf2.VCF("tests/test_data/snps.missing_gts.vcf")
>>> for variant in vcf:
...     print(variant.genotypes)
... 
[[0, 0, True], [1, 0, True]]
[[0, 1, True], [0, 0, True]]
[[1, -1, True], [1, 0, True]]
[[-1, 1, True], [0, 0, True]]
[[-1, -1, True], [1, 0, True]]

Now the trouble comes from this line: https://github.com/harrispopgen/mutyper/blob/74deb32cac5a072ea8901228fad341128cd7f456/mutyper/cli.py#L275 The logic here is to count the number of times we see 1 in the genotype representation. Of course, in python True == 1 holds, so for phased data we are also counting the phasing indicator! Gah!

wsdewitt commented 2 years ago

Can you try branch 30-phasing on your file?

mrvollger commented 2 years ago

Hi Will,

thanks for making this PR so quick! I tried on my file and now it matches my expected # of counts!

pip install git+https://github.com/harrispopgen/mutyper.git@30-phasing    && mutyper spectra test.vcf | tail -n +2 | sed 's/\t/\n/g' | grep -v "^HG\|^$" | datamash sum 1
...
[WARNING][Time elapsed (ms) 2]: Ambiguous genotypes found! Continuing by assuming reference genotypes for these variants.
2135

So I think I can approve the PR now, but I will look over a little more.

Thanks again!