knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

Variants with MAF of 0 after removing monomorphics #192

Closed Neato-Nick closed 2 years ago

Neato-Nick commented 2 years ago

Hi Brian!

I've got a strange problem I've just been banging my head against the wall about. I have a VCF that I believe I've removed monomorphic variants from. But when I check maf(), there are loads of variants with NA, Count, and Frequency all =0. So, out of 30,545 polymorphic variants, 7,845 have no alternate allele counted in the population. By definition, wouldn't these be monomorphic? I noticed write.vcf doesn't recode the INFO column and I was hoping that was the issue, but then I went through is.polymorphic and noted that it does directly check genotypes.

I was guessing it was caused by having a mix of phased and unphased variants so I messed with pinf_sc50 at monomorphic loci but couldn't reproduce the error. So to come up with a minimal example, I thinned my VCF to just two loci and am attaching it. All the genotypes at Phyram_PR-102_s0001:318468 are either 1/1 or 1|1, while Phyram_PR-102_s0001_318477 is truly polymorphic. VCF file: test.twovars.fill-an-ac.vcf.txt

In the code below, I expect my subsetting with is.polymorphic to remove all variants with a minor allele frequency of 0. For pinf_sc50, it works as expected, but for the data attached it does not. In my VCF, I re-calculated AC and AN in the INFO column with vcftools, it shows AC=AN=320. For a population of 160 diploids, that means they all have the alternate allele and therefore to me, the variant is monomorphic. But, if you have a different philosophy I get that! As a workaround, I think I just could filter variants out on the MAF table if they have a frequency of 0. I am concerned that this is somehow a symptom of another issue.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf_minor <- data.frame(maf(vcf, element = 2))
nrow(filter(vcf_minor, Frequency == 0))
#> [1] 243
vcf_poly <- vcf[is.polymorphic(vcf)]
vcf_poly_minor <- data.frame(maf(vcf_poly))
nrow(filter(vcf_poly_minor, Frequency == 0))
#> [1] 0

# Selected only two variants, one is monomorphic and one is polymorphic
vcf_twovars <- read.vcfR("data/generated_data/test.twovars.fill-an-ac.vcf",
                         verbose = FALSE)
vcf_twovars_minor <- data.frame(maf(vcf_twovars, element = 2))
nrow(filter(vcf_twovars_minor, Frequency == 0))
#> [1] 1
vcf_twovars_poly <- vcf_twovars[is.polymorphic(vcf_twovars)]
vcf_twovars_poly_minor <- data.frame(maf(vcf_twovars_poly))
nrow(filter(vcf_twovars_poly_minor, Frequency == 0))
#> [1] 1

Created on 2021-09-22 by the reprex package (v2.0.0)

knausb commented 2 years ago

Hi @Neato-Nick ,

I'm not sure I follow your example, but I think the issue is in how you're using is.polymorphic().

# vcf[ is.polymorphic(vcf) ]
# Change to the following.
vcf[ is.polymorphic(vcf, na.omit = TRUE), ]

Without this difference the resulting gt slot is all NA. The is.polymorphic() function should return a vector that is the same length as the number of rows in the gt slot, which should be the number of variants. Without the comma to specify rows I suspect the vector may be recycled to populate the whole matrix? But I really don't know. The default na.omit=FALSE is there to remind you that you have missing data and by changing it to TRUE you are trying to manage this missing data.

Does this get the example closer to what you expect?

Neato-Nick commented 2 years ago

Ah, adding na.omit=TRUE and that comma does resolve some puzzling rows that were popping up in my maf output! But, it did not fix the issue. That pesky variant is still showing up despite having a minor allele frequency of 0. Here's the code for your is.polymorphic. Since "1/1" != "1|1", do you think test.poly would always indicate a variant with mixed phasing is polymorphic? I'm trying to wrap my head around it.

function (x, na.omit = FALSE) 
{
  if (!inherits(x, "vcfR")) {
    stop("Expected an object of class vcfR")
  }
  x <- extract.gt(x)
  test.poly <- function(x, na.omit = na.omit) {
    if (na.omit == TRUE) {
      x <- stats::na.omit(x)
    }
    sum(x[1] == x[-1]) < (length(x) - 1)
  }
  apply(x, MARGIN = 1, test.poly, na.omit = na.omit)
}
knausb commented 2 years ago

Hi @Neato-Nick ,

I think you've interpreted the situation correctly.

# Define function
test.poly <- function(x, na.omit = na.omit) {
  if (na.omit == TRUE) {
    x <- stats::na.omit(x)
  }
  sum(x[1] == x[-1]) < (length(x) - 1)
}
# Test function
test.poly(c("1/1", "1/1"), na.omit = TRUE)
#> [1] FALSE
test.poly(c("1|1", "1|1"), na.omit = TRUE)
#> [1] FALSE
test.poly(c("1/1", "1|1"), na.omit = TRUE)
#> [1] TRUE

Created on 2021-09-24 by the reprex package (v2.0.1)

As you stated, "1/1" != "1|1", I can see justification for this, perhaps strict, definition. And I can see room for disagreement as well. But separating the genotypes into alleles comes with a performance cost, and I'll usually vote to try to avoid that.

Perhaps this is not the issue you're searching for here? I feel you may be confusing two topics: whether a locus is polymorphic and what is the minor allele frequency. The allele frequency is independent of how these alleles may be organized into genotypes. For example, if a locus is fixed for a heterozygous genotype (say "0/1") the locus would not be polymorphic, but the minor allele frequency is equal to the major allele frequency (which may create other issues because the distinction between major and minor is arbitrary here, but I think that's another topic). Does this get you closer to where you'd like to be?

Neato-Nick commented 2 years ago

This is a great example of test.poly, thank you! Yep I totally agree, I think I was stuck because I couldn't wrap my head around the philosophy that a locus can be polymorphic despite only observing one allele in the population. As you said, it requires a strict definition of polymorphism.

I think it is a little bit different than your philosophical discussion of handling fixed heterozygous genotypes (https://github.com/knausb/vcfR/issues/178 was an interesting read), because this seems more like a technical distinction than biological. Biologically, 1/1 == 1|1. But technically, 1/1 != 1|1. We generally want the technology to be as close to biology as possible, though I think it's fair for our technology downstream to represent the ambiguity in GATK's technology upstream.

I don't want these cases to inflate my SNP count of a population. But also, removing monomorphic sites wouldn't really affect inferences of population diversity right? I don't think Gst would be effected. And the invariant sites don't add to the genetic distance between any samples. It does affect the denominator of hamming distance because that's a proportion of (n_genotypes_different / n_loci_in_vcf), but all the populations are plotted on the same scale anyway.

When I opened the issue, I didn't mean to propose this as a bug anyway so I'm happy with this discussion :)

knausb commented 2 years ago

I think this thread has wandered quite a bit from the original post. And this appears to be turning into a larger conversation. So I'm going to close this if that's fine with you.

Neato-Nick commented 2 years ago

Yep, fine with me. Moving forward I'll just use the MAF table to filter on mixed-phased variants output by GATK instead of using is.polymorphic.