knausb / vcfR

Tools to work with variant call format files
246 stars 55 forks source link

is.polymorphic bug, returns FALSE for a sample of all hets? #178

Closed JimWhiting91 closed 3 years ago

JimWhiting91 commented 3 years ago

Hi,

I came across some unexpected behaviour whilst running a recent analysis. In my analysis, I'm using vcfR's is.polymorphic() command to tell me which SNPs in two population-subsampled VCFs are polymorphic within each of those populations. I'm then comparing those vectors to find private polymorphisms within each population.

However when doing this I've noticed that is.polymorphic returns FALSE for SNPs where all genotypes in a population are 0/1, which seems like a bug. I've included a VCF with a single SNP from my data that highlights this issue. Code to reproduce is:

library(vcfR)
test_vcf <- read.vcfR("~/Downloads/is.polymorphic_bug.vcf.gz")
extract.gt(test_vcf)
is.polymorphic(test_vcf)

is.polymorphic_bug.vcf.gz

Is there any reason why this would be happening and I've misunderstood something?

And here's my sessionInfo

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pegas_0.14        adegenet_2.1.3    ade4_1.7-16       ape_5.4-1         vcfR_1.12.0      
 [6] viridis_0.5.1     viridisLite_0.3.0 data.table_1.13.2 ggplot2_3.3.2     cowplot_1.1.0    

loaded via a namespace (and not attached):
 [1] splines_4.0.3      gtools_3.8.2       shiny_1.5.0        expm_0.999-5       memuse_4.1-0      
 [6] sp_1.4-4           LearnBayes_2.15.1  progress_1.2.2     pillar_1.4.7       lattice_0.20-41   
[11] glue_1.4.2         digest_0.6.27      promises_1.1.1     colorspace_2.0-0   htmltools_0.5.0   
[16] httpuv_1.5.4       Matrix_1.2-18      plyr_1.8.6         pkgconfig_2.0.3    raster_3.4-5      
[21] gmodels_2.18.1     purrr_0.3.4        xtable_1.8-4       scales_1.1.1       gdata_2.18.0      
[26] later_1.1.0.1      tibble_3.0.4       mgcv_1.8-33        generics_0.1.0     ellipsis_0.3.1    
[31] withr_2.3.0        magrittr_2.0.1     crayon_1.3.4       mime_0.9           deldir_0.2-3      
[36] nlme_3.1-149       MASS_7.3-53        class_7.3-17       vegan_2.5-7        tools_4.0.3       
[41] prettyunits_1.1.1  hms_0.5.3          lifecycle_0.2.0    stringr_1.4.0      munsell_0.5.0     
[46] cluster_2.1.0      packrat_0.5.0      compiler_4.0.3     e1071_1.7-4        rlang_0.4.9       
[51] classInt_0.4-3     units_0.6-7        grid_4.0.3         rstudioapi_0.13    igraph_1.2.6      
[56] boot_1.3-25        gtable_0.3.0       codetools_0.2-16   DBI_1.1.0          reshape2_1.4.4    
[61] R6_2.5.0           gridExtra_2.3      dplyr_1.0.2        pinfsc50_1.2.0     fastmap_1.0.1     
[66] seqinr_4.2-4       spdep_1.1-5        KernSmooth_2.23-17 permute_0.9-5      stringi_1.5.3     
[71] Rcpp_1.0.5         vctrs_0.3.5        sf_0.9-6           spData_0.3.8       tidyselect_1.1.0  
[76] coda_0.19-4     

Thanks for your work on this package!

Cheers, Jim

knausb commented 3 years ago

Hi @JimWhiting91 , I think we may have a difference in perspective here. If all genotypes are 0/1 then there is no polymorphism for genotypes. This is important because this variant may not contribute to an analysis where you are looking for genotypes that may not contribute to an analysis looking for differences among populations. This means you may omit it from analysis and may save yourself a little compute time.

Is there something specific you're trying to accomplish? Or have you just run into something you feel is odd? I consider this to be expected behavior. If you can provide me with a compelling argument for why alternate behavior may be more appropriate I'd be willing to entertain it.

Thanks! Brian

JimWhiting91 commented 3 years ago

Hi Brian,

Thanks for the quick response! I see what you're saying, but for me this would not be expected. I would regard a polymorphic site as any site in which more than one allele is observed within the population, so by the information given for the function I would expect these sites to stay in.

I think this is probably quite important for various analyses. As I said in the simple instances of wanting to know which sites are polymorphic vs invariant within a population, I think the general consensus would be that all het sites are polymorphic in the sense that both alleles are present in a population, or rather that an allele isn't absent for a specific reason (for instance when your vcf includes populations from different lineages and you expect some lineage-specific variants). These sites are also important for any assessment of diversity. The contribution of a perfectly heterozygous site towards pi or dxy (even if the site is perfectly heterozygous in both populations for dxy) is very different than the contribution of invariants (which as you say would be zero). For instance, a site at a frequency of 0.5 in pop1 and pop2 contributes equally towards dxy as a site that is fixed in one population and fully heterozygous in the other (eg. freq of 0 in pop1, freq of 0.5 in pop2). So if were to subset my vcf for a population, is.polymorphic at the moment will downwardly bias estimates of pi or dxy by removing fully heterozygous sites.

I would also argue that removing these sites when looking to compare populations for genetic differentiation would also be incorrect, even if the difference between populations is 0 (for eg. if we were just calculating the allele frequency change between two populations, both of which are perfectly heterozygous). Even though this value is 0, that 0 does still mean something, and that 0 is interesting in the context of "no change despite change being possible given the presence of both alleles". This is particularly true for studies of balancing selection for example where exaggerated heterozygosity is the focal signal of interest. I think these 0 values are therefore important to keep in, and are quite different to cases where both populations are invariant (for instance if a lineage-specific site exists in the VCF, but is not polymorphic in the focal lineage), where the context is "no change exists because the absence of the allele negates it".

I'm sure there are probably other reasons, but these are the sorts of the analyses I'm most familiar coming from a population genetics perspective. Ideally for me the behaviour of is.polymorphic would be to keep fully heterozygous sites by default, with an optional flag to remove these if desired for speed reasons.

Hopefully this is compelling, but if you've got any thoughts on the above I'd be happy to discuss further.

Cheers! Jim

knausb commented 3 years ago

Hi Jim,

I'm afraid I do not follow you. Perhaps some form of minimal reproducible example would help?

The reason I wrote this function was to help remove sites that are not polymorphic in terms of genotypes. Most analyses that create a distance matrix will treat these sites as uninformative so we may as well get rid of them. In this context I feel the function does demonstrate expected behavior.

I don't follow your comment on pi. Recall that VCF data only includes variable sites. Sites where all samples were 0/0 are not reported. So if you're really interested in allele frequencies, I feel you need to account for all of those alleles that were not reported in the VCF file. Correct? I think this applies to your comment about invariant sites being informative because there was a potential for change. VCF data does not include all sites so you will never have this information in a VCF file.

It sounds to me like you're looking for a way of finding sites that have more than 1 allele. I think it would be pretty easy to cook u a function that does that. I don't know how it will perform. You also may be able to address this with a combination of is.polymorphic() and is_het().

Thanks! Brian

JimWhiting91 commented 3 years ago

Hi Brian,

I can work on putting together something reproducible that highlights these.

I understand what you're saying about needing the monomorphic/invariant sites in order to accurately estimate pi and dxy. However the instances I was referring to are when people estimate pi/dxy for their polymorphic sites in their VCFs, sum these together and then divide through by the total size of the region (let's say 10kb windows) assuming that sites that are not in the VCF are invariant and so contribute 0. I know this is not a perfect estimate due to variable mapping rates across the genome, but it is fairly common practise to get an estimate.

So a workflow for instance might look like this: 1) I read in my total VCF with 10 populations into R with vcfR, 2) subsample for each population, 3) remove invariants with is.polymorphic (to speed up my calculations of per-site pi), 4) calculate per-site pi, 5) group my SNPs into 10kb intervals according to genome positions, 6) Average over each 10kb region by summing polymorphic site values and dividing through by 10,000 to get estimates of pi for each 10kb window.

The problem in the above workflow is that I've now over-filtered my per-pop VCFs at step 3, and removed heterozygous variants that otherwise would contribute towards regional pi at step 6. Again I know this is not a perfect estimate anyway, but it seems undesirable for it to compound the problem. Particularly when it is not obvious that these sites are being removed if someone is taking the definition of a polymorphic site at the allele level, which I think would be the standard assumption (although I can't speak for everybody!).

A similar workflow for window-based Fst might be: 1) I read in my total VCF with 10 populations into R with vcfR, 2) subsample for population pairs, 3) remove invariants with is.polymorphic (because I do not want values of Fst=0 when a site cannot vary between my focal populations, e.g. if it is private to any of the other 8), 4) calculate per-site Fst, 5) group my SNPs into 10kb intervals; 6) either calculate the average Fst per window, or look for windows with excessive proportions of high-Fst SNPs.

The problem in the workflow above is that it's important to remove the invariants when calculating the Fst, because we don't want our averaged Fst values to be downwardly biased by lots of zeroes from otherwise lineage-specific variants that are in our VCF due to our population sampling. Similarly if we're looking for windows with excessive proportions of high Fst SNPs, we don't want to be including invariant SNPs when estimating those proportions because they inflate the denominator. We do however want to include 0 values that come from fully heterozygous sites, because these are true 0 values reflecting no differentiation, but under the workflow above we are currently omitting them (based on an allele-definition of polymorphism).

You're right that a combination of is.polymorphic and is.het can produce the output I am interested in, for eg:

# Produce invariant-filtered with and without hets
vcf_with_hets <- vcf_sub[is.polymorphic(vcf_sub,na.omit = T) | 
                          rowSums(is.het(extract.gt(vcf_sub))) == (ncol(vcf_sub@gt)-1), ]
vcf_without_hets <- vcf_sub[is.polymorphic(vcf_sub,na.omit = T), ]

I guess my main concern is that vcfR is currently behaving in a way that is potentially ambiguous, and there isn't documentation available that makes it clear. Even adding a flag to is.polymorphic along the lines of keep_hets=TRUE would highlight that there are two interpretations to this filtering. I also think that given the proportion of invariant sites that are currently being filtered that are also fully heterozygous will be pretty low, it's probably safer to keep these sites by default. That is, we'll get the majority of the benefits to speed from moving the vast majority of invariants, without removing these additional sites that, in some circumstances, will affect some analyses.

Apologies if this is coming across as incoherent rambling, just trying to make my examples clearer!

Jim

knausb commented 3 years ago

Hi Jim,

I'm not sure we're getting closer here. I feel you have specific analysis interests, but I'm wondering if we've already addressed methods similar to what you're looking for, even if they're not exactly what you're looking for. Please consider the following.

Thanks! Brian

JimWhiting91 commented 3 years ago

Hi Brian,

I think maybe I've overcomplicated things. I am not interested in a specific biological question here, my main concern is that I believe the documentation for is.polymorphic() is unclear, given the ambiguity that clearly exists given we are having this discussion. In particular:

The function is_polymorphic returns a vector of logicals indicating whether a variant is polymorphic. Only variable sites are reported in vcf files. However, once someone manipulates a vcfR object, a site may become invariant. For example, if a sample is removed it may result in a site becoming invariant. This function queries the sites in a vcfR object and returns a vector of logicals (TRUE/FALSE) to indicate if they are actually variable.

A site that is sequenced as heterozygous in all individuals in a population will be reported as a variable site in a VCF. It varies at the sequence level, and demonstrates that two alleles exist within the population. Many would consider this the definition of a polymorphism. I appreciate there is no variation among genotypes, but everyone I have spoken with would regard these sites as polymorphic. Therefore, given the documentation, I would regard the removal of these sites to be unexpected.

This would not be an issue if removing these sites did not affect analyses, as you mentioned earlier:

This is important because this variant may not contribute to an analysis where you are looking for genotypes that may not contribute to an analysis looking for differences among populations. This means you may omit it from analysis and may save yourself a little compute time.

However I do not understand this to be the case. In the attached example, I have taken 19 individuals from my dataset at a small region with 158 variants. Of these, 49 are invariant. I have then forcibly introduced 20 fully heterozygous genotypes to the first 20 SNPs. After this, I can filter the VCF with is.polymorphic() to leave 96 "polymorphic" sites. I have also filtered the VCF as I described in my previous comment to remove invariants but retain the fully heterozygous sites (so 96 sites + 20 fully heterozygous sites). I have then used general vcfR functionality as outlined in your excellent tutorials to show that the outputs from the filtered VCF with full heterozygotes retained produces identical results for Tajima's D, Nei's Gst, Fst and Fis as the full, unfiltered VCF. In contrast, the VCF filtered for invariants just with is.polymorphic() produces reduced Tajima's D and inflated Gst, Fst, and Fis.

From this then, I would argue that you cannot omit these variants from your analysis to save compute time without affecting your results. In contrast, if you retain fully heterozygous sites, and only filter true invariants, you can.

These again are just basic examples, but I expect this effect would apply to other general analyses. I cannot demonstrate them all, but hopefully I have convinced you that these sites can have an effect, and if they can under certain circumstances then I would argue default behaviour should be to retain them. I have found vcfR to be tremendously useful within my workflows and it has become a keystone package for much of my recent work across many diverse and different analyses.

As you say, the functionality already exists in that I can retain these if I know they are being removed by is.polymorphic(), but the point I am trying to emphasise is that I don't think the documentation implies that fully heterozygous sites are removed by default. This is behaviour that I came across by accident, after assuming that is.polymorphic() would retain what I understood to be polymorphic sites based on the documentation. I'd be happy to poll Twitter and see what other's think if you are still unconvinced that there's ambiguity here?

Please find the example data and code below, and thanks for bearing with me!

Cheers, Jim

vcfR_ispolymorphic_example.zip

knausb commented 3 years ago

This appears to be a simple misunderstanding on what "polymorphism" means. For example, wikipedia's polymorphism page states that it is two or more forms. If all your genotypes are identical then the site is not polymorphic. There appears to be no actionable item here.

JimWhiting91 commented 3 years ago

Hi Brian, I don't think that general definition applies to the term polymorphism within genetics, as laid out in the specific Wikipedia entry for gene polymorphism which refers to alleles, not genotypes https://en.m.wikipedia.org/wiki/Gene_polymorphism .

All the best, Jim

knausb commented 3 years ago

Changing the behavior of is.polymorphic() would break every existing line of code that already calls it. Including the code I use to do my job. And at >100k downloads I have no idea how much of other people's code it will break. At this point I may as well have named the function thingy2() but I would still be against renaming it. Instead of working to break other people's code I think we should work on positive things.