thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
58 stars 23 forks source link

filter_hwe not blacklisting loci that meet exclusion thresholds #89

Closed BrennaF closed 4 years ago

BrennaF commented 4 years ago

Hi Thierry! I hope all is well with you! I think I've found a bug in filter_hwe. See what you think.

Describe the bug filter_hwe is not blacklisting loci that meet the exclusion thresholds. It seems that many of the loci incorrectly making it through the blacklist filter have heterzygote excess within sites (e.g. all individuals at a site are genotyped as heterozygotes).

I ran the same data set in plink to calculate hwd + used their calculator (https://www.cog-genomics.org/software/stats) and was able to confirm that the loci in question should have been removed given the population/p-value thresholds in radiator.

To Reproduce filter_hwe(data="D:/0-RALU/01-HWE/ralu_subset.vcf", strata="D:/0-RALU/00-filter_vcf_code/RALUstrata.txt", interactive.filter=F, hw.pop.threshold = 2, midp.threshold = 1, parallel.core = parallel::detectCores() - 7)

(I get the same results with interactive.filter=T & setting the threshold interactively).

vcf and strata file HW_ralu.zip

output files 01_filter_hwe_20200724@1220.zip

For example: Look at "genotypes.summary.tsv" for the following loci: 1552112, 188588, 1138211. Many sites/strata have 12/12 individuals genotyped as hets. According to plink's calculations, those p-values are < 0.05, so should be blacklisted, but they are included in the whitelist.

Session Info

devtools::session_info()

  • Session info -------------------------------------------------------- setting value
    version R version 3.6.3 (2020-02-29) os Windows >= 8 x64
    system x86_64, mingw32
    ui RStudio
    language (EN)
    collate English_United States.1252
    ctype English_United States.1252
    tz America/New_York
    date 2020-07-24

[1] C:/Users/brf8-admin/Documents/R/win-library/3.6 [2] C:/Program Files/R/R-3.6.3/library

thierrygosselin commented 4 years ago

Hi Brenna !!!

Sorry for the long delay in response I was away for vacations in the middle of nowhere in the the gulf of St. Lawrence.

I'll have a look at it today.

Send me an email, for some reasons mine don't get to you? did you change address ? University ?

Cheers Thierry

BrennaF commented 4 years ago

No worries, glad to hear you're getting out for vacation!

I'll email you now...neither of mine have changed, so it is a bit strange; I have not been receiving emails from you, and you are clearly not getting mine!

thierrygosselin commented 4 years ago

running the current version of radiator: 1.1.6 (June 2020) I'm unable to reproduce the error.

I haven't tested 1.1.5

test1 <- radiator::filter_hwe(
  data="ralu_subset.vcf",
  strata="RALUstrata.txt")`
# ################################### RESULTS ####################################
# Filter HWE: 2 / 0.05
# Number of individuals / strata / chrom / locus / SNP:
# Before: 360 / 31 / 1 / 1000 / 1000
# Blacklisted: 0 / 0 / 0 / 96 / 96
# After: 360 / 31 / 1 / 904 / 904

length(unique(test1$MARKERS))

gives 904 markers and the blacklist in the folder does contain 96 markers.

test2 <- radiator::filter_hwe(
  data="ralu_subset.vcf",
  strata="RALUstrata.txt",
  interactive.filter = FALSE,
  hw.pop.threshold = 2,
  midp.threshold = 1)
length(unique(test2$MARKERS))

Same conclusion

Before updating your version wait for today's version, it fixes other bug introduce by deprecated functions from other packages.

If you're still experiencing a problem, re-open the issue! Thierry

BrennaF commented 4 years ago

Thanks for looking into this Thierry - I will check later next week to see why it was failing on my end. -Brenna

thierrygosselin commented 4 years ago

This bug was fixed a while ago, latest version will work.

FYI use lower thresholds for populations genomics otherwise you'll remove important markers. Use higher thresholds for parentage analysis.

It's at the end of filter_rad because low thresholds and the way it's built is still good at the population discovery step. Best Thierry