thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
58 stars 23 forks source link

Whitelist indexing off by 1 in COL #108

Closed RGCheek closed 3 years ago

RGCheek commented 3 years ago

Hi Thierry,

Describe the bug I've been trying to use the whitelist from Radiator to calculate nucleotide diversity using populations from Stacks. Populations reads in the correct number of loci, but only 35 variant sites instead of the expected 3409.

I tried doing a cursory filtering in Plink to produce a different whitelist and it seems like the COL values provided by radiator is off by 1.

When I made a new whitelist from the radiator output and subtracted 1 from each of the COL values it showed the correct number of variant sites in populations.

To Reproduce I sent you the vcf file and strata for issue #102. Let me know if you need me to re-send them.

I included a .txt file with the log info from filter_rad in the zip. As well as the whitelist from Plink and Radiator, and my "fixed" whitelist from Radiator.

sessionInfo() R version 4.0.2 (2020-06-22) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0 purrr_0.3.4 SeqArray_1.29.3 [11] gdsfmt_1.24.1 radiator_1.1.9

loaded via a namespace (and not attached): [1] colorspace_2.0-0 seqinr_4.2-4 deldir_0.1-29 ellipsis_0.3.1 class_7.3-17 XVector_0.28.0 fs_1.5.0
[8] GenomicRanges_1.40.0 rstudioapi_0.13 mice_3.12.0 furrr_0.2.1 listenv_0.8.0 farver_2.0.3 lubridate_1.7.9.2
[15] fansi_0.4.1 xml2_1.3.2 codetools_0.2-16 splines_4.0.2 ade4_1.7-15 jsonlite_1.7.1 broom_0.7.2
[22] dbplyr_2.0.0 cluster_2.1.0 shiny_1.5.0 httr_1.4.2 BiocManager_1.30.10 compiler_4.0.2 backports_1.2.0
[29] assertthat_0.2.1 Matrix_1.2-18 fastmap_1.0.1 cli_2.2.0 later_1.1.0.1 htmltools_0.5.0 tools_4.0.2
[36] igraph_1.2.6 coda_0.19-4 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.3 reshape2_1.4.4 gmodels_2.18.1
[43] Rcpp_1.0.5 cellranger_1.1.0 raster_3.3-13 vctrs_0.3.5 Biostrings_2.56.0 spdep_1.1-5 gdata_2.18.0
[50] ape_5.4-1 nlme_3.1-149 progressr_0.6.0 globals_0.14.0 rvest_0.3.6 mime_0.9 lifecycle_0.2.0
[57] gtools_3.8.2 future_1.20.1 SNPRelate_1.22.0 LearnBayes_2.15.1 zlibbioc_1.34.0 MASS_7.3-53 scales_1.1.1
[64] hms_0.5.3 promises_1.1.1 parallel_4.0.2 geodist_0.0.6 expm_0.999-5 HardyWeinberg_1.6.8 gridExtra_2.3
[71] UpSetR_1.4.0 stringi_1.5.3 S4Vectors_0.26.1 e1071_1.7-3 permute_0.9-5 BiocGenerics_0.34.0 boot_1.3-25
[78] truncnorm_1.0-8 spData_0.3.8 GenomeInfoDb_1.24.2 rlang_0.4.9 pkgconfig_2.0.3 bitops_1.0-6 Rsolnp_1.16
[85] lattice_0.20-41 sf_0.9-6 labeling_0.4.2 tidyselect_1.1.0 parallelly_1.21.0 plyr_1.8.6 magrittr_2.0.1
[92] R6_2.5.0 IRanges_2.22.2 generics_0.1.0 DBI_1.1.0 withr_2.3.0 haven_2.3.1 pillar_1.4.7
[99] carrier_0.1.0 mgcv_1.8-33 units_0.6-7 RCurl_1.98-1.2 sp_1.4-4 modelr_0.1.8 crayon_1.3.4
[106] KernSmooth_2.23-17 readxl_1.3.1 adegenet_2.1.3 grid_4.0.2 data.table_1.13.2 vegan_2.5-7 reprex_0.3.0
[113] digest_0.6.27 classInt_0.4-3 xtable_1.8-4 httpuv_1.5.4 fst_0.9.4 stats4_4.0.2 munsell_0.5.0
[120] viridisLite_0.3.0

radiator_issue_data.zip

thierrygosselin commented 3 years ago

Their is more info on how it's read and modified in radiator::read_vcf

stacks VCFs: with de novo approaches, the CHROM column is filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and the COL column is POS -1. With a reference genome, the ID column in stacks VCF is separated into "LOCUS", "COL", "STRANDS".

thierrygosselin commented 3 years ago

Ok so let's look at these VCF columns #CHROM POS ID REF ALT in the VCF you provided:

Line 22:

scaffold_0  714225  59:24:- G   A

Inside radiator whitelist you have this:

VARIANT_ID  CHROM   LOCUS   COL STRANDS POS MARKERS REF ALT M_SEQ
7   scaffold_0  59  24  -   714225  scaffold_0__59__714225  G   A   7

Unless I'm mistaken, inside stacks VCF files using a reference genome the 59:24:- is for LOCUS COL STRANDS, if it's actually POS inside the VCF ID column and that to get the COL info I have to modify it like de novo data with POS-1 I'll do it. I'll need to confirm this with Julian Catchen (stacks) ...

thierrygosselin commented 3 years ago

Could you send me the sumstats file generated by stacks from the same run it produced the VCF you're using ? Or check the corresponding fields ?

thierrygosselin commented 3 years ago

Early 2020 stacks google group discussion on this

RGCheek commented 3 years ago

Hi Theirry,

Thank you for your help on this!

Could you send me the sumstats file generated by stacks from the same run it produced the VCF you're using ?

Attached are the sumstats files from Stacks:Populations using the raw whitelist from radiator and the whitelist I edited by subtracting 1 from the COL column.
sumstats_issj.zip

"I have to modify it like de novo data with POS-1"

I'm confused. To use a whitelist in Stacks:Populations (what I'm doing) you only include the LOCUS COL to generate a SNP specific whitelist, correct? https://catchenlab.life.illinois.edu/stacks/manual/index.php#wl

Or check the corresponding fields ?

in the Vcf and whitelists?

Thanks!