Closed RGCheek closed 3 years ago
Their is more info on how it's read and modified in radiator::read_vcf
stacks VCFs: with de novo approaches, the CHROM column is filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and the COL column is POS -1. With a reference genome, the ID column in stacks VCF is separated into "LOCUS", "COL", "STRANDS".
Ok so let's look at these VCF columns #CHROM POS ID REF ALT
in the VCF you provided:
Line 22:
scaffold_0 714225 59:24:- G A
Inside radiator whitelist you have this:
VARIANT_ID CHROM LOCUS COL STRANDS POS MARKERS REF ALT M_SEQ
7 scaffold_0 59 24 - 714225 scaffold_0__59__714225 G A 7
Unless I'm mistaken, inside stacks VCF files using a reference genome the 59:24:-
is for LOCUS COL STRANDS
, if it's actually POS
inside the VCF ID
column and that to get the COL
info I have to modify it like de novo data with POS-1
I'll do it. I'll need to confirm this with Julian Catchen (stacks) ...
Could you send me the sumstats file generated by stacks from the same run it produced the VCF you're using ? Or check the corresponding fields ?
Hi Theirry,
Thank you for your help on this!
Could you send me the sumstats file generated by stacks from the same run it produced the VCF you're using ?
Attached are the sumstats files from Stacks:Populations using the raw whitelist from radiator and the whitelist I edited by subtracting 1 from the COL column.
sumstats_issj.zip
"I have to modify it like de novo data with POS-1"
I'm confused. To use a whitelist in Stacks:Populations (what I'm doing) you only include the LOCUS
COL to generate a SNP specific whitelist, correct? https://catchenlab.life.illinois.edu/stacks/manual/index.php#wl
Or check the corresponding fields ?
in the Vcf and whitelists?
Thanks!
Hi Thierry,
Describe the bug I've been trying to use the whitelist from Radiator to calculate nucleotide diversity using populations from Stacks. Populations reads in the correct number of loci, but only 35 variant sites instead of the expected 3409.
I tried doing a cursory filtering in Plink to produce a different whitelist and it seems like the COL values provided by radiator is off by 1.
When I made a new whitelist from the radiator output and subtracted 1 from each of the COL values it showed the correct number of variant sites in populations.
To Reproduce I sent you the vcf file and strata for issue #102. Let me know if you need me to re-send them.
I included a .txt file with the log info from filter_rad in the zip. As well as the whitelist from Plink and Radiator, and my "fixed" whitelist from Radiator.
sessionInfo() R version 4.0.2 (2020-06-22) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0 purrr_0.3.4 SeqArray_1.29.3 [11] gdsfmt_1.24.1 radiator_1.1.9
loaded via a namespace (and not attached): [1] colorspace_2.0-0 seqinr_4.2-4 deldir_0.1-29 ellipsis_0.3.1 class_7.3-17 XVector_0.28.0 fs_1.5.0
[8] GenomicRanges_1.40.0 rstudioapi_0.13 mice_3.12.0 furrr_0.2.1 listenv_0.8.0 farver_2.0.3 lubridate_1.7.9.2
[15] fansi_0.4.1 xml2_1.3.2 codetools_0.2-16 splines_4.0.2 ade4_1.7-15 jsonlite_1.7.1 broom_0.7.2
[22] dbplyr_2.0.0 cluster_2.1.0 shiny_1.5.0 httr_1.4.2 BiocManager_1.30.10 compiler_4.0.2 backports_1.2.0
[29] assertthat_0.2.1 Matrix_1.2-18 fastmap_1.0.1 cli_2.2.0 later_1.1.0.1 htmltools_0.5.0 tools_4.0.2
[36] igraph_1.2.6 coda_0.19-4 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.3 reshape2_1.4.4 gmodels_2.18.1
[43] Rcpp_1.0.5 cellranger_1.1.0 raster_3.3-13 vctrs_0.3.5 Biostrings_2.56.0 spdep_1.1-5 gdata_2.18.0
[50] ape_5.4-1 nlme_3.1-149 progressr_0.6.0 globals_0.14.0 rvest_0.3.6 mime_0.9 lifecycle_0.2.0
[57] gtools_3.8.2 future_1.20.1 SNPRelate_1.22.0 LearnBayes_2.15.1 zlibbioc_1.34.0 MASS_7.3-53 scales_1.1.1
[64] hms_0.5.3 promises_1.1.1 parallel_4.0.2 geodist_0.0.6 expm_0.999-5 HardyWeinberg_1.6.8 gridExtra_2.3
[71] UpSetR_1.4.0 stringi_1.5.3 S4Vectors_0.26.1 e1071_1.7-3 permute_0.9-5 BiocGenerics_0.34.0 boot_1.3-25
[78] truncnorm_1.0-8 spData_0.3.8 GenomeInfoDb_1.24.2 rlang_0.4.9 pkgconfig_2.0.3 bitops_1.0-6 Rsolnp_1.16
[85] lattice_0.20-41 sf_0.9-6 labeling_0.4.2 tidyselect_1.1.0 parallelly_1.21.0 plyr_1.8.6 magrittr_2.0.1
[92] R6_2.5.0 IRanges_2.22.2 generics_0.1.0 DBI_1.1.0 withr_2.3.0 haven_2.3.1 pillar_1.4.7
[99] carrier_0.1.0 mgcv_1.8-33 units_0.6-7 RCurl_1.98-1.2 sp_1.4-4 modelr_0.1.8 crayon_1.3.4
[106] KernSmooth_2.23-17 readxl_1.3.1 adegenet_2.1.3 grid_4.0.2 data.table_1.13.2 vegan_2.5-7 reprex_0.3.0
[113] digest_0.6.27 classInt_0.4-3 xtable_1.8-4 httpuv_1.5.4 fst_0.9.4 stats4_4.0.2 munsell_0.5.0
[120] viridisLite_0.3.0
radiator_issue_data.zip