thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
59 stars 23 forks source link

Error with filter_rad() - column doesn't exist #122

Closed bpbentley closed 3 years ago

bpbentley commented 3 years ago

Hi Thierry,

I'm running into issues with Radiator when running filter_rad() using a VCF file generated through Stacks. Additionally, my colleagues (who are running Radiator on MAC-OS) also run into the same error. We're just trying to filter these pilot data to inform further sampling and sequencing, so the number of samples per strata is low in some instances.

Describe the bug When running filter_rad(): Radiator runs through the filtering steps fine. When it gets to the genomic converter step, it generates an error that "Column GT doesn't exist".

To Reproduce This is the input that we are using for the filter_rad() step:

input_vcf<-("URO_snps.recode.vcf")
input_strata<-("URO_STRATA.tsv")

radiator::summary_strata(input_strata)

radiator_filtered_M <- filter_rad(
  data = input_vcf,
  filter.short.ld = NULL,
  filter.long.ld = 0.9,
  filter.hwe = FALSE,
  output = c("vcf","genind","genlight","plink","genepop","gtypes"),
  strata = input_strata,
  filename = "radiator_filtered_M",
  interactive.filter = FALSE,
  verbose = TRUE,
  parallel.core = 1L)

This is the full output that is generated:

################################################################################
############################# radiator::filter_rad #############################
################################################################################
Execution date@time: 20210304@1145
Folder created: filter_rad_20210304@1145
Function call and arguments stored in: radiator_filter_rad_args_20210304@1145.tsv
File written: random.seed (593467)
Filters parameters file generated: filters_parameters_20210304@1145.tsv

Reading VCF... 

Data summary: 
    number of samples: 19
    number of markers: 102155

Generating individual stats...
Generating markers stats...
[==================================================] 100%, completed, 0s
[==================================================] 100%, completed, 0s

Number of chromosome/contig/scaffold: 1
Number of locus: 11750
Number of markers: 102155
Number of strata: 7
Number of individuals: 19

Number of ind/strata:
BF = 1
FB = 1
GB = 3
HP = 1
OY = 3
SK = 4
TO = 6

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: radiator_20210304@1145.gds
################################################################################
######################### radiator::filter_monomorphic #########################
################################################################################
Execution date@time: 20210304@1146
Function call and arguments stored in: radiator_filter_monomorphic_args_20210304@1146.tsv
File written: blacklist.monomorphic.markers_20210304@1146.tsv
Synchronizing markers.meta
File written: whitelist.polymorphic.markers_20210304@1146.tsv
################################### RESULTS ####################################

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Before: 19 / 7 / 1 / 11750 / 102155
    Blacklisted: 0 / 0 / 0 / 12 / 381
    After: 19 / 7 / 1 / 11738 / 101774

Computation time, overall: 1 sec
######################### completed filter_monomorphic #########################
################################################################################
####################### radiator::filter_common_markers ########################
################################################################################
Execution date@time: 20210304@1146
Function call and arguments stored in: radiator_filter_common_markers_args_20210304@1146.tsv
Scanning for common markers...

Strata with low sample size detected: fig <- FALSE

File written: blacklist.not.common.markers_20210304@1146.tsv
File written: whitelist.common.markers_20210304@1146.tsv
################################### RESULTS ####################################

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Before: 19 / 7 / 1 / 11738 / 101774
    Blacklisted: 0 / 0 / 0 / 8308 / 81652
    After: 19 / 7 / 1 / 3430 / 20122

Computation time, overall: 1 sec
####################### completed filter_common_markers ########################
################################################################################
############################# radiator::filter_ld ##############################
################################################################################
Execution date@time: 20210304@1146

Unknowned arguments identified inside "...": 
    iiiiverbose
    iinteractive.filter

Read documentation, for latest changes, and modify your codes!

Function call and arguments stored in: radiator_filter_ld_args_20210304@1146.tsv

Interactive mode: on

Step 1. Short distance LD threshold selection
Step 2. Filtering markers based on short distance LD
Step 3. Long distance LD pruning selection
Step 4. Threshold selection
Step 5. Filtering markers based on long distance LD

Minimizing short distance LD...
    The range in the number of SNP/locus is: 1-48

Step 1. Short distance LD threshold selection
the goal is to keep only 1 SNP per read/locus
Choose the filter.short.ld threshold
Options include:
1: mac (Not sure ? use mac...)
2: random
3: first
4: middle
5: last
1

Step 2. Filtering markers based on short distance LD
filter.short.ld = mac
[==================================================] 100%, completed, 0s
File written: whitelist.short.ld.tsv
File written: blacklist.short.ld.tsv
################################### RESULTS ####################################

Filter short ld threshold: mac
Number of individuals / strata / chrom / locus / SNP:
    Before: 19 / 7 / 1 / 3430 / 20122
    Blacklisted: 0 / 0 / 0 / 0 / 16692
    After: 19 / 7 / 1 / 3430 / 3430

Do you want to continue filtering using long distance ld  ? (y/n):
y

Step 3. Long distance LD pruning selection
Using missingness to select SNPs in LD is still under construction with de novo data
Basic pruning using SNPRelate is used

Long distance LD pruning WITHOUT missing data stats
Using a subset of the markers: 686
to change the default proportion: subsample.markers.stats argument

Linkage Disequilibrium (LD) estimation on genotypes
LD computation time: 0 sec
Generating statistics for boxplot
File written: ld.summary.stats.subsample.tsv
Generating boxplot

Step 4. Threshold selection
Look at the boxplot, a threshold of 0.2 blacklist more markers than 0.8

Enter the long LD threshold (filter.long.ld threshold, double/proportion):
0.9
Pruning with SNPRelate...
SNV pruning based on LD:
Calculating allele counts/frequencies ...
[==================================================] 100%, completed, 0s
# of selected variants: 3,430
    # of samples: 19
    # of SNVs: 3,430
    using 1 thread
    sliding window: 500,000 basepairs, Inf SNPs
    |LD| threshold: 0.9
    method: R
Chromosome 1: 0.92%, 937/102,155
937 markers are selected in total.
LD pruning computation time: 1 sec
Number of markers whitelised: 937
File written: whitelist.long.ld.tsv
File written: blacklist.long.ld.tsv
################################### RESULTS ####################################

Filter long ld threshold: 0.9
Number of individuals / strata / chrom / locus / SNP:
    Before: 19 / 7 / 1 / 3430 / 3430
    Blacklisted: 0 / 0 / 0 / 2493 / 2493
    After: 19 / 7 / 1 / 937 / 937

Computation time, overall: 101 sec
############################# completed filter_ld ##############################

Preparing output files...
File written: whitelist.markers.tsv
File written: blacklist.markers.tsv
Writing the filtered strata: strata.filtered.tsvstrata.filtered.tsv

Generating statistics after filtering
calculating individual stats...
File written: individuals qc info and stats summary
File written: individuals qc plot
calculating markers stats...
[==================================================] 100%, completed, 0s
[==================================================] 100%, completed, 0s

Transferring data to genomic converter...
Calibrating REF/ALT alleles...
Synchronizing data and strata...
    Number of strata: 7
    Number of individuals: 19

Writing tidy data set:
radiator_data_20210304@1146.rad
Calibrating REF/ALT alleles...
Calibrating REF/ALT alleles...
Error: Can't subset columns that don't exist.
x Column `GT` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `STRATA`. 
2: Unknown or uninitialised column: `STRATA`. 
3: Unknown or uninitialised column: `STRATA`. 

Computation time, overall: 91 sec
############################# completed filter_rad #############################

With the rlang expanded error:

<error/vctrs_error_subscript_oob>
Can't subset columns that don't exist.
x Column `GT` doesn't exist.
Backtrace:
  1. radiator::filter_rad(...)
  6. dplyr:::select.data.frame(., POP_ID, INDIVIDUALS, MARKERS, GT)
  7. tidyselect::eval_select(expr(c(...)), .data)
  8. tidyselect:::eval_select_impl(...)
 16. tidyselect:::vars_select_eval(...)
 17. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
 18. tidyselect:::eval_c(expr, data_mask, context_mask)
 19. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
 20. tidyselect:::walk_data_tree(new, data_mask, context_mask)
 21. tidyselect:::as_indices_sel_impl(...)
 22. tidyselect:::as_indices_impl(x, vars, strict = strict)
 23. tidyselect:::chr_as_locations(x, vars)
 24. vctrs::vec_as_location(x, n = length(vars), names = vars)
 26. vctrs:::stop_subscript_oob(...)
 27. vctrs:::stop_subscript(...)
Run `rlang::last_trace()` to see the full context.

Devtools session info:

- Session info --------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_Australia.1252      
 ctype    English_Australia.1252      
 tz       Australia/Perth             
 date     2021-03-04                  

- Packages ------------------------------------------------------------------------------------------------------------
 package          * version  date       lib source                                   
 ade4             * 1.7-16   2020-10-28 [1] CRAN (R 4.0.3)                           
 adegenet         * 2.1.3    2020-12-30 [1] Github (thibautjombart/adegenet@78be588) 
 ape                5.4-1    2020-08-13 [1] CRAN (R 4.0.2)                           
 apex               1.0.4    2020-04-11 [1] CRAN (R 4.0.3)                           
 assertthat         0.2.1    2019-03-21 [1] CRAN (R 4.0.2)                           
 BiocGenerics       0.36.0   2020-10-27 [1] Bioconductor                             
 Biostrings         2.58.0   2020-10-27 [1] Bioconductor                             
 bitops             1.0-6    2013-08-17 [1] CRAN (R 4.0.0)                           
 boot               1.3-25   2020-04-26 [1] CRAN (R 4.0.2)                           
 callr              3.5.1    2020-10-13 [1] CRAN (R 4.0.3)                           
 class              7.3-17   2020-04-26 [1] CRAN (R 4.0.2)                           
 classInt           0.4-3    2020-04-07 [1] CRAN (R 4.0.2)                           
 cli                2.3.1    2021-02-23 [1] CRAN (R 4.0.3)                           
 cluster            2.1.0    2019-06-19 [1] CRAN (R 4.0.2)                           
 coda               0.19-4   2020-09-30 [1] CRAN (R 4.0.2)                           
 codetools          0.2-18   2020-11-04 [1] CRAN (R 4.0.3)                           
 colorspace         2.0-0    2020-11-11 [1] CRAN (R 4.0.3)                           
 crayon             1.4.1    2021-02-08 [1] CRAN (R 4.0.3)                           
 data.table         1.14.0   2021-02-21 [1] CRAN (R 4.0.3)                           
 DBI                1.1.1    2021-01-15 [1] CRAN (R 4.0.3)                           
 deldir             0.2-3    2020-11-09 [1] CRAN (R 4.0.3)                           
 desc               1.2.0    2018-05-01 [1] CRAN (R 4.0.2)                           
 devtools           2.3.2    2020-09-18 [1] CRAN (R 4.0.2)                           
 digest             0.6.27   2020-10-24 [1] CRAN (R 4.0.3)                           
 dplyr            * 1.0.4    2021-02-02 [1] CRAN (R 4.0.3)                           
 e1071              1.7-4    2020-10-14 [1] CRAN (R 4.0.3)                           
 ellipsis           0.3.1    2020-05-15 [1] CRAN (R 4.0.2)                           
 evaluate           0.14     2019-05-28 [1] CRAN (R 4.0.2)                           
 expm               0.999-5  2020-07-20 [1] CRAN (R 4.0.2)                           
 fansi              0.4.2    2021-01-15 [1] CRAN (R 4.0.3)                           
 farver             2.1.0    2021-02-28 [1] CRAN (R 4.0.3)                           
 fastmap            1.0.1    2019-10-08 [1] CRAN (R 4.0.2)                           
 fastmatch          1.1-0    2017-01-28 [1] CRAN (R 4.0.0)                           
 fs                 1.5.0    2020-07-31 [1] CRAN (R 4.0.2)                           
 fst                0.9.4    2020-08-27 [1] CRAN (R 4.0.3)                           
 gdata            * 2.18.0   2017-06-06 [1] CRAN (R 4.0.3)                           
 gdsfmt           * 1.26.1   2020-12-22 [1] Bioconductor                             
 generics           0.1.0    2020-10-31 [1] CRAN (R 4.0.3)                           
 GenomeInfoDb       1.26.2   2020-12-08 [1] Bioconductor                             
 GenomeInfoDbData   1.2.4    2020-12-29 [1] Bioconductor                             
 GenomicRanges      1.42.0   2020-10-27 [1] Bioconductor                             
 ggplot2            3.3.3    2020-12-30 [1] CRAN (R 4.0.3)                           
 glue               1.4.2    2020-08-27 [1] CRAN (R 4.0.2)                           
 gmodels            2.18.1   2018-06-25 [1] CRAN (R 4.0.2)                           
 gridExtra          2.3      2017-09-09 [1] CRAN (R 4.0.3)                           
 gtable             0.3.0    2019-03-25 [1] CRAN (R 4.0.2)                           
 gtools             3.8.2    2020-03-31 [1] CRAN (R 4.0.3)                           
 hms                1.0.0    2021-01-13 [1] CRAN (R 4.0.3)                           
 htmltools          0.5.1.1  2021-01-22 [1] CRAN (R 4.0.3)                           
 httpuv             1.5.4    2020-06-06 [1] CRAN (R 4.0.2)                           
 igraph             1.2.6    2020-10-06 [1] CRAN (R 4.0.2)                           
 IRanges            2.24.1   2020-12-12 [1] Bioconductor                             
 KernSmooth         2.23-18  2020-10-29 [1] CRAN (R 4.0.3)                           
 knitr              1.31     2021-01-27 [1] CRAN (R 4.0.3)                           
 labeling           0.4.2    2020-10-20 [1] CRAN (R 4.0.3)                           
 later              1.1.0.1  2020-06-05 [1] CRAN (R 4.0.2)                           
 lattice            0.20-41  2020-04-02 [1] CRAN (R 4.0.2)                           
 LearnBayes         2.15.1   2018-03-18 [1] CRAN (R 4.0.0)                           
 lifecycle          1.0.0    2021-02-15 [1] CRAN (R 4.0.4)                           
 magrittr           2.0.1    2020-11-17 [1] CRAN (R 4.0.3)                           
 MASS               7.3-53   2020-09-09 [1] CRAN (R 4.0.2)                           
 Matrix             1.3-0    2020-12-22 [1] CRAN (R 4.0.3)                           
 memoise            1.1.0    2017-04-21 [1] CRAN (R 4.0.2)                           
 mgcv               1.8-33   2020-08-27 [1] CRAN (R 4.0.2)                           
 mime               0.10     2021-02-13 [1] CRAN (R 4.0.4)                           
 munsell            0.5.0    2018-06-12 [1] CRAN (R 4.0.2)                           
 nlme               3.1-151  2020-12-10 [1] CRAN (R 4.0.3)                           
 permute            0.9-5    2019-03-12 [1] CRAN (R 4.0.2)                           
 phangorn           2.5.5    2019-06-19 [1] CRAN (R 4.0.2)                           
 pillar             1.5.0    2021-02-22 [1] CRAN (R 4.0.3)                           
 pinfsc50           1.2.0    2020-06-03 [1] CRAN (R 4.0.0)                           
 pkgbuild           1.2.0    2020-12-15 [1] CRAN (R 4.0.3)                           
 pkgconfig          2.0.3    2019-09-22 [1] CRAN (R 4.0.2)                           
 pkgload            1.2.0    2021-02-23 [1] CRAN (R 4.0.3)                           
 plyr               1.8.6    2020-03-03 [1] CRAN (R 4.0.2)                           
 prettyunits        1.1.1    2020-01-24 [1] CRAN (R 4.0.2)                           
 processx           3.4.5    2020-11-30 [1] CRAN (R 4.0.3)                           
 progress           1.2.2    2019-05-16 [1] CRAN (R 4.0.2)                           
 promises           1.1.1    2020-06-09 [1] CRAN (R 4.0.2)                           
 ps                 1.6.0    2021-02-28 [1] CRAN (R 4.0.3)                           
 purrr              0.3.4    2020-04-17 [1] CRAN (R 4.0.2)                           
 quadprog           1.5-8    2019-11-20 [1] CRAN (R 4.0.0)                           
 R6                 2.5.0    2020-10-28 [1] CRAN (R 4.0.3)                           
 radiator         * 1.1.9    2021-03-02 [1] Github (thierrygosselin/radiator@5c6b865)
 raster             3.4-5    2020-11-14 [1] CRAN (R 4.0.3)                           
 Rcpp               1.0.6    2021-01-15 [1] CRAN (R 4.0.3)                           
 RCurl              1.98-1.2 2020-04-18 [1] CRAN (R 4.0.3)                           
 readr              1.4.0    2020-10-05 [1] CRAN (R 4.0.2)                           
 remotes            2.2.0    2020-07-21 [1] CRAN (R 4.0.2)                           
 reshape2           1.4.4    2020-04-09 [1] CRAN (R 4.0.3)                           
 rlang              0.4.10   2020-12-30 [1] CRAN (R 4.0.3)                           
 rmarkdown          2.7      2021-02-19 [1] CRAN (R 4.0.3)                           
 rprojroot          2.0.2    2020-11-15 [1] CRAN (R 4.0.3)                           
 rstudioapi         0.13     2020-11-12 [1] CRAN (R 4.0.3)                           
 S4Vectors          0.28.1   2020-12-09 [1] Bioconductor                             
 scales             1.1.1    2020-05-11 [1] CRAN (R 4.0.2)                           
 SeqArray           1.30.0   2020-10-28 [1] Bioconductor                             
 seqinr             4.2-5    2020-12-17 [1] CRAN (R 4.0.3)                           
 sessioninfo        1.1.1    2018-11-05 [1] CRAN (R 4.0.2)                           
 sf                 0.9-6    2020-09-13 [1] CRAN (R 4.0.2)                           
 shiny              1.5.0    2020-06-23 [1] CRAN (R 4.0.2)                           
 SNPRelate        * 1.22.0   2020-04-27 [1] Bioconductor                             
 sp                 1.4-4    2020-10-07 [1] CRAN (R 4.0.2)                           
 spData             0.3.8    2020-07-03 [1] CRAN (R 4.0.2)                           
 spdep              1.1-5    2020-06-29 [1] CRAN (R 4.0.2)                           
 strataG          * 2.4.910  2020-12-30 [1] Github (ericarcher/strataG@e5f4812)      
 stringi            1.5.3    2020-09-09 [1] CRAN (R 4.0.2)                           
 stringr            1.4.0    2019-02-10 [1] CRAN (R 4.0.2)                           
 testthat           3.0.1    2020-12-17 [1] CRAN (R 4.0.3)                           
 tibble             3.1.0    2021-02-25 [1] CRAN (R 4.0.3)                           
 tidyr              1.1.2    2020-08-27 [1] CRAN (R 4.0.2)                           
 tidyselect       * 1.1.0    2020-05-11 [1] CRAN (R 4.0.2)                           
 tinytex            0.29     2021-01-21 [1] CRAN (R 4.0.3)                           
 units              0.6-7    2020-06-13 [1] CRAN (R 4.0.2)                           
 UpSetR             1.4.0    2019-05-22 [1] CRAN (R 4.0.3)                           
 usethis            2.0.0    2020-12-10 [1] CRAN (R 4.0.3)                           
 utf8               1.1.4    2018-05-24 [1] CRAN (R 4.0.2)                           
 vcfR             * 1.12.0   2020-09-01 [1] CRAN (R 4.0.3)                           
 vctrs              0.3.6    2020-12-17 [1] CRAN (R 4.0.3)                           
 vegan              2.5-7    2020-11-28 [1] CRAN (R 4.0.3)                           
 viridisLite        0.3.0    2018-02-01 [1] CRAN (R 4.0.2)                           
 withr              2.4.1    2021-01-26 [1] CRAN (R 4.0.3)                           
 xfun               0.19     2020-10-30 [1] CRAN (R 4.0.3)                           
 xtable             1.8-4    2019-04-21 [1] CRAN (R 4.0.2)                           
 XVector            0.30.0   2020-10-28 [1] Bioconductor                             
 yaml               2.2.1    2020-02-01 [1] CRAN (R 4.0.2)                           
 zlibbioc           1.36.0   2020-10-28 [1] Bioconductor                             

[1] C:/Users/blair/Documents/R/win-library/4.0
[2] C:/Program Files/R/R-4.0.3/library

If it helps I can provide the VCF for replication. Thanks in advance for you help with this super useful package!

Kind regards, Blair

tomoosting commented 3 years ago

Not sure if it helps but I have a similar error message when using write_bayescan: Error: Can't subset columns that don't exist. x ColumnPOP_IDdoesn't exist.

Runnign the following code: radiator::write_bayescan(data = new.tidy, pop.select = NULL, filename = "test.baye", parallel.core = 1L)

While the POP_ID column is present in the data frame: str(new.tidy) tibble [54,600 x 14] (S3: tbl_df/tbl/data.frame) $ POP_ID : Factor w/ 10 levels "Hauraki","Northland",..: 1 1 1 1 1 1 2 2 2 2 ... $ INDIVIDUALS : chr [1:54600] "CA115001" "CA115001" "CA115002" "CA115003" ... $ VARIANT_ID : int [1:54600] 1 1 1 1 1 1 1 1 1 1 ... $ MARKERS : chr [1:54600] "134092053409205" "134092053409205" "134092053409205" "134092053409205" ... $ CHROM : chr [1:54600] "1" "1" "1" "1" ... $ LOCUS : chr [1:54600] "3409205" "3409205" "3409205" "3409205" ... $ POS : chr [1:54600] "3409205" "3409205" "3409205" "3409205" ... $ REF : chr [1:54600] "T" "T" "T" "T" ... $ ALT : chr [1:54600] "C" "C" "C" "C" ... $ GT_BIN : int [1:54600] 2 2 2 1 1 0 0 0 1 2 ... $ ALLELE_REF_DEPTH: int [1:54600] NA NA NA 16 39 NA 8 5 NA NA ... $ ALLELE_ALT_DEPTH: int [1:54600] 28 21 NA 26 NA NA NA 7 NA 15 ... $ READ_DEPTH : int [1:54600] 28 21 NA 42 39 NA 8 12 NA 15 ... $ GQ : int [1:54600] 83 63 NA 99 0 NA 24 99 NA 45 ...

Would be great if this issue can be resolved :) Cheers!

thierrygosselin commented 3 years ago

should be fixed with latest release