thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
57 stars 23 forks source link

Error in filter_rad() on DArT data #68

Closed IdoBar closed 4 years ago

IdoBar commented 4 years ago

Hi Thierry,

I've come back to radiator to process some DArT data and I'm glad to see how it matured in the last couple of years.

I did came across a potential bug when filtering the data (both SNP and silico-dart). I could read in the data without a problem with read_dart(), but filtering is raising an error:

dart_strata <- read_csv("data/DArT_metadata.csv") %>% select(TARGET_ID=id, STRATA=State) %>% 
  mutate(INDIVIDUALS= TARGET_ID)
Arab_tidy_dart <- read_dart("data/Report-DAsc19-4353_ArabieiPlate3/Report_DAsc19-4353_1_moreOrders_SilicoDArT_1.csv", strata = dart_strata, 
                            tidy.dart = TRUE, path.folder = "output")
Arab_dart <- filter_rad(Arab_tidy_dart, strata = dart_strata, output = c("genind"), 
                        path.folder = "output")

This gives the following error:

The function arguments names have changed: please read documentation

Execution date@time: 20200122@0318 Folder created: filter_rad_20200122@0318 Function call and arguments stored in: radiator_filter_rad_args_20200122@0318.tsv File written: random.seed (31953) Filters parameters file generated: filters_parameters20200122@0318.tsv Generating GDS... Error in UseMethod("mutate") : no applicable method for 'mutate_' applied to an object of class "NULL"

Computation time, overall: 0 sec

The same error occurs if I use the SNPs rater than the silicoDArT markers. This is the output of session_info():

- Session info --------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows >= 8 x64            
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_Australia.1252      
 ctype    English_Australia.1252      
 tz       Australia/Brisbane          
 date     2020-01-22                  

- Packages ------------------------------------------------------------------------------------------------------------------------------------------------
 package          * version   date       lib source                                   
 ade4             * 1.7-13    2018-08-31 [1] CRAN (R 3.5.1)                           
 adegenet         * 2.1.1     2018-02-02 [1] CRAN (R 3.5.3)                           
 ape                5.3       2019-03-17 [1] CRAN (R 3.5.3)                           
 assertthat         0.2.1     2019-03-21 [1] CRAN (R 3.5.3)                           
 backports          1.1.5     2019-10-02 [1] CRAN (R 3.5.3)                           
 BiocGenerics       0.28.0    2018-10-30 [1] Bioconductor                             
 BiocManager        1.30.4    2018-11-13 [1] CRAN (R 3.5.1)                           
 Biostrings         2.50.2    2019-01-03 [1] Bioconductor                             
 bitops             1.0-6     2013-08-17 [1] CRAN (R 3.5.0)                           
 boot               1.3-23    2019-07-05 [2] CRAN (R 3.5.3)                           
 broom              0.5.3     2019-12-14 [1] CRAN (R 3.5.3)                           
 callr              3.4.0     2019-12-09 [1] CRAN (R 3.5.3)                           
 cellranger         1.1.0     2016-07-27 [1] CRAN (R 3.5.1)                           
 class              7.3-15    2019-01-01 [2] CRAN (R 3.5.3)                           
 classInt           0.3-3     2019-04-26 [1] CRAN (R 3.5.3)                           
 cli                2.0.1     2020-01-08 [1] CRAN (R 3.5.3)                           
 cluster            2.1.0     2019-06-19 [1] CRAN (R 3.5.3)                           
 coda               0.19-3    2019-07-05 [1] CRAN (R 3.5.3)                           
 colorspace         1.4-1     2019-03-18 [1] CRAN (R 3.5.3)                           
 crayon             1.3.4     2017-09-16 [1] CRAN (R 3.5.1)                           
 curl               4.3       2019-12-02 [1] CRAN (R 3.5.3)                           
 data.table         1.12.8    2019-12-09 [1] CRAN (R 3.5.3)                           
 DBI                1.1.0     2019-12-15 [1] CRAN (R 3.5.3)                           
 dbplyr             1.4.2     2019-06-17 [1] CRAN (R 3.5.3)                           
 deldir             0.1-22    2019-07-05 [1] CRAN (R 3.5.1)                           
 desc               1.2.0     2018-05-01 [1] CRAN (R 3.5.1)                           
 devtools           2.1.0     2019-07-06 [1] CRAN (R 3.5.3)                           
 digest             0.6.23    2019-11-23 [1] CRAN (R 3.5.3)                           
 dplyr            * 0.8.3     2019-07-04 [1] CRAN (R 3.5.1)                           
 e1071              1.7-2     2019-06-05 [1] CRAN (R 3.5.3)                           
 expm               0.999-4   2019-03-21 [1] CRAN (R 3.5.3)                           
 fansi              0.4.1     2020-01-08 [1] CRAN (R 3.5.3)                           
 forcats          * 0.4.0     2019-02-17 [1] CRAN (R 3.5.2)                           
 fs                 1.3.1     2019-05-06 [1] CRAN (R 3.5.3)                           
 fst                0.9.0     2019-04-09 [1] CRAN (R 3.5.3)                           
 gdata              2.18.0    2017-06-06 [1] CRAN (R 3.5.2)                           
 gdsfmt             1.18.1    2018-11-27 [1] Bioconductor                             
 GenABEL            1.8-0     2013-12-27 [1] CRAN (R 3.5.1)                           
 generics           0.0.2     2018-11-29 [1] CRAN (R 3.5.3)                           
 GenomeInfoDb       1.18.2    2019-02-12 [1] Bioconductor                             
 GenomeInfoDbData   1.2.0     2020-01-21 [1] Bioconductor                             
 GenomicRanges      1.34.0    2018-10-30 [1] Bioconductor                             
 ggplot2          * 3.2.1     2019-08-10 [1] CRAN (R 3.5.3)                           
 glue               1.3.1     2019-03-12 [1] CRAN (R 3.5.3)                           
 gmodels            2.18.1    2018-06-25 [1] CRAN (R 3.5.2)                           
 gtable             0.3.0     2019-03-25 [1] CRAN (R 3.5.3)                           
 gtools             3.8.1     2018-06-26 [1] CRAN (R 3.5.0)                           
 haven              2.2.0     2019-11-08 [1] CRAN (R 3.5.3)                           
 hms                0.5.3     2020-01-08 [1] CRAN (R 3.5.3)                           
 htmltools          0.4.0     2019-10-04 [1] CRAN (R 3.5.3)                           
 httpuv             1.5.1     2019-04-05 [1] CRAN (R 3.5.3)                           
 httr               1.4.1     2019-08-05 [1] CRAN (R 3.5.3)                           
 igraph             1.2.4.1   2019-04-22 [1] CRAN (R 3.5.3)                           
 IRanges            2.16.0    2018-10-30 [1] Bioconductor                             
 jsonlite           1.6       2018-12-07 [1] CRAN (R 3.5.2)                           
 KernSmooth         2.23-15   2015-06-29 [2] CRAN (R 3.5.1)                           
 later              0.8.0     2019-02-11 [1] CRAN (R 3.5.3)                           
 lattice            0.20-38   2018-11-04 [1] CRAN (R 3.5.3)                           
 lazyeval           0.2.2     2019-03-15 [1] CRAN (R 3.5.3)                           
 LearnBayes         2.15.1    2018-03-18 [1] CRAN (R 3.5.2)                           
 lifecycle          0.1.0     2019-08-01 [1] CRAN (R 3.5.3)                           
 lubridate          1.7.4     2018-04-11 [1] CRAN (R 3.5.1)                           
 magrittr           1.5       2014-11-22 [1] CRAN (R 3.5.1)                           
 MASS               7.3-51.4  2019-04-26 [2] CRAN (R 3.5.3)                           
 Matrix             1.2-17    2019-03-22 [1] CRAN (R 3.5.3)                           
 memoise            1.1.0     2017-04-21 [1] CRAN (R 3.5.1)                           
 mgcv               1.8-28    2019-03-21 [2] CRAN (R 3.5.3)                           
 mime               0.8       2019-12-19 [1] CRAN (R 3.5.3)                           
 modelr             0.1.5     2019-08-08 [1] CRAN (R 3.5.3)                           
 munsell            0.5.0     2018-06-12 [1] CRAN (R 3.5.1)                           
 nlme               3.1-140   2019-05-12 [1] CRAN (R 3.5.1)                           
 pacman             0.5.1     2019-03-11 [1] CRAN (R 3.5.3)                           
 permute            0.9-5     2019-03-12 [1] CRAN (R 3.5.3)                           
 pillar             1.4.3     2019-12-20 [1] CRAN (R 3.5.3)                           
 pkgbuild           1.0.3     2019-03-20 [1] CRAN (R 3.5.3)                           
 pkgconfig          2.0.3     2019-09-22 [1] CRAN (R 3.5.3)                           
 pkgload            1.0.2     2018-10-29 [1] CRAN (R 3.5.1)                           
 plyr               1.8.5     2019-12-10 [1] CRAN (R 3.5.3)                           
 prettyunits        1.1.0     2020-01-09 [1] CRAN (R 3.5.3)                           
 processx           3.4.1     2019-07-18 [1] CRAN (R 3.5.3)                           
 promises           1.0.1     2018-04-13 [1] CRAN (R 3.5.1)                           
 ps                 1.3.0     2018-12-21 [1] CRAN (R 3.5.3)                           
 purrr            * 0.3.3     2019-10-18 [1] CRAN (R 3.5.3)                           
 R6                 2.4.1     2019-11-12 [1] CRAN (R 3.5.3)                           
 radiator         * 1.1.2     2020-01-21 [1] Github (thierrygosselin/radiator@480ef0f)
 RColorBrewer       1.1-2     2014-12-07 [1] CRAN (R 3.5.0)                           
 Rcpp               1.0.3     2019-11-08 [1] CRAN (R 3.5.3)                           
 RCurl              1.95-4.13 2020-01-16 [1] CRAN (R 3.5.3)                           
 readr            * 1.3.1     2018-12-21 [1] CRAN (R 3.5.3)                           
 readxl             1.3.1     2019-03-13 [1] CRAN (R 3.5.3)                           
 remotes            2.1.0     2019-06-24 [1] CRAN (R 3.5.3)                           
 reprex             0.3.0     2019-05-16 [1] CRAN (R 3.5.3)                           
 reshape2           1.4.3     2017-12-11 [1] CRAN (R 3.5.1)                           
 rlang              0.4.2     2019-11-23 [1] CRAN (R 3.5.3)                           
 rprojroot          1.3-2     2018-01-03 [1] CRAN (R 3.5.1)                           
 rstudioapi         0.10      2019-03-19 [1] CRAN (R 3.5.1)                           
 rvest              0.3.5     2019-11-08 [1] CRAN (R 3.5.3)                           
 S4Vectors          0.20.1    2018-11-09 [1] Bioconductor                             
 scales             1.1.0     2019-11-18 [1] CRAN (R 3.5.3)                           
 SeqArray           1.22.6    2019-03-10 [1] Bioconductor                             
 seqinr             3.4-5     2017-08-01 [1] CRAN (R 3.5.1)                           
 sessioninfo        1.1.1     2018-11-05 [1] CRAN (R 3.5.1)                           
 sf                 0.7-6     2019-07-05 [1] CRAN (R 3.5.3)                           
 shiny              1.3.2     2019-04-22 [1] CRAN (R 3.5.1)                           
 sp                 1.3-1     2018-06-05 [1] CRAN (R 3.5.1)                           
 spData             0.3.0     2019-01-07 [1] CRAN (R 3.5.3)                           
 spdep              1.1-2     2019-04-05 [1] CRAN (R 3.5.3)                           
 stringi            1.4.5     2020-01-11 [1] CRAN (R 3.5.3)                           
 stringr          * 1.4.0     2019-02-10 [1] CRAN (R 3.5.3)                           
 testthat           2.1.1     2019-04-23 [1] CRAN (R 3.5.3)                           
 tibble           * 2.1.3     2019-06-06 [1] CRAN (R 3.5.3)                           
 tidyr            * 1.0.0     2019-09-11 [1] CRAN (R 3.5.3)                           
 tidyselect         0.2.5     2018-10-11 [1] CRAN (R 3.5.1)                           
 tidyverse        * 1.3.0     2019-11-21 [1] CRAN (R 3.5.3)                           
 units              0.6-3     2019-05-03 [1] CRAN (R 3.5.3)                           
 usethis            1.5.1     2019-07-04 [1] CRAN (R 3.5.3)                           
 vctrs              0.2.1     2019-12-17 [1] CRAN (R 3.5.3)                           
 vegan              2.5-5     2019-05-12 [1] CRAN (R 3.5.3)                           
 withr              2.1.2     2018-03-15 [1] CRAN (R 3.5.1)                           
 xml2               1.2.2     2019-08-09 [1] CRAN (R 3.5.3)                           
 xtable             1.8-4     2019-04-21 [1] CRAN (R 3.5.1)                           
 XVector            0.22.0    2018-10-30 [1] Bioconductor                             
 yaml               2.2.0     2018-07-25 [1] CRAN (R 3.5.1)                           
 zeallot            0.1.0     2018-01-28 [1] CRAN (R 3.5.3)                           
 zlibbioc           1.28.0    2018-10-30 [1] Bioconductor                             

[1] C:/Users/Adido/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library

Dataset included below

DArT_set.zip

thierrygosselin commented 4 years ago

Hi Ido, let me have a look...

thierrygosselin commented 4 years ago

1. the strata ... worked:

dart_strata <- readr::read_csv(file = "DArT_metadata.csv") %>% 
  dplyr::select(TARGET_ID = id, STRATA = State) %>% 
  dplyr::mutate(INDIVIDUALS = TARGET_ID)

You can verify that it's actually those in the DArT files using:

target.id <- radiator::extract_dart_target_id(data = "Report_DAsc19-4353_1_moreOrders_SNP_2.csv")

Note: Having / in your sample name might present some challenges in some packages/software, I don't remember testing it in radiator, so you might want to keep an eye on this...

2. silico dart data

The command below works

dir.create("output")
Arab_tidy_dart <- radiator::read_dart(
  data = "Report_DAsc19-4353_1_moreOrders_SilicoDArT_1.csv", 
  strata = dart_strata, 
  path.folder = "output",
  tidy.dart = TRUE
  )

Note: Not sure why you're using the output of that function as the input of radiator::filter_rad but it won't work.

3. filter_rad

Not sure what's your intentions to use the silico dart data inside filter_rad, but it won't work. It was not designed for this, I'll update the doc.

Ideally with DArT data you wan the count file (it's not automatically given by DArT, you have to ask). It's a file with coverage for both alleles. Similar to a VCF file with read depth info.

In the folder you have it's a genotype file with presence/absence in 2 rows, no count info, it works but it's limited for filtering for QC.

Below works for me:

Arab_dart <- radiator::filter_rad(
  data = "Report_DAsc19-4353_1_moreOrders_SNP_2.csv", 
  strata = dart_strata,
  output = c("genind"), 
  path.folder = "output"
  )

Using radiator v.1.1.3 (I've just pushed a new version with new function detect_microsatellites)

Cheers Thierry

IdoBar commented 4 years ago

Thanks for the prompt reply. It actually didn't work with DArT SNPs either, but I tested it now with the most recent build (and removed all the previous directories created by radiator) and it worked. I also tried to convert the silico-dart table to genind object and it failed. Trying to convert straight from the silico-dart file:

Arab_genind <- genomic_converter("data/Report-DAsc19-4353_ArabieiPlate3/Report_DAsc19-4353_1_moreOrders_SilicoDArT_1.csv", 
                                      strata = dart_strata, output = c("genind"))
################################################################################
######################### radiator::genomic_converter ##########################
################################################################################
Execution date@time: 20200122@1324
Folder created: -118_radiator_genomic_converter_20200122@1324
Function call and arguments stored in: radiator_genomic_converter_args_20200122@1324.tsv
Filters parameters file generated: filters_parameters_20200122@1324.tsv

Importing data

Error in generate_strata(input, pop.id = TRUE) : object 'input' not found

Computation time, overall: 0 sec

Computation time, overall: 0 sec
######################### completed genomic_converter ##########################

Trying to read it in as Tidy format first and then convert to genind

Arab_tidy_dart <- read_dart("data/Report-DAsc19-4353_ArabieiPlate3/Report_DAsc19-4353_1_moreOrders_SilicoDArT_1.csv", 
                            strata = dart_strata, output = c("genind"),
                            tidy.dart = TRUE, path.folder = "output")
Reading DArT file...
Number of blacklisted samples: 0
DArT SNP format: silico DArT
Synchronizing data and strata...
    Number of strata: 6
    Number of individuals: 281

Number of clones: 1841
Number of populations: 6
Number of individuals: 281

Number of ind/pop:
SA = 86
VIC = 58
WA = 48
SPAIN = 2
NSW = 65
QLD = 22

Number of duplicate id: 0

Computation time, overall: 2 sec

Arab_genind <- genomic_converter(Arab_tidy_dart, 
                                      strata = dart_strata, output = c("genind"))
################################################################################
######################### radiator::genomic_converter ##########################
################################################################################
Execution date@time: 20200122@0956
Folder created: -119_radiator_genomic_converter_20200122@0956
Function call and arguments stored in: radiator_genomic_converter_args_20200122@0956.tsv
Filters parameters file generated: filters_parameters_20200122@0956.tsv

Importing data

Synchronizing data and strata...
    Number of strata: 6
    Number of individuals: 281
Calibrating REF/ALT alleles...
The separator specified is not valid

Computation time, overall: 2 sec

Computation time, overall: 2 sec

Writing tidy data set:
radiator_data_20200122@0956.rad

Computation time, overall: 304 sec

Preparing data for output

Scanning for number of alleles per marker...
Data is multi-allelic
Generating adegenet genind object
Error in .local(.Object, ...) : 
  more than one '.' in column names; please name column as [LOCUS].[ALLELE]
In addition: There were 28 warnings (use warnings() to see them)

Computation time, overall: 315 sec
######################### completed genomic_converter ##########################

Thanks for looking into it.

thierrygosselin commented 4 years ago

Hi Ido, The silico dart data won't work in those functions, certainly not a format you can transform into a genind object. I suggest you read the DArT doc about the different format.

Currently, besides reading and tidying the silicons dart data, the only function in radiator you can use silico dart (will double check that) as input is in the sex markers function.

IdoBar commented 4 years ago

Why can't you transform them into a genind object? They can fit the PA (present/absent) type of markers (look at the relevant section in adegenet basics tutorial). I'll write a function for converting them from a tidy format to genind and will share it if you're interested.

thierrygosselin commented 4 years ago

adegenet as very limited use for them and I have no idea what to do with presence/absence data besides the sex markers function... radiator::filter_rad requires alleles and genotypes to do the filtering.

At one point I was developing the filtering functions to use genotype likelihood (GL/PL field in VCF) and stopped because most (~99%) of the datasets I saw using these where from low coverage experiments and where very very low quality. When you're doing this, you must have 100% control of the wet lab steps and bioinformatically experienced, at this point radiator is not really relevant.

Go for counts DArT data if you can for filter_rad and if it's impossible for you to get it, genotypes in 2 rows is ok.

thierrygosselin commented 4 years ago

Below is just an example of the limit of presence/absence:

However, it is clear that the usual Euclidean distance (used in PCA and sPCA), as well as many other distances, is not as accurate to measure genetic dissimilarity using presence/absence data as it is when using allele frequencies. The reason for this is that in presence/absence data, a part of the information is simply hidden. For instance, two individuals possessing the same allele will be considered at the same distance, whether they possess one or more copies of the allele. This might be especially problematic in organisms having a high degree of ploidy.