Closed: neutronstar21 closed this issue 1 year ago
Dear Dean, Sorry for the late reply on this.
If this is still relevant, try installing the latest radiator.
A couple of steps before using filter_rad:
library(radiator)
shark <- radiator::read_dart(
data = "Report_DSha21-6120_SNP_mapping_2_DCB_w_4D_Sample_Code.csv",
strata = "popmap_no_reps_by_min_missing_loci_perc_wo_2_inds__POP_STRATA_v2.tsv",
verbose = TRUE
)
It takes a while, but it's worth the wait for the speed the GDS format gives afterwards...
################################### SUMMARY ####################################
Number of chrom: 1
Number of locus: 18882
Number of SNPs: 20675
Number of strata: 1
Number of individuals: 519
Number of ind/strata:
NSW = 519
Number of duplicate id: 0
Computation time, overall: 110 sec
Next, what I like to do when I receive a data set is to run a couple of QC steps.
radiator::detect_duplicate_genomes will check for technical duplicates (DArT uses those), but with most projects I see a lot of wet lab problems or sampling errors...
test1 <- radiator::detect_duplicate_genomes(data = shark)
In the folder it generates, check this figure: manhattan.plot.distance.png
Answer n to the question on the console; don't blacklist samples just yet...
radiator::detect_mixed_genomes looks at individual heterozygosity. It's a good way for me to see where data quality problems come from. Usually it's the wet lab, sometimes it's tissue quality. I've seen this with white shark data where the DNA came from tissue taken from dead sharks caught in beach protection nets. You just don't know how long the dead shark was in the water...
test2 <- radiator::detect_mixed_genomes(data = shark)
The interesting figure here is individual.heterozygosity.manhattan.plot.png. This shows the outlier samples in the data...
Too many heterozygous loci in an individual make it look closer to everybody (it shares an allele with everyone); too few is usually because of bad DNA and lots of missing data (big bubbles below the IQR, if you like box plots).
When tiny bubbles stand out higher in the figure, it can be a number of things:
I usually filter out the duplicates and the mixed individuals inside filter_rad, but here you could already take those out just to see the outcome...
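As a rough illustration of what the individual heterozygosity figure summarises, here is a minimal base R sketch on simulated genotypes. It does not call radiator: the 0/1/2 genotype coding, the simulated samples and the 1.5 x IQR box plot rule are all assumptions made for the example, and radiator's own outlier thresholds may be computed differently.

```r
set.seed(42)

# Simulated genotypes (assumption, not real DArT data):
# rows = individuals, columns = SNPs, coded as the number of
# alternate alleles (0, 1, 2), with NA for a missing genotype.
n_ind <- 50
n_snp <- 2000
geno <- matrix(
  sample(0:2, n_ind * n_snp, replace = TRUE, prob = c(0.6, 0.2, 0.2)),
  nrow = n_ind,
  dimnames = list(paste0("ind_", seq_len(n_ind)), NULL)
)

# ind_1 mimics a mixed sample: far too many heterozygous calls.
geno["ind_1", ] <- sample(0:2, n_snp, replace = TRUE, prob = c(0.3, 0.5, 0.2))

# ind_2 mimics bad DNA: many missing genotypes and few heterozygotes.
geno["ind_2", ] <- sample(c(0L, 1L, 2L, NA), n_snp, replace = TRUE,
                          prob = c(0.45, 0.05, 0.10, 0.40))

# Individual heterozygosity: heterozygous calls / non-missing calls.
ind_het <- rowSums(geno == 1, na.rm = TRUE) / rowSums(!is.na(geno))

# Individual missingness: proportion of missing genotypes.
ind_missing <- rowMeans(is.na(geno))

# Outlier thresholds with the usual box plot rule (assumption).
q <- quantile(ind_het, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
het_low <- q[1] - 1.5 * iqr
het_high <- q[2] + 1.5 * iqr

# Samples falling outside the thresholds.
outliers <- names(ind_het)[ind_het < het_low | ind_het > het_high]

print(round(c(low = het_low, high = het_high), 4))
print(data.frame(
  id = outliers,
  het = round(ind_het[outliers], 3),
  missing = round(ind_missing[outliers], 3),
  row.names = NULL
))
```

In this toy data the low-heterozygosity outlier is also the sample with the most missing genotypes, which is the "big bubble below the IQR" pattern described above.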
To the question on the console:
Inspect plots and tables in folder created...
Do you want to exclude individuals based on heterozygosity ? (y/n):
answer: y
The thresholds I use here are based on the figure (the subtitle: overall data outlier thresholds (low/high): 0.166314/0.199396)
...
Enter the min value for ind.heterozygosity.threshold argument (0 turns off):
use: 0.166314
Enter the max value for ind.heterozygosity.threshold argument (1 turns off):
use: 0.199396
Filter individual's heterozygosity: 35 individual(s) blacklisted
################################### RESULTS ####################################
Detect mixed genomes: 0.166314 0.199396
Number of individuals / strata / chrom / locus / SNP:
Before: 519 / 1 / 1 / 18882 / 20675
Blacklisted: 35 / 0 / 0 / 0 / 0
After: 484 / 1 / 1 / 18882 / 20675
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 0 / 1313 / 1610
Computation time, overall: 27 sec
######################## completed detect_mixed_genomes ########################
Then run radiator::detect_duplicate_genomes again, this time on the filtered data:
test3 <- radiator::detect_duplicate_genomes(data = test2)
You will see how much the figure has changed, for the better. The test1 run showed samples that are clearly duplicates (the bubbles close to 0). The ones above 0.25 ... not normal. It should be below 0.5 unless you sampled families of sharks that are all linked or related ...
The test3 is what I would expect from a normal DArT dataset.
Inspect tables and figures to decide if some individual(s) need to be blacklisted
Do you need to blacklist individual(s) (y/n):
type: y
2 options to remove duplicates:
1. threshold: using the figure you choose a threshold. It's more powerful to fully remove duplicates
2. manually: the function generate a blacklist that you have to complete
Note: not sure ? Use option 1, it's more powerful to fully remove duplicates
Enter the option to remove duplicates (1/2):
use : 1
Enter the threshold to remove duplicates: (between 0 and 1)
use : 0.25
2 options to remove duplicates involved in pairs from different strata/group:
(the black points on the figure, above your threshold)
1: blacklist both samples in the pair
2: blacklist only 1 sample, based on missingness
Enter 1/2:
I would use: 2, but there are times I use: 1
With threshold selected, 15 individual(s) blacklisted
Written in the directory: blacklist.id.similar.tsv
Blacklisted individuals: 15 ind.
Filtering with blacklist of individuals
################################### RESULTS ####################################
Detect duplicate genomes: 0.25
Number of individuals / strata / chrom / locus / SNP:
Before: 484 / 1 / 1 / 17569 / 19065
Blacklisted: 15 / 0 / 0 / 0 / 0
After: 469 / 1 / 1 / 17569 / 19065
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 0 / 64 / 74
Computation time, overall: 171 sec
###################### completed detect_duplicate_genomes ######################
Note: the .tsv files in the folders can be opened in e.g. Excel, where you can easily check the samples that are considered duplicates...
The samples above 0.75 around the 0.50 line are probably close kin ... but I would clean my dataset first with filter_rad...
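The threshold option from the console session (keep pairs below a distance threshold, then blacklist only one sample per pair based on missingness) can be sketched in base R. This is only an illustration on simulated genotypes and does not call radiator: the 0/1/2 coding, the simulated replicate, and scaling the Manhattan distance by the largest observed distance are assumptions, since the thread does not show how radiator scales its distances.

```r
set.seed(1)

# Simulated genotypes (assumption, not real DArT data):
# rows = individuals, columns = SNPs, coded 0/1/2, NA = missing.
n_ind <- 20
n_snp <- 1000
geno <- matrix(
  sample(0:2, n_ind * n_snp, replace = TRUE, prob = c(0.6, 0.2, 0.2)),
  nrow = n_ind,
  dimnames = list(paste0("ind_", seq_len(n_ind)), NULL)
)

# Make ind_2 a technical replicate of ind_1: same genotypes,
# 20 re-drawn calls (genotyping noise) and 100 missing calls.
geno["ind_2", ] <- geno["ind_1", ]
err <- sample(n_snp, 20)
geno["ind_2", err] <- sample(0:2, length(err), replace = TRUE)
geno["ind_2", sample(n_snp, 100)] <- NA

# Pairwise Manhattan distance between individuals. stats::dist()
# skips missing values and rescales the sum to the full marker count.
d <- as.matrix(dist(geno, method = "manhattan"))

# Relative distance in [0, 1], scaled by the largest observed
# distance (assumption for this sketch).
d_rel <- d / max(d)

# One row per pair of individuals (upper triangle only).
idx <- which(upper.tri(d_rel), arr.ind = TRUE)
pair_table <- data.frame(
  id1 = rownames(d_rel)[idx[, "row"]],
  id2 = colnames(d_rel)[idx[, "col"]],
  distance_relative = d_rel[idx],
  stringsAsFactors = FALSE
)

# Pairs below the threshold are treated as duplicates.
threshold <- 0.25
duplicates <- pair_table[pair_table$distance_relative < threshold, ]

# Blacklist one sample per pair: the one with more missing genotypes.
ind_missing <- rowMeans(is.na(geno))
blacklist <- unique(unname(ifelse(
  ind_missing[duplicates$id1] >= ind_missing[duplicates$id2],
  duplicates$id1,
  duplicates$id2
)))

print(duplicates)
print(blacklist)
```

Keeping the sample with less missing data from each pair retains the better genotyped replicate, which is the idea behind option 2 in the console prompt.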
Email me if you have questions, and reopen the issue if you still have problems with radiator...
Describe the bug
Hi Thierry, thanks for RADIATOR. I want to filter a DArT SNP mapping file using filter_rad(), but I encounter this error and can't proceed:
function filter_rad() fails with error: "Error in extract_coverage(gds, markers = FALSE) : object 'coverage.info' not found"
I saw that you recently addressed a similar error ("* bug fix using DArT data") and I believe I'm using the latest version of radiator (1.2.2), so I'm not sure if it's a related problem.
Thanks for your help.
To Reproduce
str_File_Name_METADATA = 'popmap_no_reps_by_min_missing_loci_perc_wo_2_inds__POP_STRATA_v2.tsv'
str_File_Name_DATA = 'Report_DSha21-6120_SNP_mapping_2_DCB_w_4D_Sample_Code.csv'
filter_rad(
  data = str_File_Name_DATA,
  strata = str_File_Name_METADATA,
  interactive.filter = TRUE,
  output = NULL,
  filename = NULL,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1
)
Number of chrom: 1
Number of locus: 18882
Number of SNPs: 20675
Number of strata: 1
Number of individuals: 519
Number of ind/strata:
NSW = 519
Number of duplicate id: 0
Computation time, overall: 29 sec
################################################################################
#################### radiator::filter_dart_reproducibility #####################
################################################################################
Execution date@time: 20220117@0928
Function call and arguments stored in: radiator_filter_dart_reproducibility_args_20220117@0928.tsv
Interactive mode: on
2 steps to visualize and filter the data based on reproducibility:
Step 1. Visualization
Step 2. Choose the filtering threshold
File written: dart_reproducibility_stats.tsv
File written: dart_reproducibility_boxplot_20220117@0928.pdf
Generating helper table...
Files written: helper tables and plots
Step 2. Filtering markers based on markers reproducibility
Do you still want to blacklist markers? (y/n): n
Computation time, overall: 11 sec
#################### completed filter_dart_reproducibility #####################
################################################################################
######################### radiator::filter_monomorphic #########################
################################################################################
Execution date@time: 20220117@0928
Function call and arguments stored in: radiator_filter_monomorphic_args_20220117@0928.tsv
File written: whitelist.polymorphic.markers_20220117@0928.tsv
################################### RESULTS ####################################
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Before: 519 / 1 / 1 / 18882 / 20675
Blacklisted: 0 / 0 / 0 / 0 / 0
After: 519 / 1 / 1 / 18882 / 20675
Computation time, overall: 1 sec
######################### completed filter_monomorphic #########################
################################################################################
####################### radiator::filter_common_markers ########################
################################################################################
Execution date@time: 20220117@0928
Function call and arguments stored in: radiator_filter_common_markers_args_20220117@0928.tsv
Scanning for common markers...
Filter common markers: only 1 strata, returning data
Computation time, overall: 0 sec
####################### completed filter_common_markers ########################
################################################################################
######################### radiator::filter_individuals #########################
################################################################################
Execution date@time: 20220117@0928
Function call and arguments stored in: radiator_filter_individuals_args_20220117@0928.tsv
Interactive mode: on
Step 1. Visualization
Step 2. Missingness
Step 3. Heterozygosity
Step 4. Coverage (if available)
Step 1. Visualization of samples QC
Error in extract_coverage(gds, markers = FALSE) : object 'coverage.info' not found
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Computation time, overall: 0 sec
######################### completed filter_individuals #########################
Computation time, overall: 43 sec
############################# completed filter_rad #############################
the output of
devtools::session_info()
Session info ------------------------------------------------------------------------------------------------------
setting value
version R version 4.1.1 (2021-08-10)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
language (EN)
collate English_Australia.1252
ctype English_Australia.1252
tz Australia/Sydney
date 2022-01-17
Packages ----------------------------------------------------------------------------------------------------------
package version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
BiocGenerics 0.40.0 2021-10-26 [1] Bioconductor
BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.1.0)
Biostrings 2.62.0 2021-10-26 [1] Bioconductor
bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0)
broom 0.7.11 2022-01-03 [1] CRAN (R 4.1.2)
cachem 1.0.5 2021-05-15 [1] CRAN (R 4.1.0)
callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.2)
colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.2)
curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0)
data.table 1.14.2 2021-09-27 [1] CRAN (R 4.1.2)
DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.2)
desc 1.3.0 2021-03-05 [1] CRAN (R 4.1.0)
devtools 2.4.2 2021-06-07 [1] CRAN (R 4.1.0)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.1)
farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
fst 0.9.4 2020-08-27 [1] CRAN (R 4.1.0)
gdsfmt 1.30.0 2021-10-26 [1] Bioconductor
generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.2)
GenomeInfoDb 1.30.0 2021-10-26 [1] Bioconductor
GenomeInfoDbData 1.2.7 2022-01-16 [1] Bioconductor
GenomicRanges 1.46.1 2021-11-18 [1] Bioconductor
ggplot2 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
glue 1.6.0 2021-12-17 [1] CRAN (R 4.1.2)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
HardyWeinberg 1.7.4 2021-11-26 [1] CRAN (R 4.1.2)
hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.2)
IRanges 2.28.0 2021-10-26 [1] Bioconductor
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
lattice 0.20-44 2021-05-02 [1] CRAN (R 4.1.1)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.2)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0)
mice 3.14.0 2021-11-24 [1] CRAN (R 4.1.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
nnet 7.3-16 2021-05-03 [1] CRAN (R 4.1.1)
pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.2)
pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.1.0)
plyr 1.8.6 2020-03-03 [1] CRAN (R 4.1.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.2)
radiator 1.2.2 2022-01-16 [1] Github (thierrygosselin/radiator@6efdf14)
ragg 1.2.0 2021-10-30 [1] CRAN (R 4.1.1)
Rcpp 1.0.8 2022-01-13 [1] CRAN (R 4.1.2)
RCurl 1.98-1.5 2021-09-17 [1] CRAN (R 4.1.1)
readr 2.1.1 2021-11-30 [1] CRAN (R 4.1.2)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.2)
rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.2)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
Rsolnp 1.16 2015-12-28 [1] CRAN (R 4.1.0)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
S4Vectors 0.32.3 2021-11-21 [1] Bioconductor
scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
SeqArray 1.34.0 2021-10-26 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
SNPRelate 1.26.0 2021-05-19 [1] Bioconductor
stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
systemfonts 1.0.3 2021-10-13 [1] CRAN (R 4.1.1)
testthat 3.0.4 2021-07-01 [1] CRAN (R 4.1.0)
textshaping 0.3.6 2021-10-13 [1] CRAN (R 4.1.1)
tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.2)
tidyr 1.1.4 2021-09-27 [1] CRAN (R 4.1.2)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
truncnorm 1.0-8 2018-02-27 [1] CRAN (R 4.1.0)
tzdb 0.2.0 2021-10-27 [1] CRAN (R 4.1.2)
UpSetR 1.4.0 2019-05-22 [1] CRAN (R 4.1.0)
usethis 2.0.1 2021-02-10 [1] CRAN (R 4.1.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
vroom 1.5.7 2021-11-30 [1] CRAN (R 4.1.2)
withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.2)
XVector 0.34.0 2021-10-26 [1] Bioconductor
zlibbioc 1.40.0 2021-10-26 [1] Bioconductor
[1] M:/ENVI/installed/R/R-4.1.1/library
popmap_no_reps_by_min_missing_loci_perc_wo_2_inds__POP_STRATA_v2.zip
Report_DSha21-6120_SNP_mapping_2_DCB_w_4D_Sample_Code.zip