Closed emilyostrow closed 6 months ago
Sorry about the issue Emily, I'll have a look at the bug tomorrow.
On it this morning...
I have read that you suggest some filtering prior to this program, but I am just trying to get it to work.
Yes otherwise, it will be just non sense.
Using read_vcf
you probably observed that you have duplicated markers on different strands, those should definitely go
Filter duplicated markers on different strands: blacklist
Number of individuals / strata / chrom / locus / SNP:
Before: 34 / 2 / 38 / 7034 / 12375
Blacklisted: 0 / 0 / 0 / 9 / 13
After: 34 / 2 / 38 / 7025 / 12362
The next part are automatic in read_vcf
but can easily be turned off when you read the doc
monomorphic markers are discarded
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 6 / 3782 / 7041
You have quite a few, so you probably removed individuals from the strata or both strata and VCF
Unless you're doing fancy analysis, it will keep markers in common among your groupings
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 0 / 28 / 62
You end up with this VCF
VCF info:
Number of chromosome/contig/scaffold: 32
Number of locus: 3215
Number of markers: 5259
Number of strata: 2
Number of individuals: 34
Number of ind/strata:
F = 20
M = 14
The function should work properly with VCF with the latest push, however I don't think it's appropriate or safe to use the results.
The dataset you sent was really low coverage and as other problems, again read the function doc for more details on assumptions and prerequisites.
Thanks for taking a look at this. I appreciate the heads up. This is just a first step to see if we can work with the data we already have.
Hi Thierry,
I'm running into similar issues as the people who opened https://github.com/thierrygosselin/radiator/issues/183
I exported a VCF from STACKS 2 that was unfiltered and wanted to test it out with sexy_markers. I have read that you suggest some filtering prior to this program, but I am just trying to get it to work.
Radiator can import the data when just reading the data using
test <- radiator::read_vcf(data="CPM_sexed_STACKS_unfiltered.vcf", strata = "CPM_inds_Knownsex.txt")
But sexy markers specifically is having issues. I believe this is with regard to interpreting depth in the VCF file. *note I did not filter the data to keep things as consistent as possible, but I would filter during an actual analysis.
CPM_sex_linked <- sexy_markers("CPM_sexed_STACKS_unfiltered.vcf", strata = "CPM_inds_Knownsex.txt")
`################################################################################ ############################ radiator::sexy_markers ############################ ################################################################################ Execution date@time: 20240429@1536 Folder created: sexy_markers_20240429@1536 File written: radiator_sexy_markers_args_20240429@1536.tsv
✔ Reading VCF [1s]
Cleaning VCF
Filter monomorphic markers Number of individuals / strata / chrom / locus / SNP: Blacklisted: 0 / 0 / 6 / 3782 / 7041
VCF statistics per individuals and markers Generating statistics ✔ SNPs per locus [120ms] ✔ SNP position on the read [148ms] ✔ Missing genotypes [294ms] ✔ Heterozygosity [388ms] ✔ MAC [261ms] ✔ Coverage ... [2.5s] Note: mean individual coverage SD = 0 correlation with total coverage is not possible ✔ Generating figures [8.4s] ✔ Writing files [1.7s]
VCF info: Number of chromosome/contig/scaffold: 32 Number of locus: 3243 Number of markers: 5321 Number of strata: 2 Number of individuals: 34
Number of ind/strata: F = 20 M = 14
Number of duplicate id: 0 radiator Genomic Data Structure (GDS) file: radiator_20240429@1536.gds
Filter monomorphic markers Number of individuals / strata / chrom / locus / SNP: Blacklisted: 0 / 0 / 0 / 0 / 0 ################################################################################ ############################# radiator::filter_ld ############################## ################################################################################ Execution date@time: 20240429@1536 ################################################################################ ######################### radiator::filter_individuals ######################### ################################################################################ Execution date@time: 20240429@1536
Unknowned arguments identified inside "...": dp
Read documentation, for latest changes, and modify your codes!
Function call and arguments stored in: radiator_filter_individuals_args_20240429@1536.tsv Interactive mode: on
Step 1. Visualization Step 2. Missingness Step 3. Heterozygosity Step 4. Coverage (if available)
Filters parameters file generated: filters_parameters_20240429@1536.tsv
Step 1. Visualization of samples QC
Generating statistics ✔ Missing genotypes [221ms] ✔ Heterozygosity [253ms] ✔ Coverage ... [260ms] Note: mean individual coverage SD = 0 correlation with total coverage is not possible ✔ Generating figures [2.7s] ✔ Writing files [486ms]
Step 2. Filtering markers based individual missingness/genotyping
Do you want to blacklist samples based on missingness ? (y/n): n
Step 3. Filtering markers based on individual heterozygosity
Do you want to blacklist samples based on heterozygosity ? (y/n): n
Step 4. Filtering markers based on individual's coverage
Do you want to blacklist samples based on TOTAL coverage ? (y/n): n Do you want to blacklist samples based on MEDIAN coverage ? (y/n): n Do you want to blacklist samples based on Interquartile Range (IQR) coverage ? (y/n): n
Computation time, overall: 13 sec ######################### completed filter_individuals ######################### Function call and arguments stored in: radiator_filter_ld_args_20240429@1536.tsv
Interactive mode: on
Step 1. Short distance LD threshold selection Step 2. Filtering markers based on short distance LD Step 3. Long distance LD pruning selection Step 4. Threshold selection Step 5. Filtering markers based on long distance LD
Filters parameters file generated: filters_parameters_20240429@1536.tsv Minimizing short distance LD... The range in the number of SNP/locus is: 1-25
Step 1. Short distance LD threshold selection the goal is to keep only 1 SNP per read/locus Choose the filter.short.ld threshold Options include: 1: mac (Not sure ? use mac...) 2: random 3: first 4: middle 5: last 1
Step 2. Filtering markers based on short distance LD filter.short.ld = mac File written: whitelist.short.ld.tsv
File written: blacklist.short.ld.tsv ################################### RESULTS ####################################
Filter short ld threshold: mac Number of individuals / strata / chrom / locus / SNP: Before: 34 / 2 / 32 / 3243 / 5321 Blacklisted: 0 / 0 / 0 / 0 / 2078 After: 34 / 2 / 32 / 3243 / 3243
Do you want to continue filtering using long distance ld ? (y/n): n
Computation time, overall: 16 sec ############################# completed filter_ld ##############################
Sex-ratio (F/M): 1.43 Error in SeqArray::seqGetData(data, "annotation/format/DP")$data : $ operator is invalid for atomic vectors In addition: Warning messages: 1: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 2: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 3: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 4: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 5: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 6: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 7: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 8: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 9: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values.
Computation time, overall: 34 sec ############################ completed sexy_markers ############################`
I just updated everything to make sure that it was all working together and I didn't miss a bug fix. Here is my session info:
`R version 4.4.0 (2024-04-24) Platform: x86_64-apple-darwin20 Running under: macOS Ventura 13.6.6
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Denver tzcode source: internal
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] SeqArray_1.43.8 gdsfmt_1.39.3 SNPfiltR_1.0.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[8] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 radiator_1.3.1 [15] vcfR_1.15.0
loaded via a namespace (and not attached): [1] gtable_0.3.5 lattice_0.22-6 tzdb_0.4.0 vctrs_0.6.5
[5] tools_4.4.0 generics_0.1.3 stats4_4.4.0 parallel_4.4.0
[9] fansi_1.0.6 cluster_2.1.6 pkgconfig_2.0.3 Matrix_1.7-0
[13] S4Vectors_0.41.7 lifecycle_1.0.4 GenomeInfoDbData_1.2.12 farver_2.1.1
[17] compiler_4.4.0 textshaping_0.3.7 Biostrings_2.71.6 munsell_0.5.1
[21] permute_0.9-7 GenomeInfoDb_1.39.14 pillar_1.9.0 crayon_1.5.2
[25] MASS_7.3-60.2 vegan_2.6-4 nlme_3.1-164 tidyselect_1.2.1
[29] digest_0.6.35 stringi_1.8.3 labeling_0.4.3 splines_4.4.0
[33] grid_4.4.0 colorspace_2.1-0 cli_3.6.2 magrittr_2.0.3
[37] utf8_1.2.4 ape_5.8 withr_3.0.0 scales_1.3.0
[41] UCSC.utils_0.99.7 bit64_4.0.5 timechange_0.3.0 XVector_0.43.1
[45] httr_1.4.7 matrixStats_1.3.0 bit_4.0.5 gridExtra_2.3
[49] ragg_1.3.0 hms_1.1.3 UpSetR_1.4.0 GenomicRanges_1.55.4
[53] IRanges_2.37.1 viridisLite_0.4.2 mgcv_1.9-1 rlang_1.1.3
[57] Rcpp_1.0.12 pinfsc50_1.3.0 glue_1.7.0 BiocGenerics_0.49.1
[61] vroom_1.6.5 rstudioapi_0.16.0 jsonlite_1.8.8 plyr_1.8.9
[65] R6_2.5.1 systemfonts_1.0.6 zlibbioc_1.49.3 `
Thanks for your help!