thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
59 stars 23 forks source link

sexy_markers operator is invalid for atomic vectors and Read Depth #187

Closed emilyostrow closed 6 months ago

emilyostrow commented 6 months ago

Hi Thierry,

I'm running into similar issues as the people who opened https://github.com/thierrygosselin/radiator/issues/183

I exported a VCF from STACKS 2 that was unfiltered and wanted to test it out with sexy_markers. I have read that you suggest some filtering prior to this program, but I am just trying to get it to work.

Radiator can import the data when just reading the data using test <- radiator::read_vcf(data="CPM_sexed_STACKS_unfiltered.vcf", strata = "CPM_inds_Knownsex.txt")

But sexy markers specifically is having issues. I believe this is with regard to interpreting depth in the VCF file. *note I did not filter the data to keep things as consistent as possible, but I would filter during an actual analysis.

CPM_sex_linked <- sexy_markers("CPM_sexed_STACKS_unfiltered.vcf", strata = "CPM_inds_Knownsex.txt")

`################################################################################ ############################ radiator::sexy_markers ############################ ################################################################################ Execution date@time: 20240429@1536 Folder created: sexy_markers_20240429@1536 File written: radiator_sexy_markers_args_20240429@1536.tsv
✔ Reading VCF [1s]

Cleaning VCF

Filter monomorphic markers Number of individuals / strata / chrom / locus / SNP: Blacklisted: 0 / 0 / 6 / 3782 / 7041

VCF statistics per individuals and markers Generating statistics ✔ SNPs per locus [120ms] ✔ SNP position on the read [148ms] ✔ Missing genotypes [294ms] ✔ Heterozygosity [388ms] ✔ MAC [261ms] ✔ Coverage ... [2.5s] Note: mean individual coverage SD = 0 correlation with total coverage is not possible ✔ Generating figures [8.4s] ✔ Writing files [1.7s]

VCF info: Number of chromosome/contig/scaffold: 32 Number of locus: 3243 Number of markers: 5321 Number of strata: 2 Number of individuals: 34

Number of ind/strata: F = 20 M = 14

Number of duplicate id: 0 radiator Genomic Data Structure (GDS) file: radiator_20240429@1536.gds

Filter monomorphic markers Number of individuals / strata / chrom / locus / SNP: Blacklisted: 0 / 0 / 0 / 0 / 0 ################################################################################ ############################# radiator::filter_ld ############################## ################################################################################ Execution date@time: 20240429@1536 ################################################################################ ######################### radiator::filter_individuals ######################### ################################################################################ Execution date@time: 20240429@1536

Unknowned arguments identified inside "...": dp

Read documentation, for latest changes, and modify your codes!

Function call and arguments stored in: radiator_filter_individuals_args_20240429@1536.tsv Interactive mode: on

Step 1. Visualization Step 2. Missingness Step 3. Heterozygosity Step 4. Coverage (if available)

Filters parameters file generated: filters_parameters_20240429@1536.tsv

Step 1. Visualization of samples QC

Generating statistics ✔ Missing genotypes [221ms] ✔ Heterozygosity [253ms] ✔ Coverage ... [260ms] Note: mean individual coverage SD = 0 correlation with total coverage is not possible ✔ Generating figures [2.7s] ✔ Writing files [486ms]

Step 2. Filtering markers based individual missingness/genotyping

Do you want to blacklist samples based on missingness ? (y/n): n

Step 3. Filtering markers based on individual heterozygosity

Do you want to blacklist samples based on heterozygosity ? (y/n): n

Step 4. Filtering markers based on individual's coverage

Do you want to blacklist samples based on TOTAL coverage ? (y/n): n Do you want to blacklist samples based on MEDIAN coverage ? (y/n): n Do you want to blacklist samples based on Interquartile Range (IQR) coverage ? (y/n): n

Computation time, overall: 13 sec ######################### completed filter_individuals ######################### Function call and arguments stored in: radiator_filter_ld_args_20240429@1536.tsv

Interactive mode: on

Step 1. Short distance LD threshold selection Step 2. Filtering markers based on short distance LD Step 3. Long distance LD pruning selection Step 4. Threshold selection Step 5. Filtering markers based on long distance LD

Filters parameters file generated: filters_parameters_20240429@1536.tsv Minimizing short distance LD... The range in the number of SNP/locus is: 1-25

Step 1. Short distance LD threshold selection the goal is to keep only 1 SNP per read/locus Choose the filter.short.ld threshold Options include: 1: mac (Not sure ? use mac...) 2: random 3: first 4: middle 5: last 1

Step 2. Filtering markers based on short distance LD filter.short.ld = mac File written: whitelist.short.ld.tsv
File written: blacklist.short.ld.tsv ################################### RESULTS ####################################

Filter short ld threshold: mac Number of individuals / strata / chrom / locus / SNP: Before: 34 / 2 / 32 / 3243 / 5321 Blacklisted: 0 / 0 / 0 / 0 / 2078 After: 34 / 2 / 32 / 3243 / 3243

Do you want to continue filtering using long distance ld ? (y/n): n

Computation time, overall: 16 sec ############################# completed filter_ld ##############################

Sex-ratio (F/M): 1.43 Error in SeqArray::seqGetData(data, "annotation/format/DP")$data : $ operator is invalid for atomic vectors In addition: Warning messages: 1: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 2: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 3: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 4: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 5: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 6: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 7: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 8: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values. 9: In ggplot2::scale_y_log10(labels = scales::number_format(), oob = scales::squish_infinite) : log-10 transformation introduced infinite values.

Computation time, overall: 34 sec ############################ completed sexy_markers ############################`

I just updated everything to make sure that it was all working together and I didn't miss a bug fix. Here is my session info:

`R version 4.4.0 (2024-04-24) Platform: x86_64-apple-darwin20 Running under: macOS Ventura 13.6.6

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Denver tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] SeqArray_1.43.8 gdsfmt_1.39.3 SNPfiltR_1.0.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[8] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 radiator_1.3.1 [15] vcfR_1.15.0

loaded via a namespace (and not attached): [1] gtable_0.3.5 lattice_0.22-6 tzdb_0.4.0 vctrs_0.6.5
[5] tools_4.4.0 generics_0.1.3 stats4_4.4.0 parallel_4.4.0
[9] fansi_1.0.6 cluster_2.1.6 pkgconfig_2.0.3 Matrix_1.7-0
[13] S4Vectors_0.41.7 lifecycle_1.0.4 GenomeInfoDbData_1.2.12 farver_2.1.1
[17] compiler_4.4.0 textshaping_0.3.7 Biostrings_2.71.6 munsell_0.5.1
[21] permute_0.9-7 GenomeInfoDb_1.39.14 pillar_1.9.0 crayon_1.5.2
[25] MASS_7.3-60.2 vegan_2.6-4 nlme_3.1-164 tidyselect_1.2.1
[29] digest_0.6.35 stringi_1.8.3 labeling_0.4.3 splines_4.4.0
[33] grid_4.4.0 colorspace_2.1-0 cli_3.6.2 magrittr_2.0.3
[37] utf8_1.2.4 ape_5.8 withr_3.0.0 scales_1.3.0
[41] UCSC.utils_0.99.7 bit64_4.0.5 timechange_0.3.0 XVector_0.43.1
[45] httr_1.4.7 matrixStats_1.3.0 bit_4.0.5 gridExtra_2.3
[49] ragg_1.3.0 hms_1.1.3 UpSetR_1.4.0 GenomicRanges_1.55.4
[53] IRanges_2.37.1 viridisLite_0.4.2 mgcv_1.9-1 rlang_1.1.3
[57] Rcpp_1.0.12 pinfsc50_1.3.0 glue_1.7.0 BiocGenerics_0.49.1
[61] vroom_1.6.5 rstudioapi_0.16.0 jsonlite_1.8.8 plyr_1.8.9
[65] R6_2.5.1 systemfonts_1.0.6 zlibbioc_1.49.3 `

Thanks for your help!

thierrygosselin commented 6 months ago

Sorry about the issue Emily, I'll have a look at the bug tomorrow.

thierrygosselin commented 6 months ago

On it this morning...

thierrygosselin commented 6 months ago

I have read that you suggest some filtering prior to this program, but I am just trying to get it to work.

Yes otherwise, it will be just non sense.

thierrygosselin commented 6 months ago

Using read_vcf you probably observed that you have duplicated markers on different strands, those should definitely go

thierrygosselin commented 6 months ago
Filter duplicated markers on different strands: blacklist
Number of individuals / strata / chrom / locus / SNP:
    Before: 34 / 2 / 38 / 7034 / 12375
    Blacklisted: 0 / 0 / 0 / 9 / 13
    After: 34 / 2 / 38 / 7025 / 12362
thierrygosselin commented 6 months ago

The next part are automatic in read_vcf but can easily be turned off when you read the doc

monomorphic markers are discarded

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 6 / 3782 / 7041

You have quite a few, so you probably removed individuals from the strata or both strata and VCF

Unless you're doing fancy analysis, it will keep markers in common among your groupings

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 0 / 28 / 62

You end up with this VCF

VCF info:
Number of chromosome/contig/scaffold: 32
Number of locus: 3215
Number of markers: 5259
Number of strata: 2
Number of individuals: 34

Number of ind/strata:
F = 20
M = 14
thierrygosselin commented 6 months ago

The function should work properly with VCF with the latest push, however I don't think it's appropriate or safe to use the results.

The dataset you sent was really low coverage and as other problems, again read the function doc for more details on assumptions and prerequisites.

emilyostrow commented 6 months ago

Thanks for taking a look at this. I appreciate the heads up. This is just a first step to see if we can work with the data we already have.