thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0
59 stars 23 forks source link

Stacks haplotype VCF 'tidy_data' warnings - #67

Closed AvdReis closed 4 years ago

AvdReis commented 4 years ago

Hi Thierry

I think the issue I am having is similar to https://github.com/thierrygosselin/radiator/issues/11

I am trying to tidy my Stacks haplotype VCF file and it is resulting in warnings.

`tidy_genomic_data("./populations.haps.vcf", strata = "strata.tsv", parallel.core = 1, filter.common.markers = FALSE, filter.monomorphic = FALSE) ################################################################################ ######################### radiator::tidy_genomic_data ########################## ################################################################################ Execution date/time: 20191119@0128 Folder created: 13_radiator_tidy_genomic_20191119@0128 Function call and arguments stored in: radiator_tidy_genomic_data_args_20191119@0128.tsv Analyzing strata file Number of strata: 5 Number of individuals: 87 Importing and tidying the VCF... Execution date@time: 20191119@0128

Reading VCF Data summary: number of samples: 87 number of markers: 1054 done! timing: 8 sec

Generating individual stats... [==================================================] 100%, completed, 0s Generating markers stats... [==================================================] 100%, completed, 0s [==================================================] 100%, completed, 0s

Number of chromosome/contig/scaffold: 1 Number of locus: 1054 Number of markers: 1054 Number of populations: 5 Number of individuals: 87

Number of ind/pop: East_LNI = 18 East_UNI = 18 WC = 19 CR = 15 AI = 17

Number of duplicate id: 0 radiator Genomic Data Structure (GDS) file: radiator_20191119@0128.gds File written: random.seed (171765)

Genotypes formats generated with 1054 SNPs: GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE

Keeping vcf genotypes metadata: yes genotypes metadata:

Calculating REF/ALT alleles... Calibrating REF/ALT alleles... generating REF/ALT dictionary number of REF/ALT switch = 39 integrating genotypes codings... Synchronizing data and strata... Number of strata: 5 Number of individuals: 87 Updating GDS with genotypes.meta values

Tidy data file written: radiator_20191119@0131.rad

Computation time, overall: 162 sec ############################## completed tidy_vcf ############################## Erasing genotype: no Filters parameters file generated: filters_parameters_20191119@0128.tsv Filters parameters file: initiated ################################### RESULTS #################################### Tidy data written in global environment Data format: vcf.file Multiallelic data

Tidy genomic data: Number of markers: 1054 Number of chromosome/contig/scaffold: 1 Number of strata: 5 Number of individuals: 87

Computation time, overall: 164 sec ######################### tidy_genomic_data completed ########################## There were 21 warnings (use warnings() to see them)

warnings() Warning messages: 1: In matrix(data = ., nrow = n.markers, ncol = 2, byrow = TRUE, ... : data length [3568] is not a sub-multiple or multiple of the number of rows [1054] 2: In min(x, na.rm = TRUE) : no non-missing arguments to min; returning Inf 3: In max(x, na.rm = TRUE) : no non-missing arguments to max; returning -Inf 4: In min(x, na.rm = TRUE) : no non-missing arguments to min; returning Inf 5: In max(x, na.rm = TRUE) : no non-missing arguments to max; returning -Inf 6: In min(x, na.rm = TRUE) : no non-missing arguments to min; returning Inf 7: In max(x, na.rm = TRUE) : no non-missing arguments to max; returning -Inf 8: In min(x, na.rm = TRUE) : no non-missing arguments to min; returning Inf 9: In max(x, na.rm = TRUE) : no non-missing arguments to max; returning -Inf 10: Removed 4 rows containing missing values (geom_boxplot). 11: Removed 4 rows containing missing values (geom_segment). 12: Removed 4 rows containing missing values (geom_segment). 13: Removed 4 rows containing missing values (geom_boxplot). 14: Removed 4 rows containing missing values (geom_segment). 15: Removed 4 rows containing missing values (geom_segment). 16: In serialize(data, node$con) : 'package:stats' may not be available when loading 17: In serialize(data, node$con) : 'package:stats' may not be available when loading 18: In serialize(data, node$con) : 'package:stats' may not be available when loading 19: In serialize(data, node$con) : 'package:stats' may not be available when loading 20: In serialize(data, node$con) : 'package:stats' may not be available when loading 21: In serialize(data, node$con) : 'package:stats' may not be available when loading
`

Session info - I have just updated R to update packages as I thought that may have been a problem, but it only got rid of 6 warnings...

`R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18362)

Matrix products: default

Random number generation: RNG: Mersenne-Twister Normal: Inversion Sample: Rounding

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] radiator_1.1.2

loaded via a namespace (and not attached): [1] nlme_3.1-140 bitops_1.0-6 fs_1.3.1
[4] usethis_1.5.1 devtools_2.2.1 rprojroot_1.3-2
[7] GenomeInfoDb_1.20.0 tools_3.6.1 backports_1.1.5
[10] R6_2.4.1 rpart_4.1-15 BiocGenerics_0.30.0
[13] lazyeval_0.2.2 mgcv_1.8-28 colorspace_1.4-1
[16] jomo_2.6-10 nnet_7.3-12 withr_2.1.2
[19] tidyselect_0.2.5 prettyunits_1.0.2 processx_3.4.1
[22] curl_4.2 compiler_3.6.1 cli_1.1.0
[25] Biobase_2.44.0 mice_3.6.0 desc_1.2.0
[28] labeling_0.3 scales_1.0.0 readr_1.3.1
[31] callr_3.3.2 stringr_1.4.0 digest_0.6.22
[34] minqa_1.2.4 GWASExactHW_1.01 XVector_0.24.0
[37] pkgconfig_2.0.3 fst_0.9.0 lme4_1.1-21
[40] sessioninfo_1.1.1 rlang_0.4.1 rstudioapi_0.10
[43] generics_0.0.2 dplyr_0.8.3 RCurl_1.95-4.12
[46] magrittr_1.5 GenomeInfoDbData_1.2.1 Matrix_1.2-17
[49] Rcpp_1.0.3 munsell_0.5.0 S4Vectors_0.22.1
[52] lifecycle_0.1.0 stringi_1.4.3 MASS_7.3-51.4
[55] zlibbioc_1.30.0 plyr_1.8.4 pkgbuild_1.0.6
[58] grid_3.6.1 parallel_3.6.1 mitml_0.3-7
[61] crayon_1.3.4 lattice_0.20-38 Biostrings_2.52.0
[64] splines_3.6.1 hms_0.5.2 zeallot_0.1.0
[67] ps_1.3.0 pillar_1.4.2 GenomicRanges_1.36.1
[70] boot_1.3-22 logistf_1.23 reshape2_1.4.3
[73] gdsfmt_1.20.0 stats4_3.6.1 pkgload_1.0.2
[76] pan_1.6 glue_1.3.1 data.table_1.12.6
[79] remotes_2.1.0 vctrs_0.2.0 nloptr_1.2.1
[82] testthat_2.3.0 gtable_0.3.0 purrr_0.3.3
[85] tidyr_1.0.0 SeqArray_1.24.2 assertthat_0.2.1
[88] ggplot2_3.2.1 broom_0.5.2 SeqVarTools_1.22.0
[91] survival_2.44-1.1 tibble_2.1.3 memoise_1.1.0
[94] IRanges_2.18.3 ellipsis_0.3.0 `

Your help is really appreciated! Cheers Aimee

data.zip

thierrygosselin commented 4 years ago

Dear Aimee I’ll have a look at the problem this afternoon Thanks for highlighting the bug

Thierry

AvdReis commented 4 years ago

Hi Thierry

I thought I would add this too.

I tried to tidy the Stacks SNPs VCF file and got warnings too...

`tidy_genomic_data("./populations.snps.vcf", strata = "strata.tsv", parallel.core = 1, filter.common.markers = FALSE, filter.monomorphic = FALSE) ################################################################################ ######################### radiator::tidy_genomic_data ########################## ################################################################################ Execution date/time: 20191119@0743 Folder created: 15_radiator_tidy_genomic_20191119@0743 Function call and arguments stored in: radiator_tidy_genomic_data_args_20191119@0743.tsv Analyzing strata file Number of strata: 5 Number of individuals: 87 Importing and tidying the VCF... Execution date@time: 20191119@0743

Reading VCF Data summary: number of samples: 87 number of markers: 2499 done! timing: 15 sec

Generating individual stats... [==================================================] 100%, completed, 0s Generating markers stats... [==================================================] 100%, completed, 0s [==================================================] 100%, completed, 0s

Number of chromosome/contig/scaffold: 1 Number of locus: 1054 Number of markers: 2499 Number of populations: 5 Number of individuals: 87

Number of ind/pop: East_LNI = 18 East_UNI = 18 West_Coast = 19 Chatham_Rise = 15 Auckland_Islands = 17

Number of duplicate id: 0 radiator Genomic Data Structure (GDS) file: radiator_20191119@0743.gds File written: random.seed (596679)

Genotypes formats generated with 2499 SNPs: GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE

Keeping vcf genotypes metadata: yes genotypes metadata: AD, DP, HQ, GL, GQ

Parsing and tidying: AD AD column: splitting into ALLELE_REF_DEPTH and ALLELE_ALT_DEPTH

Parsing and tidying: DP DP column: cleaning and renaming to READ_DEPTH

Parsing and tidying: HQ HQ values are all missing: removing column

Parsing and tidying: GL GL column: cleaning Genotype Likelihood column

Parsing and tidying: GQ GQ column: Genotype Quality

Stacks problem detected missing allele depth info number of genotypes with problem: 1 (prop: 0) correcting problem by adding the read depth info into AD fields...

Calculating REF/ALT alleles... Calibrating REF/ALT alleles... generating REF/ALT dictionary number of REF/ALT switch = 0 integrating genotypes codings... Synchronizing data and strata... Number of strata: 5 Number of individuals: 87 Updating GDS with genotypes.meta values

Tidy data file written: radiator_20191119@0751.rad

Computation time, overall: 454 sec ############################## completed tidy_vcf ############################## Erasing genotype: no Filters parameters file generated: filters_parameters_20191119@0743.tsv Filters parameters file: initiated ################################### RESULTS #################################### Tidy data written in global environment Data format: vcf.file Biallelic data

Tidy genomic data: Number of markers: 2499 Number of chromosome/contig/scaffold: 1 Number of strata: 5 Number of individuals: 87

Computation time, overall: 458 sec ######################### tidy_genomic_data completed ########################## Warning messages: 1: In serialize(data, node$con) : 'package:stats' may not be available when loading 2: In serialize(data, node$con) : 'package:stats' may not be available when loading 3: In serialize(data, node$con) : 'package:stats' may not be available when loading 4: In serialize(data, node$con) : 'package:stats' may not be available when loading 5: In serialize(data, node$con) : 'package:stats' may not be available when loading 6: In serialize(data, node$con) : 'package:stats' may not be available when loading 7: In serialize(data, node$con) : 'package:stats' may not be available when loading 8: In serialize(data, node$con) : 'package:stats' may not be available when loading 9: In serialize(data, node$con) : 'package:stats' may not be available when loading 10: In serialize(data, node$con) : 'package:stats' may not be available when loading`

Cheers Aimee

thierrygosselin commented 4 years ago

Sorry about the long delay I had forgotten about this

thierrygosselin commented 4 years ago

I'm not able to reproduce the problem you are seeing with the latest radiator version. Try it again if this is still relevant. But before reopening the issue, try changing the names of your samples in the Strata file because it's a really really bad way to name samples and this might cause you downstream problems in other packages and software:

Samples id:

01.3
01.4
01.1
01.9
01.11
01.12
01.13
01.14
02.2
02.3
02.4
02.5
02.6
02.7
02.10
03.13
03.14
04.1
04.2
04.3
04.4
04.5
04.6
04.7
04.9
04.12
04.14
04.15
02.12
02.14
02.15
03.2
03.4
03.5
03.10
03.12
WC1.2
257.13
257.14
WC1.3
WC1.6
WC1.8
WC1.10
WC1.14
WC2.1
WC2.2
WC2.3
WC2.5
WC2.6
WC2.7
WC2.8
WC2.11
WC2.13
WC2.14
247.2
247.3
247.4
247.5
247.6
247.11
247.13
257.1
257.2
257.4
257.6
257.11
257.15
269.8
269.9
269.14
269.15
269.20
269.21
269.22
269.26
299.1
299.4
299.5
299.8
299.12
299.15
299.16
299.18
299.20
HALFred
WHOLEred
WHOLEblue
thierrygosselin commented 4 years ago

Try to generate consistent naming scheme: same length and don't mix types. e.g.SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020. Thats just an example.

thierrygosselin commented 4 years ago

You could use a strata file in radiator::read_vcf to generate a GDS object with better names. The strata requires 3 columns: INDIVIDUALS, STRATA, NEW_ID. The NEW_ID column is the new sample names

thierrygosselin commented 4 years ago

An example with your dataset:

strata <- radiator::extract_individuals_vcf(data = "populations.haps.vcf") %>%
  dplyr::mutate(
    SEQ = seq(from = 1L, to = n()),
    SEQ = stringi::stri_pad(str = SEQ, side = "left", width = 3L, pad = 0L),
    STRATA = "POP1",
    NEW_ID = stringi::stri_join("POP-", SEQ),
    SEQ = NULL
  ) %>%
  readr::write_tsv(x = ., path = "strata.example.tsv")