Closed AvdReis closed 4 years ago
Dear Aimee I’ll have a look at the problem this afternoon Thanks for highlighting the bug
Thierry
Hi Thierry
I thought I would add this too.
I tried to tidy the Stacks SNPs VCF file and got warnings too...
`tidy_genomic_data("./populations.snps.vcf", strata = "strata.tsv", parallel.core = 1, filter.common.markers = FALSE, filter.monomorphic = FALSE) ################################################################################ ######################### radiator::tidy_genomic_data ########################## ################################################################################ Execution date/time: 20191119@0743 Folder created: 15_radiator_tidy_genomic_20191119@0743 Function call and arguments stored in: radiator_tidy_genomic_data_args_20191119@0743.tsv Analyzing strata file Number of strata: 5 Number of individuals: 87 Importing and tidying the VCF... Execution date@time: 20191119@0743
Reading VCF Data summary: number of samples: 87 number of markers: 2499 done! timing: 15 sec
Generating individual stats... [==================================================] 100%, completed, 0s Generating markers stats... [==================================================] 100%, completed, 0s [==================================================] 100%, completed, 0s
Number of chromosome/contig/scaffold: 1 Number of locus: 1054 Number of markers: 2499 Number of populations: 5 Number of individuals: 87
Number of ind/pop: East_LNI = 18 East_UNI = 18 West_Coast = 19 Chatham_Rise = 15 Auckland_Islands = 17
Number of duplicate id: 0 radiator Genomic Data Structure (GDS) file: radiator_20191119@0743.gds File written: random.seed (596679)
Genotypes formats generated with 2499 SNPs: GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE
Keeping vcf genotypes metadata: yes genotypes metadata: AD, DP, HQ, GL, GQ
Parsing and tidying: AD AD column: splitting into ALLELE_REF_DEPTH and ALLELE_ALT_DEPTH
Parsing and tidying: DP DP column: cleaning and renaming to READ_DEPTH
Parsing and tidying: HQ HQ values are all missing: removing column
Parsing and tidying: GL GL column: cleaning Genotype Likelihood column
Parsing and tidying: GQ GQ column: Genotype Quality
Stacks problem detected missing allele depth info number of genotypes with problem: 1 (prop: 0) correcting problem by adding the read depth info into AD fields...
Calculating REF/ALT alleles... Calibrating REF/ALT alleles... generating REF/ALT dictionary number of REF/ALT switch = 0 integrating genotypes codings... Synchronizing data and strata... Number of strata: 5 Number of individuals: 87 Updating GDS with genotypes.meta values
Tidy data file written: radiator_20191119@0751.rad
Computation time, overall: 454 sec ############################## completed tidy_vcf ############################## Erasing genotype: no Filters parameters file generated: filters_parameters_20191119@0743.tsv Filters parameters file: initiated ################################### RESULTS #################################### Tidy data written in global environment Data format: vcf.file Biallelic data
Tidy genomic data: Number of markers: 2499 Number of chromosome/contig/scaffold: 1 Number of strata: 5 Number of individuals: 87
Computation time, overall: 458 sec ######################### tidy_genomic_data completed ########################## Warning messages: 1: In serialize(data, node$con) : 'package:stats' may not be available when loading 2: In serialize(data, node$con) : 'package:stats' may not be available when loading 3: In serialize(data, node$con) : 'package:stats' may not be available when loading 4: In serialize(data, node$con) : 'package:stats' may not be available when loading 5: In serialize(data, node$con) : 'package:stats' may not be available when loading 6: In serialize(data, node$con) : 'package:stats' may not be available when loading 7: In serialize(data, node$con) : 'package:stats' may not be available when loading 8: In serialize(data, node$con) : 'package:stats' may not be available when loading 9: In serialize(data, node$con) : 'package:stats' may not be available when loading 10: In serialize(data, node$con) : 'package:stats' may not be available when loading`
Cheers Aimee
Sorry about the long delay I had forgotten about this
I'm not able to reproduce the problem you are seeing with the latest radiator version. Try it again if this is still relevant. But before reopening the issue, try changing the names of your samples in the Strata file because it's a really really bad way to name samples and this might cause you downstream problems in other packages and software:
Samples id:
01.3
01.4
01.1
01.9
01.11
01.12
01.13
01.14
02.2
02.3
02.4
02.5
02.6
02.7
02.10
03.13
03.14
04.1
04.2
04.3
04.4
04.5
04.6
04.7
04.9
04.12
04.14
04.15
02.12
02.14
02.15
03.2
03.4
03.5
03.10
03.12
WC1.2
257.13
257.14
WC1.3
WC1.6
WC1.8
WC1.10
WC1.14
WC2.1
WC2.2
WC2.3
WC2.5
WC2.6
WC2.7
WC2.8
WC2.11
WC2.13
WC2.14
247.2
247.3
247.4
247.5
247.6
247.11
247.13
257.1
257.2
257.4
257.6
257.11
257.15
269.8
269.9
269.14
269.15
269.20
269.21
269.22
269.26
299.1
299.4
299.5
299.8
299.12
299.15
299.16
299.18
299.20
HALFred
WHOLEred
WHOLEblue
Try to generate consistent naming scheme: same length and don't mix types. e.g.SPECIES-POPULATION-MATURITY-YEAR-ID = CHI-QUE-ADU-2014-020. Thats just an example.
You could use a strata file in radiator::read_vcf
to generate a GDS object with better names.
The strata requires 3 columns: INDIVIDUALS, STRATA, NEW_ID
.
The NEW_ID
column is the new sample names
An example with your dataset:
strata <- radiator::extract_individuals_vcf(data = "populations.haps.vcf") %>%
dplyr::mutate(
SEQ = seq(from = 1L, to = n()),
SEQ = stringi::stri_pad(str = SEQ, side = "left", width = 3L, pad = 0L),
STRATA = "POP1",
NEW_ID = stringi::stri_join("POP-", SEQ),
SEQ = NULL
) %>%
readr::write_tsv(x = ., path = "strata.example.tsv")
Hi Thierry
I think the issue I am having is similar to https://github.com/thierrygosselin/radiator/issues/11
I am trying to tidy my Stacks haplotype VCF file and it is resulting in warnings.
`tidy_genomic_data("./populations.haps.vcf", strata = "strata.tsv", parallel.core = 1, filter.common.markers = FALSE, filter.monomorphic = FALSE) ################################################################################ ######################### radiator::tidy_genomic_data ########################## ################################################################################ Execution date/time: 20191119@0128 Folder created: 13_radiator_tidy_genomic_20191119@0128 Function call and arguments stored in: radiator_tidy_genomic_data_args_20191119@0128.tsv Analyzing strata file Number of strata: 5 Number of individuals: 87 Importing and tidying the VCF... Execution date@time: 20191119@0128
Reading VCF Data summary: number of samples: 87 number of markers: 1054 done! timing: 8 sec
Generating individual stats... [==================================================] 100%, completed, 0s Generating markers stats... [==================================================] 100%, completed, 0s [==================================================] 100%, completed, 0s
Number of chromosome/contig/scaffold: 1 Number of locus: 1054 Number of markers: 1054 Number of populations: 5 Number of individuals: 87
Number of ind/pop: East_LNI = 18 East_UNI = 18 WC = 19 CR = 15 AI = 17
Number of duplicate id: 0 radiator Genomic Data Structure (GDS) file: radiator_20191119@0128.gds File written: random.seed (171765)
Genotypes formats generated with 1054 SNPs: GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE
Keeping vcf genotypes metadata: yes genotypes metadata:
Calculating REF/ALT alleles... Calibrating REF/ALT alleles... generating REF/ALT dictionary number of REF/ALT switch = 39 integrating genotypes codings... Synchronizing data and strata... Number of strata: 5 Number of individuals: 87 Updating GDS with genotypes.meta values
Tidy data file written: radiator_20191119@0131.rad
Computation time, overall: 162 sec ############################## completed tidy_vcf ############################## Erasing genotype: no Filters parameters file generated: filters_parameters_20191119@0128.tsv Filters parameters file: initiated ################################### RESULTS #################################### Tidy data written in global environment Data format: vcf.file Multiallelic data
Tidy genomic data: Number of markers: 1054 Number of chromosome/contig/scaffold: 1 Number of strata: 5 Number of individuals: 87
Computation time, overall: 164 sec ######################### tidy_genomic_data completed ########################## There were 21 warnings (use warnings() to see them)
Session info - I have just updated R to update packages as I thought that may have been a problem, but it only got rid of 6 warnings...
`R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18362)
Matrix products: default
Random number generation: RNG: Mersenne-Twister Normal: Inversion Sample: Rounding
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] radiator_1.1.2
loaded via a namespace (and not attached): [1] nlme_3.1-140 bitops_1.0-6 fs_1.3.1
[4] usethis_1.5.1 devtools_2.2.1 rprojroot_1.3-2
[7] GenomeInfoDb_1.20.0 tools_3.6.1 backports_1.1.5
[10] R6_2.4.1 rpart_4.1-15 BiocGenerics_0.30.0
[13] lazyeval_0.2.2 mgcv_1.8-28 colorspace_1.4-1
[16] jomo_2.6-10 nnet_7.3-12 withr_2.1.2
[19] tidyselect_0.2.5 prettyunits_1.0.2 processx_3.4.1
[22] curl_4.2 compiler_3.6.1 cli_1.1.0
[25] Biobase_2.44.0 mice_3.6.0 desc_1.2.0
[28] labeling_0.3 scales_1.0.0 readr_1.3.1
[31] callr_3.3.2 stringr_1.4.0 digest_0.6.22
[34] minqa_1.2.4 GWASExactHW_1.01 XVector_0.24.0
[37] pkgconfig_2.0.3 fst_0.9.0 lme4_1.1-21
[40] sessioninfo_1.1.1 rlang_0.4.1 rstudioapi_0.10
[43] generics_0.0.2 dplyr_0.8.3 RCurl_1.95-4.12
[46] magrittr_1.5 GenomeInfoDbData_1.2.1 Matrix_1.2-17
[49] Rcpp_1.0.3 munsell_0.5.0 S4Vectors_0.22.1
[52] lifecycle_0.1.0 stringi_1.4.3 MASS_7.3-51.4
[55] zlibbioc_1.30.0 plyr_1.8.4 pkgbuild_1.0.6
[58] grid_3.6.1 parallel_3.6.1 mitml_0.3-7
[61] crayon_1.3.4 lattice_0.20-38 Biostrings_2.52.0
[64] splines_3.6.1 hms_0.5.2 zeallot_0.1.0
[67] ps_1.3.0 pillar_1.4.2 GenomicRanges_1.36.1
[70] boot_1.3-22 logistf_1.23 reshape2_1.4.3
[73] gdsfmt_1.20.0 stats4_3.6.1 pkgload_1.0.2
[76] pan_1.6 glue_1.3.1 data.table_1.12.6
[79] remotes_2.1.0 vctrs_0.2.0 nloptr_1.2.1
[82] testthat_2.3.0 gtable_0.3.0 purrr_0.3.3
[85] tidyr_1.0.0 SeqArray_1.24.2 assertthat_0.2.1
[88] ggplot2_3.2.1 broom_0.5.2 SeqVarTools_1.22.0
[91] survival_2.44-1.1 tibble_2.1.3 memoise_1.1.0
[94] IRanges_2.18.3 ellipsis_0.3.0 `
Your help is really appreciated! Cheers Aimee
data.zip