Closed alexkrohn closed 1 year ago
Dear Alex,
Big reptile fan here... Several at tortoises, turtles and lizards at home.
Try running genomic_converter
:
test <- radiator::genomic_converter(data = "myvcf.vcf", strata = strata.df, output = "genepop", filename = "genepop.txt")
You can always send me in private the files to speed up the debugging
Best Thierry
Hi Thierry!
Well I'm a big fan of Montreal, and radiator, so the admiration is mutual!
I tried running genomic_converter
, but got an error during the reading step:
...
Genotypes formats generated with 4875 SNPs:
GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE
GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE
GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE
GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE
LOCUS field empty... adding unique id instead
DP column: cleaning and renaming to READ_DEPTH
CATG columns: cleaning and renaming C_DEPTH, A_DEPTH, T_DEPTH, G_DEPTH
Error in `dplyr::mutate()`:
ℹ In argument: `ALLELE_REF_DEPTH = dplyr::case_when(...)`.
Caused by error in `dplyr::case_when()`:
! Failed to evaluate the left-hand side of formula 1.
Caused by error:
! object 'REF' not found
Interestingly, read_vcf
works fine here. Is there a place to specify that this is a VCF generated from a de novo assembly with ipyrad?
As an aside, I'm working to migrate my scripts away from PGDSpider, so I definitely appreciate the genomic_converter
script that you've created.
I usually deal with stacks, ipyrad vcf specificity automatically Do you mind sharing via email the vcf and pop map I'm not able to reproduce with my ipyrad vcf test files.
With latest push (v.1.2.8) problem should be fixed
Depending on what you intend to do, several route to get the genepop file and all should work:
data <- radiator::read_vcf(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv")
radiator::write_genepop(data = data)
or this
data.tidy <- radiator::tidy_vcf(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv")
radiator::write_genepop(data = data.tidy)
or this:
test <- radiator::genomic_converter(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv", output = "genepop")
But since you want to work with NeEstimator and I assume your data is filtered, I did a couple checks. Assuming your stratification was ok, when you read the VCF file you should have this:
################################################################################
############################## radiator::read_vcf ##############################
################################################################################
Execution date@time: 20230403@1705
Folder created: read_vcf_20230403@1705
Function call and arguments stored in: radiator_read_vcf_args_20230403@1705.tsv
File written: random.seed (175401)
✔ Reading VCF [4.3s]
Analyzing VCF
VCF source: ipyrad_v.0.9.81
Data is bi-allelic
Synchronizing data and strata...
Number of strata: 4
Number of individuals: 104
Reads assembly: reference-assisted
Cleaning VCF
Filters parameters file generated: filters_parameters_20230403@1705.tsv
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 2 / 0 / 9
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Blacklisted: 0 / 0 / 2118 / 0 / 3948
Preparing output files...
File written: whitelist.markers.tsv
File written: blacklist.markers.tsv
File written: strata.filtered.tsv
VCF statistics per individuals and markers
Generating statistics
✔ SNPs per locus [41ms]
✔ SNP position on the read [80ms]
✔ Missing genotypes [310ms]
✔ Heterozygosity [432ms]
ℹ MAC
Caution: More than 2 alleles detected in the dataset
File written: markers_number_alleles_problem.tsv
✔ MAC [296ms]
✔ Coverage ... [1.3s]
✔ Generating figures [3.1s]
✔ Writing files [34ms]
################################### SUMMARY ####################################
VCF summary
Missing data:
markers: 0.19
individuals: 0.19
Coverage info:
individuals mean total coverage: 97246
individuals mean genotype coverage: 20
markers mean coverage: 20
VCF info:
Number of chromosome/contig/scaffold: 2127
Number of locus: 1
Number of markers: 4875
Number of strata: 4
Number of individuals: 104
Number of ind/strata:
MS = 44
APP = 24
Other = 28
SUW = 8
Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: radiator_20230403@1705.gds
Computation time, overall: 18 sec
############################## completed read_vcf ##############################
You see that 9 SNPs were removed because it was considered monomorphic. This usually happens when you have low coverage and remove samples. The program you used didn't check reference/alternate alleles and stuff like that
radiator also automatically removed the markers not in common between your population/strata, it's quite a few 45% so I would check the filters you did before generating that VCF.
You can check the folder: 03_filter_common_markers
look for the upsetR plot (png file) you will see that it's your SUW pop driving down the SNP number.
2 more checks I did...because of the red flags I got from looking at the read_vcf folder
dup <- radiator::detect_duplicate_genomes(data = data)
the first one looks for duplicates. Technical ones forgotten or real ones generated by lab problems.
Answer no
during the function run. You don't want to blacklist samples at the first run. You want to check the figures it will generate.
When you look at the Manhattan plot. Usually samples < 0.25 are considered duplicate or I look really hard into the pair of samples. They are not that much, usually.
And depending on projects, the pair of samples around the 0.5 mark are usually considered close kin... and if you run colony or Kinference it will highlight the relationship deeper.
If you have dots that are grey, it means that the samples pair are from different strata/pop. Usually not good if it's < or around 0.25. I would use the file individuals.pairwise.dist.tsv
open in excel and look at the samples...
Conclusion:
mix <- radiator::detect_mixed_genomes(data = data)
Inspect plots and tables in folder created...
Do you want to exclude individuals based on heterozygosity ? (y/n):
Answer no to this.
Have a look at the boxplot and Manhattan figures. You will understand why the previous analysis generated false-closekin ...
From the Manhattan plot I would say that you have a missing data problem:
Thanks for your detailed explanation, Thierry. I really appreciate it.
This dataset is from Alligator Snapping Turtles, and the three strata represent different species. Even the among-population divergence is quite strong in these species (FST ~ 0.4), but the among-species divergence is higher than I've ever seen (FST ~ 0.8, despite only being separated by ~50 km).
It's not surprising that there are lots of missing sites among the species. I came to this conclusion separately, but it is great to know how easy it can be done in radiator.
I was running this dataset through NeEstimator just out of curiosity to see if any of these species might meet the proposed IUCN cutoff for low genetic diversity (Ne << 50). I knew it wasn't the best way to calculate Ne, but I appreciate your feedback nonetheless. Thanks for helping me get radiator working with this vcf.
Different species make sense. MATE094 and MATE263 are probably not in the good strata. But even that, the duplicate analysis shows problematic close kin relationship. If it's really different species, I strongly suggest filtering separately before running NeEstimator or LDNe.
Good Luck with your analysis
Thanks for your help!
On Fri, Apr 7, 2023 at 9:20 AM Thierry Gosselin @.***> wrote:
Different species make sense. MATE094 and MATE263 are probably not in the good strata. But even that, the duplicate analysis shows problematic close kin relationship. If it's really different species, I strongly suggest filtering separately before running NeEstimator or LDNe.
Good Luck with your analysis
— Reply to this email directly, view it on GitHub https://github.com/thierrygosselin/radiator/issues/177#issuecomment-1500283625, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAEIHCPLQJMB7TPXVSVONDXAAIA5ANCNFSM6AAAAAAWOWZ3LI . You are receiving this because you authored the thread.Message ID: @.***>
Hey @thierrygosselin. Still having the same problem with genomic_converter()
runing v.1.2.8. I'm using a different VCF output from ipyrad this time, but the same species. Here's my sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin22.3.0 (64-bit)
Running under: macOS Ventura 13.3.1
Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.2.3/lib/R/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] radiator_1.2.8 googlesheets4_1.1.0 hierfstat_0.5-11 adegenet_2.1.10 ade4_1.7-22 vcfR_1.14.0
[7] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[13] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0 doParallel_1.0.17 iterators_1.0.14
[19] foreach_1.5.2
loaded via a namespace (and not attached):
[1] googledrive_2.1.0 colorspace_2.1-0 seqinr_4.2-30 ellipsis_0.3.2 rprojroot_2.0.3
[6] XVector_0.38.0 GenomicRanges_1.50.2 fs_1.6.1 rstudioapi_0.14 farver_2.1.1
[11] remotes_2.4.2 ggrepel_0.9.3 bit64_4.0.5 fansi_1.0.4 codetools_0.2-19
[16] splines_4.2.3 cachem_1.0.7 memuse_4.2-3 pkgload_1.3.2 jsonlite_1.8.4
[21] cluster_2.1.4 shiny_1.7.4 compiler_4.2.3 httr_1.4.5 Matrix_1.5-4
[26] fastmap_1.1.1 gargle_1.3.0 cli_3.6.1 later_1.3.0 prettyunits_1.1.1
[31] htmltools_0.5.5 tools_4.2.3 igraph_1.4.2 gtable_0.3.3 glue_1.6.2
[36] GenomeInfoDbData_1.2.9 reshape2_1.4.4 rappdirs_0.3.3 Rcpp_1.0.10 cellranger_1.1.0
[41] vctrs_0.6.2 Biostrings_2.66.0 ape_5.7-1 nlme_3.1-162 pinfsc50_1.2.0
[46] ps_1.7.4 miniUI_0.1.1.1 timechange_0.2.0 mime_0.12 lifecycle_1.0.3
[51] devtools_2.4.5 MASS_7.3-58.3 zlibbioc_1.44.0 scales_1.2.1 vroom_1.6.1
[56] ragg_1.2.5 hms_1.1.3 promises_1.2.0.1 gdsfmt_1.34.1 rematch2_2.1.2
[61] curl_5.0.0 memoise_2.0.1 gridExtra_2.3 UpSetR_1.4.0 stringi_1.7.12
[66] paletteer_1.5.0 desc_1.4.2 S4Vectors_0.36.2 permute_0.9-7 BiocGenerics_0.44.0
[71] pkgbuild_1.4.0 GenomeInfoDb_1.34.9 rlang_1.1.1 pkgconfig_2.0.3 systemfonts_1.0.4
[76] bitops_1.0-7 matrixStats_0.63.0 SeqArray_1.38.0 lattice_0.21-8 htmlwidgets_1.6.2
[81] labeling_0.4.2 processx_3.8.0 bit_4.0.5 tidyselect_1.2.0 plyr_1.8.8
[86] magrittr_2.0.3 R6_2.5.1 profvis_0.3.7 IRanges_2.32.0 generics_0.1.3
[91] pillar_1.9.0 withr_2.5.0 mgcv_1.8-42 RCurl_1.98-1.12 crayon_1.5.2
[96] utf8_1.2.3 urlchecker_1.0.1 tzdb_0.3.0 usethis_2.1.6 grid_4.2.3
[101] data.table_1.14.8 callr_3.7.3 vegan_2.6-4 digest_0.6.31 xtable_1.8-4
[106] httpuv_1.6.9 textshaping_0.3.6 openssl_2.0.6 stats4_4.2.3 munsell_0.5.0
[111] viridisLite_0.4.2 sessioninfo_1.2.2 askpass_1.1
I'm encountering a problem that I don't fully understand when trying to convert a VCF to Genepop format (to use in NeEstimator).
Here is the code that I'm running:
The
pop-file.csv
contains two columns:ind
with a list of individual names as they appear in the VCF, andpop
with their population assignments.The strata do seem to be properly imported to the VCF.
read_vcf()
outputs:However,
write_genepop()
gives an error saying that the column STRATA doesn't exist. Any idea why this might be? Here's the fullwrite_genepop()
error: