Error Converting VCF to Genepop

alexkrohn commented 1 year ago

I'm encountering a problem that I don't fully understand when trying to convert a VCF to Genepop format (to use in NeEstimator).

Here is the code that I'm running:

library(radiator)
library(tidyverse)

# Import inds from vcf
inds <- extract_individuals_vcf("~/myvcf.vcf")

# Import population names for each ind
pop.names <- read.table("~/pop-file.csv", sep = ",", header=TRUE)

# Create strata df with inds and pop
strata.df <- inds %>%
  mutate(STRATA = pop.names$pop[match(inds$INDIVIDUALS, pop.names$ind)])

# Read vcf with strata file to assign inds to pop
vcf.radiator <- read_vcf("~/myvcf.vcf",
                         strata = strata.df)

# Write genepop file with popnames
write_genepop(vcf.radiator,
filename = "~/genepop.txt")

The pop-file.csv contains two columns: ind with a list of individual names as they appear in the VCF, and pop with their population assignments.

The strata do seem to be properly imported to the VCF. read_vcf() outputs:

################################################################################
############################## radiator::read_vcf ##############################
################################################################################
Execution date@time: 20230331@1006
Folder created: read_vcf_20230331@1006
Function call and arguments stored in: radiator_read_vcf_args_20230331@1006.tsv
File written: random.seed (134852)                                  
✔ Reading VCF [727ms]
Analyzing VCF
VCF source: ipyrad_v.0.9.81
Data is bi-allelic
Synchronizing data and strata...
    Number of strata: 4
    Number of individuals: 104
Reads assembly: reference-assisted

Cleaning VCF
Filters parameters file generated: filters_parameters_20230331@1006.tsv

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2 / 0 / 9

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2118 / 0 / 3948

Preparing output files...
File written: whitelist.markers.tsv                                 
File written: blacklist.markers.tsv                                 
File written: strata.filtered.tsv                                   

VCF statistics per individuals and markers
Generating statistics
✔ SNPs per locus [17ms]
✔ SNP position on the read [32ms]
✔ Missing genotypes [84ms]
✔ Heterozygosity [141ms]
ℹ MAC

Caution: More than 2 alleles detected in the dataset
File written: markers_number_alleles_problem.tsv                    
✔ MAC [78ms]
✔ Coverage ... [378ms]
✔ Generating figures [1.4s]
✔ Writing files [16ms]                                              
################################### SUMMARY ####################################

VCF summary
Missing data: 
    markers: 0.19
    individuals: 0.19

Coverage info:
    individuals mean total coverage: 97246
    individuals mean genotype coverage: 20
    markers mean coverage: 20

VCF info:
Number of chromosome/contig/scaffold: 2127
Number of locus: 1
Number of markers: 4875
Number of strata: 4
Number of individuals: 104

Number of ind/strata:
MS = 44
APP = 24
Other = 28
SUW = 8

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: radiator_20230331@1006.gds

Computation time, overall: 4 sec
############################## completed read_vcf ##############################

However, write_genepop() gives an error saying that the column STRATA doesn't exist. Any idea why this might be? Here's the full write_genepop() error:

LOCUS field empty... adding unique id instead
Calibrating REF/ALT alleles...
✔ Reading data [200ms]
✔ Recoding genotypes... [244ms]
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
✖ Column `STRATA` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.
✖ Preparing data [46ms]

thierrygosselin commented 1 year ago

Dear Alex,

Big reptile fan here... Several at tortoises, turtles and lizards at home.

Try running genomic_converter:

test <- radiator::genomic_converter(data = "myvcf.vcf", strata = strata.df, output = "genepop", filename = "genepop.txt")

You can always send me in private the files to speed up the debugging

Best Thierry

alexkrohn commented 1 year ago

Hi Thierry!

Well I'm a big fan of Montreal, and radiator, so the admiration is mutual!

I tried running genomic_converter, but got an error during the reading step:

...
Genotypes formats generated with 4875 SNPs: 
    GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE
    GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE
    GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE
    GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE
LOCUS field empty... adding unique id instead
DP column: cleaning and renaming to READ_DEPTH
CATG columns: cleaning and renaming C_DEPTH, A_DEPTH, T_DEPTH, G_DEPTH
Error in `dplyr::mutate()`:
ℹ In argument: `ALLELE_REF_DEPTH = dplyr::case_when(...)`.
Caused by error in `dplyr::case_when()`:
! Failed to evaluate the left-hand side of formula 1.
Caused by error:
! object 'REF' not found

Interestingly, read_vcf works fine here. Is there a place to specify that this is a VCF generated from a de novo assembly with ipyrad?

As an aside, I'm working to migrate my scripts away from PGDSpider, so I definitely appreciate the genomic_converter script that you've created.

thierrygosselin commented 1 year ago

I usually deal with stacks, ipyrad vcf specificity automatically Do you mind sharing via email the vcf and pop map I'm not able to reproduce with my ipyrad vcf test files.

thierrygosselin commented 1 year ago

With latest push (v.1.2.8) problem should be fixed

thierrygosselin commented 1 year ago

Depending on what you intend to do, several route to get the genepop file and all should work:

data <- radiator::read_vcf(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv")
radiator::write_genepop(data = data)

or this

data.tidy <- radiator::tidy_vcf(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv")
radiator::write_genepop(data = data.tidy)

or this:

test <- radiator::genomic_converter(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv", output = "genepop")

thierrygosselin commented 1 year ago

But since you want to work with NeEstimator and I assume your data is filtered, I did a couple checks. Assuming your stratification was ok, when you read the VCF file you should have this:

################################################################################
############################## radiator::read_vcf ##############################
################################################################################
Execution date@time: 20230403@1705
Folder created: read_vcf_20230403@1705
Function call and arguments stored in: radiator_read_vcf_args_20230403@1705.tsv
File written: random.seed (175401)                                  
✔ Reading VCF [4.3s]
Analyzing VCF
VCF source: ipyrad_v.0.9.81
Data is bi-allelic
Synchronizing data and strata...                                                                         
    Number of strata: 4
    Number of individuals: 104
Reads assembly: reference-assisted

Cleaning VCF
Filters parameters file generated: filters_parameters_20230403@1705.tsv

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2 / 0 / 9

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2118 / 0 / 3948

Preparing output files...
File written: whitelist.markers.tsv                                 
File written: blacklist.markers.tsv                                 
File written: strata.filtered.tsv                                   

VCF statistics per individuals and markers
Generating statistics
✔ SNPs per locus [41ms]
✔ SNP position on the read [80ms]
✔ Missing genotypes [310ms]
✔ Heterozygosity [432ms]
ℹ MAC

Caution: More than 2 alleles detected in the dataset
File written: markers_number_alleles_problem.tsv                    
✔ MAC [296ms]
✔ Coverage ... [1.3s]
✔ Generating figures [3.1s]
✔ Writing files [34ms]                                              
################################### SUMMARY ####################################

VCF summary
Missing data: 
    markers: 0.19
    individuals: 0.19

Coverage info:
    individuals mean total coverage: 97246
    individuals mean genotype coverage: 20
    markers mean coverage: 20

VCF info:
Number of chromosome/contig/scaffold: 2127
Number of locus: 1
Number of markers: 4875
Number of strata: 4
Number of individuals: 104

Number of ind/strata:
MS = 44
APP = 24
Other = 28
SUW = 8

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: radiator_20230403@1705.gds

Computation time, overall: 18 sec
############################## completed read_vcf ##############################

You see that 9 SNPs were removed because it was considered monomorphic. This usually happens when you have low coverage and remove samples. The program you used didn't check reference/alternate alleles and stuff like that

radiator also automatically removed the markers not in common between your population/strata, it's quite a few 45% so I would check the filters you did before generating that VCF.

You can check the folder: 03_filter_common_markers look for the upsetR plot (png file) you will see that it's your SUW pop driving down the SNP number.

thierrygosselin commented 1 year ago

2 more checks I did...because of the red flags I got from looking at the read_vcf folder

duplicate samples

dup <- radiator::detect_duplicate_genomes(data = data)

the first one looks for duplicates. Technical ones forgotten or real ones generated by lab problems. Answer no during the function run. You don't want to blacklist samples at the first run. You want to check the figures it will generate.

When you look at the Manhattan plot. Usually samples < 0.25 are considered duplicate or I look really hard into the pair of samples. They are not that much, usually.

And depending on projects, the pair of samples around the 0.5 mark are usually considered close kin... and if you run colony or Kinference it will highlight the relationship deeper.

If you have dots that are grey, it means that the samples pair are from different strata/pop. Usually not good if it's < or around 0.25. I would use the file individuals.pairwise.dist.tsv open in excel and look at the samples...

Conclusion:

very curious what's the species
you seem to have a problem with duplicates or it's a voluntary one (= technical replicates)
I wouldn't do a run NeEstimator with this dataset...

thierrygosselin commented 1 year ago

mixed samples...

mix <- radiator::detect_mixed_genomes(data = data)

Inspect plots and tables in folder created...
    Do you want to exclude individuals based on heterozygosity ? (y/n):

Answer no to this.

Have a look at the boxplot and Manhattan figures. You will understand why the previous analysis generated false-closekin ...

From the Manhattan plot I would say that you have a missing data problem:

it's generating an individual heterozygosity problem
probably that this is because you have varying DNA quality (bad samples)
or that several different people contributed to the DNA extraction or other wet-lab steps...
SUW are all outliers in terms of missing data
APP is very different as well and in this strata you have an outlier maybe even a different specie
the missingness and heterozygosity problems are driving the duplicates ... and false close kin
so this will also generate all kind of bias in genetic analysis
be careful
you can write by email if you want more explanations...

alexkrohn commented 1 year ago

Thanks for your detailed explanation, Thierry. I really appreciate it.

This dataset is from Alligator Snapping Turtles, and the three strata represent different species. Even the among-population divergence is quite strong in these species (FST ~ 0.4), but the among-species divergence is higher than I've ever seen (FST ~ 0.8, despite only being separated by ~50 km).

It's not surprising that there are lots of missing sites among the species. I came to this conclusion separately, but it is great to know how easy it can be done in radiator.

I was running this dataset through NeEstimator just out of curiosity to see if any of these species might meet the proposed IUCN cutoff for low genetic diversity (Ne << 50). I knew it wasn't the best way to calculate Ne, but I appreciate your feedback nonetheless. Thanks for helping me get radiator working with this vcf.

thierrygosselin commented 1 year ago

Different species make sense. MATE094 and MATE263 are probably not in the good strata. But even that, the duplicate analysis shows problematic close kin relationship. If it's really different species, I strongly suggest filtering separately before running NeEstimator or LDNe.

Good Luck with your analysis

alexkrohn commented 1 year ago

Thanks for your help!

On Fri, Apr 7, 2023 at 9:20 AM Thierry Gosselin @.***> wrote:

Different species make sense. MATE094 and MATE263 are probably not in the good strata. But even that, the duplicate analysis shows problematic close kin relationship. If it's really different species, I strongly suggest filtering separately before running NeEstimator or LDNe.

Good Luck with your analysis

— Reply to this email directly, view it on GitHub https://github.com/thierrygosselin/radiator/issues/177#issuecomment-1500283625, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAEIHCPLQJMB7TPXVSVONDXAAIA5ANCNFSM6AAAAAAWOWZ3LI . You are receiving this because you authored the thread.Message ID: @.***>

alexkrohn commented 1 year ago

Hey @thierrygosselin. Still having the same problem with genomic_converter() runing v.1.2.8. I'm using a different VCF output from ipyrad this time, but the same species. Here's my sessionInfo()

R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin22.3.0 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.2.3/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] radiator_1.2.8      googlesheets4_1.1.0 hierfstat_0.5-11    adegenet_2.1.10     ade4_1.7-22         vcfR_1.14.0        
 [7] lubridate_1.9.2     forcats_1.0.0       stringr_1.5.0       dplyr_1.1.2         purrr_1.0.1         readr_2.1.4        
[13] tidyr_1.3.0         tibble_3.2.1        ggplot2_3.4.2       tidyverse_2.0.0     doParallel_1.0.17   iterators_1.0.14   
[19] foreach_1.5.2      

loaded via a namespace (and not attached):
  [1] googledrive_2.1.0      colorspace_2.1-0       seqinr_4.2-30          ellipsis_0.3.2         rprojroot_2.0.3       
  [6] XVector_0.38.0         GenomicRanges_1.50.2   fs_1.6.1               rstudioapi_0.14        farver_2.1.1          
 [11] remotes_2.4.2          ggrepel_0.9.3          bit64_4.0.5            fansi_1.0.4            codetools_0.2-19      
 [16] splines_4.2.3          cachem_1.0.7           memuse_4.2-3           pkgload_1.3.2          jsonlite_1.8.4        
 [21] cluster_2.1.4          shiny_1.7.4            compiler_4.2.3         httr_1.4.5             Matrix_1.5-4          
 [26] fastmap_1.1.1          gargle_1.3.0           cli_3.6.1              later_1.3.0            prettyunits_1.1.1     
 [31] htmltools_0.5.5        tools_4.2.3            igraph_1.4.2           gtable_0.3.3           glue_1.6.2            
 [36] GenomeInfoDbData_1.2.9 reshape2_1.4.4         rappdirs_0.3.3         Rcpp_1.0.10            cellranger_1.1.0      
 [41] vctrs_0.6.2            Biostrings_2.66.0      ape_5.7-1              nlme_3.1-162           pinfsc50_1.2.0        
 [46] ps_1.7.4               miniUI_0.1.1.1         timechange_0.2.0       mime_0.12              lifecycle_1.0.3       
 [51] devtools_2.4.5         MASS_7.3-58.3          zlibbioc_1.44.0        scales_1.2.1           vroom_1.6.1           
 [56] ragg_1.2.5             hms_1.1.3              promises_1.2.0.1       gdsfmt_1.34.1          rematch2_2.1.2        
 [61] curl_5.0.0             memoise_2.0.1          gridExtra_2.3          UpSetR_1.4.0           stringi_1.7.12        
 [66] paletteer_1.5.0        desc_1.4.2             S4Vectors_0.36.2       permute_0.9-7          BiocGenerics_0.44.0   
 [71] pkgbuild_1.4.0         GenomeInfoDb_1.34.9    rlang_1.1.1            pkgconfig_2.0.3        systemfonts_1.0.4     
 [76] bitops_1.0-7           matrixStats_0.63.0     SeqArray_1.38.0        lattice_0.21-8         htmlwidgets_1.6.2     
 [81] labeling_0.4.2         processx_3.8.0         bit_4.0.5              tidyselect_1.2.0       plyr_1.8.8            
 [86] magrittr_2.0.3         R6_2.5.1               profvis_0.3.7          IRanges_2.32.0         generics_0.1.3        
 [91] pillar_1.9.0           withr_2.5.0            mgcv_1.8-42            RCurl_1.98-1.12        crayon_1.5.2          
 [96] utf8_1.2.3             urlchecker_1.0.1       tzdb_0.3.0             usethis_2.1.6          grid_4.2.3            
[101] data.table_1.14.8      callr_3.7.3            vegan_2.6-4            digest_0.6.31          xtable_1.8-4          
[106] httpuv_1.6.9           textshaping_0.3.6      openssl_2.0.6          stats4_4.2.3           munsell_0.5.0         
[111] viridisLite_0.4.2      sessioninfo_1.2.2      askpass_1.1

thierrygosselin / radiator

Error Converting VCF to Genepop #177