knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

read.vcfR() not recognizing missing data #207

Closed loukiaspi closed 1 year ago

loukiaspi commented 1 year ago

Hi Brian,

I have been having an issue with how vcfR::read.vcfR handles the missing data in my files and I don't know if it's an ipyrad issue, a me issue, or an issue of the function.

The issue that I am facing is that read.vcfR doesn't seem to read correctly my missing data and the result is that it always tells me that I have "0 percent missing data", even though I know this to not be true by visually inspecting my .vcf. By playing around for a bit I realized that the issue lies somewhere within the way FORMAT is specified in my file but I don't know how to circumvent this. In my file FORMAT is specified as such: GT:DP:CATG and a locus with missing data would be indicated like this: ./.:0:0,0,0,0 . When it's written like this, read.vcfR reads it an not missing, but when I manually edit my missing genotype to be like this: ./. , skipping all the zeroes, it manages to read it as a missing. Is there a way to ask/make read.vcfR to recognize ./.:0:0,0,0,0 as a missing locus?

I am not at liberty to share my full .vcf file, but I have trimmed it so that it now contains just two samples and 10 loci and I am attaching it so that you can see if you can reproduce my issue. I am also attaching a second file in which I have manually edited the missing genotypes so that read.vcfR can read them as such. I have left the @meta part of the file intact so that you can see the assembly method, software etc. diamina_reduced.txt diamina_reduced_ManualEditing.txt

I am using the read.vcfR function as follows: my.vcf <- vcfR::read.vcfR("diamina_reduced.vcf", verbose = TRUE, convertNA = TRUE)

Finally, this is what my sessionInfo() looks like:

sessionInfo() R version 4.2.2 (2022-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale: [1] LC_COLLATE=English_World.utf8 LC_CTYPE=English_World.utf8 LC_MONETARY=English_World.utf8 [4] LC_NUMERIC=C LC_TIME=English_World.utf8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] SNPfiltR_1.0.0 vcfR_1.13.0

loaded via a namespace (and not attached): [1] Rcpp_1.0.9 ape_5.6-2 lattice_0.20-45 deldir_1.0-6 png_0.1-8 assertthat_0.2.1
[7] digest_0.6.31 utf8_1.2.2 mime_0.12 R6_2.5.1 plyr_1.8.8 gdsfmt_1.34.0
[13] evaluate_0.19 ggplot2_3.4.0 pillar_1.8.1 adegraphics_1.0-16 rlang_1.0.6 adegenet_2.1.9
[19] rstudioapi_0.14 vegan_2.6-4 Matrix_1.5-3 rmarkdown_2.19 splines_4.2.2 pinfsc50_1.2.0
[25] stringr_1.5.0 igraph_1.3.5 munsell_0.5.0 shiny_1.7.4 xfun_0.36 compiler_4.2.2
[31] httpuv_1.6.8 pkgconfig_2.0.3 mgcv_1.8-41 htmltools_0.5.4 tidyselect_1.2.0 tibble_3.1.8
[37] memuse_4.2-2 fansi_1.0.3 permute_0.9-7 viridisLite_0.4.1 dplyr_1.0.10 later_1.3.0
[43] MASS_7.3-58.1 grid_4.2.2 nlme_3.1-161 xtable_1.8-4 gtable_0.3.1 lifecycle_1.0.3
[49] DBI_1.1.3 magrittr_2.0.3 scales_1.2.1 KernSmooth_2.23-20 cli_3.6.0 stringi_1.7.12
[55] reshape2_1.4.4 promises_1.2.0.1 sp_1.5-1 SNPRelate_1.32.1 latticeExtra_0.6-30 seqinr_4.2-23
[61] ellipsis_0.3.2 generics_0.1.3 vctrs_0.5.1 RColorBrewer_1.1-3 tools_4.2.2 ade4_1.7-20
[67] interp_1.1-3 glue_1.6.2 jpeg_0.1-10 parallel_4.2.2 fastmap_1.1.0 yaml_2.3.6
[73] colorspace_2.0-3 cluster_2.1.4 knitr_1.41

Thanks in advance for your time and let me know if you need more info from me

knausb commented 1 year ago

Hi @loukiaspi , this is not a 'you' issue. I interpret this as a VCF complexity. I've placed some documentation at the link below.

https://knausb.github.io/vcfR_documentation/missing_data.html

In VCF data the genotype typically is reported with associated information as a single string. Your example uses './.:0:0,0,0,0' which includes a missing genotype as './.' but there are also zeros associated with this missing genotype. Zeros are not missing data but are hopefully there to indicate zero of something. The vcfR 'show' method is intended to be a quick view of the data and will only recognize data which only includes the missing genotype. In order to get a better perspective on the number of missing genotypes you'll need to use 'extract.gt()' in order to separate the genotypes from the rest of the information in each string. This should convert the VCF missing data indicators (.) to R missing data (NA). This comes with a performance cost, which is part of the reason for not including this step in the 'show' method. Does this provide you with the information you need?

loukiaspi commented 1 year ago

Thanks for the super quick and detailed response! I read through the documentation and I now understand how 'read.vcfR()' works in regards to missingness and how to solve my issue with the 'extract.gt()' function, hence I am closing the issue

J-Moravec commented 9 months ago

This IMHO should be documented in the function's output. I encountered this issue and was about to submit PR.