knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

vcfR2genlight failed to convert polyploid of mixed ploidy vcf (obtained by GATK) #158

Open lihe-s opened 4 years ago

lihe-s commented 4 years ago

@knausb hope this make the issue clear

R version 3.6.2 (2019-12-12) on HPC clusters require(vcfR)

vcf <- read.vcfR("di-tetra-hex-test.vcf.gz", verbose = FALSE)

conversion

x <- vcfR2genlight(vcf)

Warning messages: 1: In vcfR2genlight(vcf) : Found 2028 loci with more than two alleles. Objects of class genlight only support loci with two alleles. 2028 loci will be omitted from the genlight object. 2: In initialize(value, ...) : NAs introduced by coercion 3: In initialize(value, ...) : NAs introduced by coercion

ploidy level, the BSJ_1 should be hexaploid(6), BSJ_4 should be tetraploid(4)

ploidy(x) BSJ_1 BSJ_2 BSJ_3 BSJ_4 BSJ_5 BSJ_6 NA 2 2 NA 2 2

part of genotype data

vcf@gt[1,] FORMAT BSJ_1 BSJ_2 BSJ_3 BSJ_4 BSJ_5 BSJ_6 "GT:AD:DP:GQ:PL" "0/0/0/0/0/0:13,0:13:10:0,10,23,39,62,101,543" "0/0:3,0:3:9:0,9,113" "0/0:5,0:5:15:0,15,197" "0/0/1/1:8,13:21:4:508,23,0,4,339" "0/0:4,0:4:12:0,12,167" "./.:0,0:0:.:0,0,0"

after conversion

t(as.matrix(x))[c(1:50), 1:3] BSJ_1 BSJ_2 BSJ_3 chr01_1339 NA 0 0 chr01_1405 NA 2 2 chr01_1450 NA 0 0 chr01_3565 NA 2 2 chr01_3610 NA 2 2 chr01_3658 NA 0 0 chr01_3691 NA 0 0 chr01_3709 NA 0 0 chr01_3748 NA 0 0 chr01_3760 NA 0 0 chr01_3913 NA NA NA chr01_4054 NA NA NA chr01_4153 NA NA NA chr01_4189 NA NA NA chr01_4195 NA NA NA chr01_4198 NA NA NA

It looks like the vcfR2genlight can not convert polyploid samples. This is not a rare issue #128 #117 (also a recent paper https://doi.org/10.1038/s41559-019-0807-4). Please consider to help me figure it out. Thanks.

Sorry for the irregular format.

My real world test data will send to you.

lihe-s commented 4 years ago

@knausb I can not find your email address. Hence, I attached my test data(unpublished) here. Once you download it I will delete it.

knausb commented 4 years ago

Hi @lihe-s ,

apologies for the slow response! R 4.0.0 was released, dplyr 1.0.0 is getting ready for release. And there's this COVID-19 thing. And I just accepted a new job. So things have been a bit busy. Here's what I see.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.11.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****

vcf <- read.vcfR("~/Desktop/sandbox/lihe/di-tetra-hex-test.vcf.gz")
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 67
#>   header_line: 68
#>   variant count: 26851
#>   column count: 15
#> Meta line 67 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 26851
#>   Character matrix gt cols: 15
#>   skip: 0
#>   nrows: 26851
#>   row_num: 0
#> Processed variant 1000Processed variant 2000Processed variant 3000Processed variant 4000Processed variant 5000Processed variant 6000Processed variant 7000Processed variant 8000Processed variant 9000Processed variant 10000Processed variant 11000Processed variant 12000Processed variant 13000Processed variant 14000Processed variant 15000Processed variant 16000Processed variant 17000Processed variant 18000Processed variant 19000Processed variant 20000Processed variant 21000Processed variant 22000Processed variant 23000Processed variant 24000Processed variant 25000Processed variant 26000Processed variant: 26851
#> All variants processed

vcf
#> ***** Object of Class vcfR *****
#> 6 samples
#> 1 CHROMs
#> 26,851 variants
#> Object size: 16.9 Mb
#> 0 percent missing data
#> *****        *****         *****
head(vcf)
#> [1] "***** Object of class 'vcfR' *****"
#> [1] "***** Meta section *****"
#> [1] "##fileformat=VCFv4.2"
#> [1] "##ALT=<ID=NON_REF,Description=\"Represents any possible alternative a [Truncated]"
#> [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">"
#> [1] "##FILTER=<ID=PASS,Description=\"All filters passed\">"
#> [1] "##FILTER=<ID=my_filters,Description=\"QD < 2.0 || FS > 60.0 || MQ < 4 [Truncated]"
#> [1] "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths fo [Truncated]"
#> [1] "First 6 rows."
#> [1] 
#> [1] "***** Fixed section *****"
#>      CHROM   POS    ID REF ALT QUAL       FILTER
#> [1,] "chr01" "1339" NA "G" "A" "479.04"   "PASS"
#> [2,] "chr01" "1405" NA "G" "A" "25686.77" "PASS"
#> [3,] "chr01" "1450" NA "C" "T" "738.45"   "PASS"
#> [4,] "chr01" "3565" NA "G" "A" "9565.07"  "PASS"
#> [5,] "chr01" "3610" NA "G" "T" "16297.58" "PASS"
#> [6,] "chr01" "3658" NA "A" "G" "2039.43"  "PASS"
#> [1] 
#> [1] "***** Genotype section *****"
#>      FORMAT                  
#> [1,] "GT:AD:DP:GQ:PL"        
#> [2,] "GT:AD:DP:GQ:PGT:PID:PL"
#> [3,] "GT:AD:DP:GQ:PL"        
#> [4,] "GT:AD:DP:GQ:PGT:PID:PL"
#> [5,] "GT:AD:DP:GQ:PGT:PID:PL"
#> [6,] "GT:AD:DP:GQ:PGT:PID:PL"
#>      BSJ_GL19_42.clean                              
#> [1,] "0/0/0/0/0/0:13,0:13:10:0,10,23,39,62,101,543" 
#> [2,] "./././././.:24,0:24:.:.:.:0,0,0,0,0,0,0"      
#> [3,] "0/0/0/0/0/0:22,0:22:17:0,17,37,63,100,163,945"
#> [4,] "0/0/0/0/0/0:9,0:9:0:.:.:0,0,0,0,0,0,224"      
#> [5,] "0/0/0/0/0/1:6,1:7:3:.:.:32,0,3,9,18,35,257"   
#> [6,] "0/0/0/0/0/0:13,0:13:0:.:.:0,0,0,0,7,38,398"   
#>      BSJ_GL19_46.clean                   
#> [1,] "0/0:3,0:3:9:0,9,113"               
#> [2,] "1/1:0,6:6:18:.:.:270,18,0"         
#> [3,] "0/0:14,0:14:42:0,42,532"           
#> [4,] "1/1:0,5:5:15:1|1:3565_G_A:225,15,0"
#> [5,] "1/1:0,4:4:12:.:.:180,12,0"         
#> [6,] "0/0:2,0:2:6:0|1:3658_A_G:0,6,90"   
#>      BSJ_GL19_47.clean                   
#> [1,] "0/0:5,0:5:15:0,15,197"             
#> [2,] "1/1:0,8:8:24:.:.:340,24,0"         
#> [3,] "0/0:16,0:16:48:0,48,588"           
#> [4,] "1/1:0,5:5:15:1|1:3565_G_A:225,15,0"
#> [5,] "1/1:0,7:7:21:1|1:3600_T_G:312,21,0"
#> [6,] "0/0:9,0:9:0:.:.:0,0,270"           
#>      BSJ_GL19_48.clean                        
#> [1,] "0/0/1/1:8,13:21:4:508,23,0,4,339"       
#> [2,] "0/0/0/1:22,8:30:16:.:.:283,0,16,72,958" 
#> [3,] "0/0/0/0:40,0:40:44:0,44,105,211,1575"   
#> [4,] "0/0/0/0:40,0:40:13:.:.:0,13,78,193,1474"
#> [5,] "0/0/0/0:36,0:36:0:.:.:0,0,0,0,927"      
#> [6,] "0/0/0/0:37,0:37:0:.:.:0,0,27,128,1246"  
#>      BSJ_GL19_49.cleancb                 
#> [1,] "0/0:4,0:4:12:0,12,167"             
#> [2,] "./.:8,0:8:.:.:.:0,0,0"             
#> [3,] "0/0:6,0:6:12:0,12,180"             
#> [4,] "0/0:9,0:9:27:.:.:0,27,338"         
#> [5,] "1/1:0,8:8:24:1|1:3600_T_G:360,24,0"
#> [6,] "0/0:7,0:7:15:.:.:0,15,225"         
#> [1] "First 6 columns only."
#> [1] 
#> [1] "Unique GT formats:"
#> [1] "GT:AD:DP:GQ:PL"         "GT:AD:DP:GQ:PGT:PID:PL"
#> [3] "GT:AD:DP:PGT:PID:PL"    "GT:AD:DP:PL"           
#> [1]

# vcfR2genlight uses extract.gt(), so let's see if it works.

gt <- extract.gt(vcf)
gt[1:4, ]
#>            BSJ_GL19_42.clean BSJ_GL19_46.clean BSJ_GL19_47.clean
#> chr01_1339 "0/0/0/0/0/0"     "0/0"             "0/0"            
#> chr01_1405 NA                "1/1"             "1/1"            
#> chr01_1450 "0/0/0/0/0/0"     "0/0"             "0/0"            
#> chr01_3565 "0/0/0/0/0/0"     "1/1"             "1/1"            
#>            BSJ_GL19_48.clean BSJ_GL19_49.cleancb BSJ_GL19_50.cleancb
#> chr01_1339 "0/0/1/1"         "0/0"               NA                 
#> chr01_1405 "0/0/0/1"         NA                  "1/1"              
#> chr01_1450 "0/0/0/0"         "0/0"               "0/0"              
#> chr01_3565 "0/0/0/0"         "0/0"               "0/0"

my_gl <- vcfR2genlight(vcf)
#> Warning in vcfR2genlight(vcf): Found 2028 loci with more than two alleles.
#> Objects of class genlight only support loci with two alleles.
#> 2028 loci will be omitted from the genlight object.
#> Loading required namespace: adegenet
#> Registered S3 method overwritten by 'spdep':
#>   method   from
#>   plot.mst ape
#> Warning in initialize(value, ...): NAs introduced by coercion
#> Warning in initialize(value, ...): NAs introduced by coercion
my_gl
#>  /// GENLIGHT OBJECT /////////
#> 
#>  // 6 genotypes,  24,823 binary SNPs, size: 2.1 Mb
#>  51032 (34.26 %) missing data
#> 
#>  // Basic content
#>    @gen: list of 6 SNPbin
#> 
#>  // Optional content
#>    @ind.names:  6 individual labels
#>    @loc.names:  24823 locus labels
#>    @chromosome: factor storing chromosomes of the SNPs
#>    @position: integer storing positions of the SNPs
#>    @other: a list containing: elements without names

my_gl@gen
#> [[1]]
#> /// SNPBIN OBJECT /////////
#>  24,823 SNPs coded as bits, size: 101.4 Kb
#>  Ploidy: NA
#>  24823 (100 %) missing data
#> 
#> [[2]]
#> /// SNPBIN OBJECT /////////
#>  24,823 SNPs coded as bits, size: 8.8 Kb
#>  Ploidy: 2
#>  327 (1.32 %) missing data
#> 
#> [[3]]
#> /// SNPBIN OBJECT /////////
#>  24,823 SNPs coded as bits, size: 9 Kb
#>  Ploidy: 2
#>  396 (1.6 %) missing data
#> 
#> [[4]]
#> /// SNPBIN OBJECT /////////
#>  24,823 SNPs coded as bits, size: 101.4 Kb
#>  Ploidy: NA
#>  24823 (100 %) missing data
#> 
#> [[5]]
#> /// SNPBIN OBJECT /////////
#>  24,823 SNPs coded as bits, size: 8.8 Kb
#>  Ploidy: 2
#>  333 (1.34 %) missing data
#> 
#> [[6]]
#> /// SNPBIN OBJECT /////////
#>  24,823 SNPs coded as bits, size: 8.8 Kb
#>  Ploidy: 2
#>  330 (1.33 %) missing data

as.matrix(my_gl)[1:4, 1:5]
#>                   chr01_1339 chr01_1405 chr01_1450 chr01_3565 chr01_3610
#> BSJ_GL19_42.clean         NA         NA         NA         NA         NA
#> BSJ_GL19_46.clean          0          2          0          2          2
#> BSJ_GL19_47.clean          0          2          0          2          2
#> BSJ_GL19_48.clean         NA         NA         NA         NA         NA

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.6 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] vcfR_1.11.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridisLite_0.3.0  splines_3.6.3      gtools_3.8.1      
#>  [4] shiny_1.4.0        memuse_4.0-0       expm_0.999-4      
#>  [7] sp_1.3-1           highr_0.8          yaml_2.2.0        
#> [10] LearnBayes_2.15.1  pillar_1.4.3       lattice_0.20-40   
#> [13] glue_1.3.2         digest_0.6.25      promises_1.1.0    
#> [16] colorspace_1.4-1   htmltools_0.4.0    httpuv_1.5.2      
#> [19] Matrix_1.2-18      plyr_1.8.4         pkgconfig_2.0.3   
#> [22] gmodels_2.18.1     purrr_0.3.3        xtable_1.8-4      
#> [25] scales_1.0.0       gdata_2.18.0       later_1.0.0       
#> [28] tibble_2.1.3       mgcv_1.8-31        ggplot2_3.2.1     
#> [31] lazyeval_0.2.2     magrittr_1.5       crayon_1.3.4      
#> [34] mime_0.7           deldir_0.1-23      evaluate_0.14     
#> [37] nlme_3.1-144       MASS_7.3-51.5      class_7.3-16      
#> [40] vegan_2.5-5        tools_3.6.3        lifecycle_0.2.0   
#> [43] stringr_1.4.0      munsell_0.5.0      cluster_2.1.0     
#> [46] ade4_1.7-13        compiler_3.6.3     e1071_1.7-2       
#> [49] rlang_0.4.5.9000   classInt_0.4-2     units_0.6-5       
#> [52] grid_3.6.3         igraph_1.2.4.1     rmarkdown_2.0     
#> [55] boot_1.3-23        gtable_0.3.0       DBI_1.0.0         
#> [58] reshape2_1.4.3     R6_2.4.1           knitr_1.23        
#> [61] dplyr_0.8.99.9002  pinfsc50_1.1.0     fastmap_1.0.1     
#> [64] seqinr_3.6-1       adegenet_2.1.1     spdep_1.1-3       
#> [67] KernSmooth_2.23-16 permute_0.9-5      ape_5.3           
#> [70] stringi_1.4.3      parallel_3.6.3     Rcpp_1.0.1        
#> [73] vctrs_0.2.99.9010  sf_0.8-0           spData_0.3.2      
#> [76] tidyselect_1.0.0   xfun_0.8           coda_0.19-2

Created on 2020-05-10 by the reprex package (v0.3.0)

I'm not saying that this is the behaviour you're looking for. But I am saying that I do not feel that I'm reproducing your reported behaviour. Usually we ask folks to upgrade to the latest versions of all software. The release of R 4.0.0 (2020-04-24) may come with some "growing pains", so I wouldn't recomend that. But I think you should check that you have similar versions to what I'm running with the hope that this will reproduce the behaviour I'm seeing on my end. And I'm going to add that I'm preparing a new release of vcfR to address the release of dplyr 1.0.0. So everything is changing, apologies. Thanks! Brian

lihe-s commented 4 years ago

@knausb Congratulations on your new job. Thanks a lot for your detailed issue description. So the new version of vcfR will compatible my situation? Regards, Li

knausb commented 4 years ago

Hi @lihe-s , I don't think you understand. I have failed to reproduce the behavior that you have reported. This means I can't address your issue. I suspect the issue is that you are using an outdated version of adegenet. In the above output I show that I'm using adegenet_2.1.1. I suggest you check to make sure that you are using that version, or a more recent version.

While my results look closer to success than your reported behavior, I don't think we have a solution. The results from extract.gt() show that sample BSJ_GL19_42.clean for variant chr01_1339 has the genotype "0/0/0/0/0/0" and variant chr01_1405 has the genotype NA. The genlight object shows your samples as diploid (perhaps we need to specify ploidy to help it) and shows both of these variants as NA. So I do not feel this data has been handled correctly. Do you feel these data were handled correctly? Thanks! Brian

lihe-s commented 4 years ago

@knausb Thanks for your reply. I have to say your results are similar to mine. It is true that "variant chr01_1339 has the genotype "0/0/0/0/0/0" and variant chr01_1405 has the genotype NA". If you check less -S the ".vcf.gz" file you will find the chr01_1405 is a missing site "./././././.". However, it is not that I asked. After you conversion by using "my_gl <- vcfR2genlight(vcf)", chr01_1339 "0/0/0/0/0/0" become NA. That's why I wrote "vcfR2genlight failed to convert polyploid of mixed ploidy vcf (obtained by GATK)". It means vcfR2genlight can only correctly convert diploid samples, but failed for polyploid samples. I hope this make my issue clear. Thanks again.

knausb commented 4 years ago

Hi @lihe-s , I'm afraid I now see that I was the one who did not understand. I reviewed the code for vcfR2genlight() and I'm afraid it's hard coded for diploids. I would like to support mixed ploid data sets. But I'm afraid I see that as a task that will require a lot of time and effort, and I do not feel I currently have the time. I suggest we leave this issue open with the hope I can get to it eventually. But it does not appear I will be able to help you in the near future with vcfR. You might be able to come up with a solution with vcfR::extract.gt() and creating a genlight object out of the matrix of genotypes. If you succeed, I'd appreciate it if you shared your solution with me. Thanks!