knausb / vcfR

Tools to work with variant call format files
245 stars 55 forks source link

Convert vcf with varying ploidy level #117

Open V-Z opened 5 years ago

V-Z commented 5 years ago

Hello, I imported vcf containing diploids and tetraploids as data_vcf <- read.vcfR(file="alpine.vcf.gz", limit=1e+10, nrows=-1, skip=0, cols=NULL, convertNA=TRUE, checkFile=TRUE, check_keys=TRUE, verbose=TRUE) and I'd like to convert it into DNAbin with extracted haplotypes (one haplotype per sequence). I use data_dna <- vcfR2DNAbin(x=data_vcf, extract.indels=FALSE, consensus=FALSE, extract.haps=TRUE, unphased_as_NA=FALSE, asterisk_as_del=FALSE, ref.seq=NULL, start.pos=NULL, verbose=TRUE). The ouptut generally looks nice, with sequences named like AA016_a_0, AA016_a_1, AA016ac_0, AA016ac_1, AA016ak_0, AA016ak_1, ..., but I have 2 sequences also for tetraploids, not only for diploids. I tried conversion into loci (data_loci <- vcfR2loci(x=data_vcf)) and there I get alleles like 1/1, 0/0, 1/1/1/1, 0/0/0/0, 0/0/0/1, .... It looks correct. Conversion to genind (data_genind <- vcfR2genind(x=data_vcf, sep="|")) gives me warning The observed allele dosage (0-7) does not match the defined ploidy (2-2). Please check that your input parameters (ncode, sep) are correct. and @ploidy slot shows only diploids. Conversion to genlight (data_genlight <- vcfR2genlight(x=data_vcf, n.cores=3)) shows warnings NAs introduced by coercion. I'm not sure, what exactly it means here, but @ploidy slot is empty and included SNPbin objects contain ploidies 2 or NA. So it seems some conversions of vcf with more ploidy levels do not work correctly. I didn't find much about handling of more ploidy levels in the documentation, but some conversions are obviously wrong.

knausb commented 5 years ago

Hello V-Z,

The function vcfR2DNAbin handles haploids and diploids but not higher ploids. You can access the documentation for this function as follows.

?vcfR2DNAbin

I've bolded the word ploidy there to make that section easy to find. The critical issue here is "how are you going to phase higher ploid data?" I do not know of any software that will accomplish that. You appear to be trying to set unphased_as_NA=FALSE, if this were to be functional you would generate chimeric sequences instead of recreating the chromosomes in your organism of interest. So I see no way to implement this in a reasonable manner.

At the time I wrote vcfR2genind all samples in the genind object had to be the same ploidy. I checked the adegenet github site and in their ChangeLog they report that as of version 2.0 the genind object can handle mixed ploidy. I was not aware of this update, thank you for bringing it to my attention. I suspect something similar has happened for the genlight object. I'll try to look into this

V-Z commented 5 years ago

Thank You, @knausb, for Your fast reply, I'm sorry, I overlooked that warning in documentation. For this particular purpose I wouldn't mind chimeric sequences. In any case, it's interesting which output I get with vcfR2DNAbin(). Do You think it'd be complicated to implement it for phased data? Anyway, what does the conversion do now (for unphased diploids)? It doesn't give me any warning that something could go wrong. Adegenet has been handling mixed ploidies pretty well, for genind as well as genlight, it works very well. Do You think there will be any changes in the vcfR export functions?

knausb commented 5 years ago

Hi V-Z,

I've documented the results for unphased diploids at the below link.

https://knausb.github.io/vcfR_documentation/dnabin.html

It uses IUPAC ambiguity codes. I see this as suboptimal. But I also see it as common practice. But it doesn't extend well beyond diploids.

Could you elaborate on why chimeric sequences could be useful? I can't think of a reason. And I kinda don't want to be an enabler for people to do silly things.

knausb commented 5 years ago

In order to address the adegenet conversion functions we need a minimal reproducible example. First we'll need example data.

library(vcfR)
data("vcfR_test")
vcfR_test

vcfR_test@gt

gt4 <- matrix(nrow=5, ncol=2)
colnames(gt4) <- c("NA00004", "NA00005")
gt4[1,] <- c("1|0|0|1:48:8:51,51,51,51", "1|0|0|1:48:8:51,51,51,51")
gt4[2,] <- c("1/0/0/1:48:8:51,51,51,51", "1/0/0/1:48:8:51,51,51,51")
gt4[3,] <- c("1/2/2/1:48:8:51,51,51,51", "1/2/2/1:48:8:51,51,51,51")
gt4[4,] <- c("0/0/1/2:48:8:51,51,51,51", "1/2/0/1:48:8:51,51,51,51")
gt4[5,] <- c("1/0/0/1:48:8", "1|0|0|1:48:8")
vcfR_test@gt <- cbind(vcfR_test@gt, gt4)
vcfR_test
vcfR_test@gt

This should give us a vcfR object where the samples are mixed ploid, but each individual sample is of one ploid. Now we can try to reproduce the errors.

myGenind <- vcfR2genind(vcfR_test)

results in

Loading required namespace: adegenet
Warning in adegenet::df2genind(t(x), sep = sep, ...) :
  The observed allele dosage (2-4) does not match the defined ploidy (2-2).
Please check that your input parameters (ncode, sep) are correct.

As reported, so we can work on this.

Genlight.

myGenlight <- vcfR2genlight(vcfR_test)

results in

Warning messages:
1: In vcfR2genlight(vcfR_test) : Found 2 loci with more than two alleles.
Objects of class genlight only support loci with two alleles.
2 loci will be omitted from the genlight object.
2: In initialize(value, ...) : NAs introduced by coercion
3: In initialize(value, ...) : NAs introduced by coercion

This does not appear to be the error reported. I'm using adegenet 2.1.1 and R 3.5.1 which appear to be the current versions on CRAN. @V-Z could you please report the versions of R and adegenet you're using as well as validate that this is not the message you received?

V-Z commented 5 years ago

Hello, I need to convert VCF into STRUCTURE input format. STRUCTURE works on frequencies of alleles, so I don't mind chimeras. FASTA can be very easily (sed etc.) converted into STRUCTURE. I admit this may be just corner case. My system:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=cs_CZ.UTF-8          LC_NUMERIC=C                  LC_TIME=cs_CZ.UTF-8          
 [4] LC_COLLATE=cs_CZ.UTF-8        LC_MONETARY=cs_CZ.UTF-8       LC_MESSAGES=cs_CZ.UTF-8      
 [7] LC_PAPER=cs_CZ.UTF-8          LC_NAME=cs_CZ.UTF-8           LC_ADDRESS=cs_CZ.UTF-8       
[10] LC_TELEPHONE=cs_CZ.UTF-8      LC_MEASUREMENT=cs_CZ.UTF-8    LC_IDENTIFICATION=cs_CZ.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pegas_0.11     ape_5.2        adegenet_2.1.1 ade4_1.7-13    vcfR_1.8.0     rkward_0.7.0  

loaded via a namespace (and not attached):
 [1] gtools_3.8.1      tidyselect_0.2.5  purrr_0.2.5       reshape2_1.4.3    splines_3.5.1     lattice_0.20-38  
 [7] expm_0.999-3      colorspace_1.4-0  htmltools_0.3.6   viridisLite_0.3.0 mgcv_1.8-25       rlang_0.3.0.1    
[13] pillar_1.3.0      later_0.7.5       glue_1.3.0        sp_1.3-1          bindrcpp_0.2.2    bindr_0.1.1      
[19] plyr_1.8.4        stringr_1.3.1     munsell_0.5.0     gtable_0.2.0      coda_0.19-2       permute_0.9-4    
[25] httpuv_1.4.5      parallel_3.5.1    spdep_0.7-9       Rcpp_1.0.0        pinfsc50_1.1.0    xtable_1.8-3     
[31] scales_1.0.0      promises_1.0.1    gdata_2.18.0      vegan_2.5-3       mime_0.6.1        deldir_0.1-15    
[37] ggplot2_3.1.0     digest_0.6.18     gmodels_2.18.1    stringi_1.2.4     dplyr_0.7.8       shiny_1.2.0      
[43] grid_3.5.1        LearnBayes_2.15.1 tools_3.5.1       magrittr_1.5      lazyeval_0.2.1    tibble_1.4.2     
[49] cluster_2.0.7-1   crayon_1.3.4      seqinr_3.4-6      pkgconfig_2.0.2   MASS_7.3-51.1     Matrix_1.2-15    
[55] spData_0.2.9.4    assertthat_0.2.0  R6_2.3.0          boot_1.3-20       igraph_1.2.2      nlme_3.1-137     
[61] compiler_3.5.1   

I'm indeed not sure if You get same report regarding genelight as I get:

> data_genlight <- vcfR2genlight(x=data_vcf, n.cores=3)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In initialize(value, ...) : NAs introduced by coercion
2: In initialize(value, ...) : NAs introduced by coercion
...

Did You check the Ploidy slot? If You need, I can send You the data privately (it is unpublished data set). We agree, that vcfR2loci works fine, don't we?

knausb commented 5 years ago

I agree that vcfR2loci appears to be working correctly. But that's a fairly simple one.

I'd prefer to work with the example data that I've generated. That way we can keep this conversation public. that's sort of what GitHub issues is about. We do not appear to be generating the same error with vcfR2genlight, but it may be you've got more than one issue. I'll work on my example and send something to you so you can test it on your data.

Its been a while since I worked with Structure. And I've never tried it with polyploids. More recently I've used fastStructure. But if I remember correctly it uses plink files and I do not think those support polyplpoids. Would a better solution be to create a vcfR2structure function in vcfR? Have you looked at strataG? It has a function strataG::structure(). Perhaps a solution would be a vcfR2gtypes? Sorry about all the questions but I haven't been working with these sorts of questions recently and it appears we have some options. And thanks for posting your sessionInfo().

V-Z commented 5 years ago

Hello, OK, I'll of course test any updated function. If You'd like to test with our data in the future, let me know. I don't use Plink, but STRUCTURE uses fairly simple format (similar to fastStructure, as far as I know), one line per genotype, it can handle various ploidy levels. Alleles are numbered (e.g. 1-4 instead of ATCG). I haven't looked at strataG::structure(), it looks very interesting, although it uses strataG object, which appears not to support mixed ploidies. So far I use to use ParellelStructure. It is "just" a wrapper to run STRUCTURE in parallel from R (You need "classical" STRUCTURE input and parameter files and binary). There are functions to handle STRUCTURE format in Adegenet, but I think they still handle only haploids or diploids. Of course, direct export from VCF into STRUCTURE would be nice. Conversion through FASTA was just the simplest way how to do it (=other functions, extra software, whatever, do not handle polyploids or mixing of ploidies well). Yes, working with various polyploids is still bit complex. ;-)

knausb commented 5 years ago

Hmm, it appears that the path to a gtypes object is through a multidna object which takes a DNAbin object as input. So it appears I've gone around in a circle and am back at creating a DNAbin object whether I like it or not. So it appears that is the path forward and I'll have to rely on warnings in the documentation.

knausb commented 5 years ago

Another issue has been raised that I feel is related to this one: https://github.com/knausb/vcfR/issues/121#issue-392703443. This is here to remind me these are related.

lneaves commented 5 years ago

Hi, I have a vcf with tetraploids generated by GATK pipeline. The file reads into R fine (I think using read.vcfR) but I got the errors listed above when I convert to genlight (vcfR2genlight), Warning messages: 1: In initialize(value, ...) : NAs introduced by coercion 2: In initialize(value, ...) : NAs introduced by coercion ... and the resulting genlight file says there is 100% missing data - which I'm fairly certain isn't correct.

Apologies if I have missed something obvious in the documentation - what could be causing the errors, the ploidy or something else? Thanks!

knausb commented 5 years ago

Hi @lneaves , I see this as a very different issue from this thread. You report fixed ploidy as opposed to varying ploidy. What you're asking for is very different. Please post a new issue. Also, please see this page where I've tried to provide suggestions on how to post a good issue including providing a minimal reproducible example.

dylanob commented 5 years ago

Hi @knausb! I appreciate all the effort that has gone in to creating and maintaining this package.

It appears that vcfR is the only piece of software capable of doing what I need to do, which is to import a polyploid (hexaploid) dataset into adegenet. However, when I try to run vcfR2genind, I get the same error that is detailed above (quoted below as well), warning that "the observed allele dosage (0-6) does not match the defined ploidy (2-2)." I also get the error using your test data referenced below, which makes me think that maybe the package can't handle polyploids yet. On the other hand, I have seen the package referenced in several papers that deal with polyploids, which makes me think that there is a way forward.

In any case, any updates on this would be appreciated! Thanks,

D

In order to address the adegenet conversion functions we need a minimal reproducible example. First we'll need example data.

library(vcfR)
data("vcfR_test")
vcfR_test

vcfR_test@gt

gt4 <- matrix(nrow=5, ncol=2)
colnames(gt4) <- c("NA00004", "NA00005")
gt4[1,] <- c("1|0|0|1:48:8:51,51,51,51", "1|0|0|1:48:8:51,51,51,51")
gt4[2,] <- c("1/0/0/1:48:8:51,51,51,51", "1/0/0/1:48:8:51,51,51,51")
gt4[3,] <- c("1/2/2/1:48:8:51,51,51,51", "1/2/2/1:48:8:51,51,51,51")
gt4[4,] <- c("0/0/1/2:48:8:51,51,51,51", "1/2/0/1:48:8:51,51,51,51")
gt4[5,] <- c("1/0/0/1:48:8", "1|0|0|1:48:8")
vcfR_test@gt <- cbind(vcfR_test@gt, gt4)
vcfR_test
vcfR_test@gt

This should give us a vcfR object where the samples are mixed ploid, but each individual sample is of one ploid. Now we can try to reproduce the errors.

myGenind <- vcfR2genind(vcfR_test)

results in

Loading required namespace: adegenet
Warning in adegenet::df2genind(t(x), sep = sep, ...) :
  The observed allele dosage (2-4) does not match the defined ploidy (2-2).
Please check that your input parameters (ncode, sep) are correct.

As reported, so we can work on this.

Genlight.

myGenlight <- vcfR2genlight(vcfR_test)

results in

Warning messages:
1: In vcfR2genlight(vcfR_test) : Found 2 loci with more than two alleles.
Objects of class genlight only support loci with two alleles.
2 loci will be omitted from the genlight object.
2: In initialize(value, ...) : NAs introduced by coercion
3: In initialize(value, ...) : NAs introduced by coercion

This does not appear to be the error reported. I'm using adegenet 2.1.1 and R 3.5.1 which appear to be the current versions on CRAN. @V-Z could you please report the versions of R and adegenet you're using as well as validate that this is not the message you received?

knausb commented 5 years ago

Hi @dylanob , I'm afraid I do not have any updates on this issue. I have not found much time to code lately. But I do consider this to be an open issue. Perhaps knowing that another individual is interested will help me raise its priority. Thanks!

dylanob commented 5 years ago

Hi there. Yeah, I understand. I'm going to keep working on a work-around, but at the end of the day I'm beginning to wonder if there's much point in trying to model these as polyploid at all. I suspect that they are allopolyploid, which means that most downstream software is not going to be able to deal with their genetics, given that I don't have any knowledge of which genotypes belong to which of the chromosome sets. But at a minimum, I'd like to be able to model the genotypes as hexaploid in poppr, where the software is able to use appropriate assumptions about genotype frequencies. Thanks again!

RGoess commented 4 years ago

Hello, I am having a similar issue. I have a vcf generated with GATK with diploid and triploid individuals. I'm also looking into importing my vcf into adegenet as genind but have not found a solution so far (my goal is also to use poppr in the end). Is there any update on this? Thanks.

knausb commented 4 years ago

Hi @RGoess , I'm afraid I do not have an update on this. I consider this an open issue because I would like to provide this functionality some day. But I do not currently work on any projects that have mixed ploidy samples (at least as far as we know) so this is a low priority for me. It also sounds like a high labor feature to implement, and I just haven't found the time to work on this. I'm afraid I would recommend that you do not wait for vcfR to implement this feature. Hopefully someday...