Open V-Z opened 6 years ago
Hello V-Z,
The function vcfR2DNAbin
handles haploids and diploids but not higher ploids. You can access the documentation for this function as follows.
?vcfR2DNAbin
I've bolded the word ploidy there to make that section easy to find. The critical issue here is "how are you going to phase higher ploid data?" I do not know of any software that will accomplish that. You appear to be trying to set unphased_as_NA=FALSE
, if this were to be functional you would generate chimeric sequences instead of recreating the chromosomes in your organism of interest. So I see no way to implement this in a reasonable manner.
At the time I wrote vcfR2genind
all samples in the genind
object had to be the same ploidy. I checked the adegenet github site and in their ChangeLog they report that as of version 2.0 the genind object can handle mixed ploidy. I was not aware of this update, thank you for bringing it to my attention. I suspect something similar has happened for the genlight
object. I'll try to look into this
Thank You, @knausb,
for Your fast reply, I'm sorry, I overlooked that warning in documentation. For this particular purpose I wouldn't mind chimeric sequences. In any case, it's interesting which output I get with vcfR2DNAbin()
. Do You think it'd be complicated to implement it for phased data? Anyway, what does the conversion do now (for unphased diploids)? It doesn't give me any warning that something could go wrong.
Adegenet has been handling mixed ploidies pretty well, for genind
as well as genlight
, it works very well. Do You think there will be any changes in the vcfR export functions?
Hi V-Z,
I've documented the results for unphased diploids at the below link.
https://knausb.github.io/vcfR_documentation/dnabin.html
It uses IUPAC ambiguity codes. I see this as suboptimal. But I also see it as common practice. But it doesn't extend well beyond diploids.
Could you elaborate on why chimeric sequences could be useful? I can't think of a reason. And I kinda don't want to be an enabler for people to do silly things.
In order to address the adegenet
conversion functions we need a minimal reproducible example. First we'll need example data.
library(vcfR)
data("vcfR_test")
vcfR_test
vcfR_test@gt
gt4 <- matrix(nrow=5, ncol=2)
colnames(gt4) <- c("NA00004", "NA00005")
gt4[1,] <- c("1|0|0|1:48:8:51,51,51,51", "1|0|0|1:48:8:51,51,51,51")
gt4[2,] <- c("1/0/0/1:48:8:51,51,51,51", "1/0/0/1:48:8:51,51,51,51")
gt4[3,] <- c("1/2/2/1:48:8:51,51,51,51", "1/2/2/1:48:8:51,51,51,51")
gt4[4,] <- c("0/0/1/2:48:8:51,51,51,51", "1/2/0/1:48:8:51,51,51,51")
gt4[5,] <- c("1/0/0/1:48:8", "1|0|0|1:48:8")
vcfR_test@gt <- cbind(vcfR_test@gt, gt4)
vcfR_test
vcfR_test@gt
This should give us a vcfR object where the samples are mixed ploid, but each individual sample is of one ploid. Now we can try to reproduce the errors.
myGenind <- vcfR2genind(vcfR_test)
results in
Loading required namespace: adegenet
Warning in adegenet::df2genind(t(x), sep = sep, ...) :
The observed allele dosage (2-4) does not match the defined ploidy (2-2).
Please check that your input parameters (ncode, sep) are correct.
As reported, so we can work on this.
Genlight.
myGenlight <- vcfR2genlight(vcfR_test)
results in
Warning messages:
1: In vcfR2genlight(vcfR_test) : Found 2 loci with more than two alleles.
Objects of class genlight only support loci with two alleles.
2 loci will be omitted from the genlight object.
2: In initialize(value, ...) : NAs introduced by coercion
3: In initialize(value, ...) : NAs introduced by coercion
This does not appear to be the error reported. I'm using adegenet 2.1.1
and R 3.5.1
which appear to be the current versions on CRAN. @V-Z could you please report the versions of R
and adegenet
you're using as well as validate that this is not the message you received?
Hello,
I need to convert VCF into STRUCTURE input format. STRUCTURE works on frequencies of alleles, so I don't mind chimeras. FASTA can be very easily (sed
etc.) converted into STRUCTURE. I admit this may be just corner case.
My system:
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=cs_CZ.UTF-8 LC_NUMERIC=C LC_TIME=cs_CZ.UTF-8
[4] LC_COLLATE=cs_CZ.UTF-8 LC_MONETARY=cs_CZ.UTF-8 LC_MESSAGES=cs_CZ.UTF-8
[7] LC_PAPER=cs_CZ.UTF-8 LC_NAME=cs_CZ.UTF-8 LC_ADDRESS=cs_CZ.UTF-8
[10] LC_TELEPHONE=cs_CZ.UTF-8 LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=cs_CZ.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pegas_0.11 ape_5.2 adegenet_2.1.1 ade4_1.7-13 vcfR_1.8.0 rkward_0.7.0
loaded via a namespace (and not attached):
[1] gtools_3.8.1 tidyselect_0.2.5 purrr_0.2.5 reshape2_1.4.3 splines_3.5.1 lattice_0.20-38
[7] expm_0.999-3 colorspace_1.4-0 htmltools_0.3.6 viridisLite_0.3.0 mgcv_1.8-25 rlang_0.3.0.1
[13] pillar_1.3.0 later_0.7.5 glue_1.3.0 sp_1.3-1 bindrcpp_0.2.2 bindr_0.1.1
[19] plyr_1.8.4 stringr_1.3.1 munsell_0.5.0 gtable_0.2.0 coda_0.19-2 permute_0.9-4
[25] httpuv_1.4.5 parallel_3.5.1 spdep_0.7-9 Rcpp_1.0.0 pinfsc50_1.1.0 xtable_1.8-3
[31] scales_1.0.0 promises_1.0.1 gdata_2.18.0 vegan_2.5-3 mime_0.6.1 deldir_0.1-15
[37] ggplot2_3.1.0 digest_0.6.18 gmodels_2.18.1 stringi_1.2.4 dplyr_0.7.8 shiny_1.2.0
[43] grid_3.5.1 LearnBayes_2.15.1 tools_3.5.1 magrittr_1.5 lazyeval_0.2.1 tibble_1.4.2
[49] cluster_2.0.7-1 crayon_1.3.4 seqinr_3.4-6 pkgconfig_2.0.2 MASS_7.3-51.1 Matrix_1.2-15
[55] spData_0.2.9.4 assertthat_0.2.0 R6_2.3.0 boot_1.3-20 igraph_1.2.2 nlme_3.1-137
[61] compiler_3.5.1
I'm indeed not sure if You get same report regarding genelight
as I get:
> data_genlight <- vcfR2genlight(x=data_vcf, n.cores=3)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In initialize(value, ...) : NAs introduced by coercion
2: In initialize(value, ...) : NAs introduced by coercion
...
Did You check the Ploidy
slot?
If You need, I can send You the data privately (it is unpublished data set). We agree, that vcfR2loci
works fine, don't we?
I agree that vcfR2loci
appears to be working correctly. But that's a fairly simple one.
I'd prefer to work with the example data that I've generated. That way we can keep this conversation public. that's sort of what GitHub issues is about. We do not appear to be generating the same error with vcfR2genlight
, but it may be you've got more than one issue. I'll work on my example and send something to you so you can test it on your data.
Its been a while since I worked with Structure. And I've never tried it with polyploids. More recently I've used fastStructure. But if I remember correctly it uses plink
files and I do not think those support polyplpoids. Would a better solution be to create a vcfR2structure
function in vcfR? Have you looked at strataG? It has a function strataG::structure()
. Perhaps a solution would be a vcfR2gtypes
? Sorry about all the questions but I haven't been working with these sorts of questions recently and it appears we have some options. And thanks for posting your sessionInfo()
.
Hello,
OK, I'll of course test any updated function. If You'd like to test with our data in the future, let me know.
I don't use Plink, but STRUCTURE uses fairly simple format (similar to fastStructure, as far as I know), one line per genotype, it can handle various ploidy levels. Alleles are numbered (e.g. 1-4 instead of ATCG). I haven't looked at strataG::structure()
, it looks very interesting, although it uses strataG
object, which appears not to support mixed ploidies.
So far I use to use ParellelStructure. It is "just" a wrapper to run STRUCTURE in parallel from R (You need "classical" STRUCTURE input and parameter files and binary).
There are functions to handle STRUCTURE format in Adegenet, but I think they still handle only haploids or diploids. Of course, direct export from VCF into STRUCTURE would be nice. Conversion through FASTA was just the simplest way how to do it (=other functions, extra software, whatever, do not handle polyploids or mixing of ploidies well).
Yes, working with various polyploids is still bit complex. ;-)
Hmm, it appears that the path to a gtypes
object is through a multidna object which takes a DNAbin
object as input. So it appears I've gone around in a circle and am back at creating a DNAbin
object whether I like it or not. So it appears that is the path forward and I'll have to rely on warnings in the documentation.
Another issue has been raised that I feel is related to this one: https://github.com/knausb/vcfR/issues/121#issue-392703443. This is here to remind me these are related.
Hi,
I have a vcf with tetraploids generated by GATK pipeline. The file reads into R fine (I think using read.vcfR) but I got the errors listed above when I convert to genlight (vcfR2genlight),
Warning messages: 1: In initialize(value, ...) : NAs introduced by coercion 2: In initialize(value, ...) : NAs introduced by coercion ...
and the resulting genlight file says there is 100% missing data - which I'm fairly certain isn't correct.
Apologies if I have missed something obvious in the documentation - what could be causing the errors, the ploidy or something else? Thanks!
Hi @lneaves , I see this as a very different issue from this thread. You report fixed ploidy as opposed to varying ploidy. What you're asking for is very different. Please post a new issue. Also, please see this page where I've tried to provide suggestions on how to post a good issue including providing a minimal reproducible example.
Hi @knausb! I appreciate all the effort that has gone in to creating and maintaining this package.
It appears that vcfR is the only piece of software capable of doing what I need to do, which is to import a polyploid (hexaploid) dataset into adegenet. However, when I try to run vcfR2genind, I get the same error that is detailed above (quoted below as well), warning that "the observed allele dosage (0-6) does not match the defined ploidy (2-2)." I also get the error using your test data referenced below, which makes me think that maybe the package can't handle polyploids yet. On the other hand, I have seen the package referenced in several papers that deal with polyploids, which makes me think that there is a way forward.
In any case, any updates on this would be appreciated! Thanks,
D
In order to address the
adegenet
conversion functions we need a minimal reproducible example. First we'll need example data.library(vcfR) data("vcfR_test") vcfR_test vcfR_test@gt gt4 <- matrix(nrow=5, ncol=2) colnames(gt4) <- c("NA00004", "NA00005") gt4[1,] <- c("1|0|0|1:48:8:51,51,51,51", "1|0|0|1:48:8:51,51,51,51") gt4[2,] <- c("1/0/0/1:48:8:51,51,51,51", "1/0/0/1:48:8:51,51,51,51") gt4[3,] <- c("1/2/2/1:48:8:51,51,51,51", "1/2/2/1:48:8:51,51,51,51") gt4[4,] <- c("0/0/1/2:48:8:51,51,51,51", "1/2/0/1:48:8:51,51,51,51") gt4[5,] <- c("1/0/0/1:48:8", "1|0|0|1:48:8") vcfR_test@gt <- cbind(vcfR_test@gt, gt4) vcfR_test vcfR_test@gt
This should give us a vcfR object where the samples are mixed ploid, but each individual sample is of one ploid. Now we can try to reproduce the errors.
myGenind <- vcfR2genind(vcfR_test)
results in
Loading required namespace: adegenet Warning in adegenet::df2genind(t(x), sep = sep, ...) : The observed allele dosage (2-4) does not match the defined ploidy (2-2). Please check that your input parameters (ncode, sep) are correct.
As reported, so we can work on this.
Genlight.
myGenlight <- vcfR2genlight(vcfR_test)
results in
Warning messages: 1: In vcfR2genlight(vcfR_test) : Found 2 loci with more than two alleles. Objects of class genlight only support loci with two alleles. 2 loci will be omitted from the genlight object. 2: In initialize(value, ...) : NAs introduced by coercion 3: In initialize(value, ...) : NAs introduced by coercion
This does not appear to be the error reported. I'm using
adegenet 2.1.1
andR 3.5.1
which appear to be the current versions on CRAN. @V-Z could you please report the versions ofR
andadegenet
you're using as well as validate that this is not the message you received?
Hi @dylanob , I'm afraid I do not have any updates on this issue. I have not found much time to code lately. But I do consider this to be an open issue. Perhaps knowing that another individual is interested will help me raise its priority. Thanks!
Hi there. Yeah, I understand. I'm going to keep working on a work-around, but at the end of the day I'm beginning to wonder if there's much point in trying to model these as polyploid at all. I suspect that they are allopolyploid, which means that most downstream software is not going to be able to deal with their genetics, given that I don't have any knowledge of which genotypes belong to which of the chromosome sets. But at a minimum, I'd like to be able to model the genotypes as hexaploid in poppr, where the software is able to use appropriate assumptions about genotype frequencies. Thanks again!
Hello, I am having a similar issue. I have a vcf generated with GATK with diploid and triploid individuals. I'm also looking into importing my vcf into adegenet as genind but have not found a solution so far (my goal is also to use poppr in the end). Is there any update on this? Thanks.
Hi @RGoess , I'm afraid I do not have an update on this. I consider this an open issue because I would like to provide this functionality some day. But I do not currently work on any projects that have mixed ploidy samples (at least as far as we know) so this is a low priority for me. It also sounds like a high labor feature to implement, and I just haven't found the time to work on this. I'm afraid I would recommend that you do not wait for vcfR to implement this feature. Hopefully someday...
Hello, I imported
vcf
containing diploids and tetraploids asdata_vcf <- read.vcfR(file="alpine.vcf.gz", limit=1e+10, nrows=-1, skip=0, cols=NULL, convertNA=TRUE, checkFile=TRUE, check_keys=TRUE, verbose=TRUE)
and I'd like to convert it into DNAbin with extracted haplotypes (one haplotype per sequence). I usedata_dna <- vcfR2DNAbin(x=data_vcf, extract.indels=FALSE, consensus=FALSE, extract.haps=TRUE, unphased_as_NA=FALSE, asterisk_as_del=FALSE, ref.seq=NULL, start.pos=NULL, verbose=TRUE)
. The ouptut generally looks nice, with sequences named likeAA016_a_0, AA016_a_1, AA016ac_0, AA016ac_1, AA016ak_0, AA016ak_1, ...
, but I have 2 sequences also for tetraploids, not only for diploids. I tried conversion intoloci
(data_loci <- vcfR2loci(x=data_vcf)
) and there I get alleles like1/1, 0/0, 1/1/1/1, 0/0/0/0, 0/0/0/1, ...
. It looks correct. Conversion togenind
(data_genind <- vcfR2genind(x=data_vcf, sep="|")
) gives me warningThe observed allele dosage (0-7) does not match the defined ploidy (2-2). Please check that your input parameters (ncode, sep) are correct.
and@ploidy
slot shows only diploids. Conversion togenlight
(data_genlight <- vcfR2genlight(x=data_vcf, n.cores=3)
) shows warningsNAs introduced by coercion
. I'm not sure, what exactly it means here, but@ploidy
slot is empty and includedSNPbin
objects contain ploidies2
orNA
. So it seems some conversions ofvcf
with more ploidy levels do not work correctly. I didn't find much about handling of more ploidy levels in the documentation, but some conversions are obviously wrong.