Closed zkamvar closed 6 years ago
Hi Zhian, Let's separate this into the issue and proposed solutions. The issue is in how zeros are interpreted by various software. In the VCF specification, zero is the reference allele and should be fairly common. In poppr zeros are loci with ambiguous allelic dosage. So we need to handle this as we transfer VCF data to genind objects. Does this sound correct?
A possible solution would be to add one to each allele. This would convert the allele information from a zero based system to a one based system. I feel this may be a better solution, but I don't know of a good way to implement it. Because each genotype would have to be converted to alleles, add the one, and converted back to a genotype, this could incur a computationally intensive step. But if you're using a genind object you might not have a large dataset so I may be worrying about nothing.
Your solution appears to avoid this situation by using the alleles as opposed to the VCF genotypes. I'm a bit hesitant to implement this because I worry about performance. But I tend to worry about performance before it actually is an issue. I'll work on this and get back to you.
My perspective here is that it would ultimately be better for the user to return the actual alleles instead of the 0/1 representation.
A couple of reasons why I think so:
The only downside here is the performance cost. I would argue that it’s negligible even if the performance is cut in half for the following reasons
Sent from my iPhone
On Sep 27, 2018, at 00:51, Brian Knaus notifications@github.com wrote:
Hi Zhian, Let's separate this into the issue and proposed solutions. The issue is in how zeros are interpreted by various software. In the VCF specification, zero is the reference allele and should be fairly common. In poppr zeros are loci with ambiguous allelic dosage. So we need to handle this as we transfer VCF data to genind objects. Does this sound correct?
A possible solution would be to add one to each allele. This would convert the allele information from a zero based system to a one based system. I feel this may be a better solution, but I don't know of a good way to implement it. Because each genotype would have to be converted to alleles, add the one, and converted back to a genotype, this could incur a computationally intensive step. But if you're using a genind object you might not have a large dataset so I may be worrying about nothing.
Your solution appears to avoid this situation by using the alleles as opposed to the VCF genotypes. I'm a bit hesitant to implement this because I worry about performance. But I tend to worry about performance before it actually is an issue. I'll work on this and get back to you.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
Also, if you are uncomfortable changing the default behavior, I would suggest adding it as an extra argument so that the user has the choice of whether or not to incur the performance cost.
I've made the changes at 5784d648052c95763bbb384da0ba471d920d9d61. The functions vcfR2genind()
and vcfR2loci()
now have the parameter return.alleles=FALSE
. I think you make a good argument about requiring people to look up what the alleles are. But I've tried to maintain data as 'VCF-like' as possible. And its always easier to do nothing than something. Can you live with that?
The changes are now on master and you can give it a try if you'd like.
devtools::install_github(repo="knausb/vcfR")
Thanks!
Thank you for making the change! I would still argue that the default to return.alleles = TRUE
would be a more sensible default for the reasons I mentioned earlier, but it's not necessarily a hill that I would die on (though I may add a PR to update the documentation with a (not run) example showing why one will want to use return.alleles = TRUE
)
Today, weird behavior was observed in poppr due to a genind object created from
vcfR2genind()
. This was due to the fact thatvcfR2genind()
will export the alleles from the vcfR object as numeric representations of the underlying nucleotide/sequence. Unfortunately, poppr interprets zeroes as placeholders for loci with ambiguous allelic doesage, which creates weird situations with these data.I will try to find a way to steel poppr against this, but in the meantime, I was wondering if you would be willing to add a
return.alleles = TRUE
option tovcfR2genind()
(andvcfR2loci()
).