knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

Can we have vcfR2genind return alleles by option/default? #115

Closed zkamvar closed 6 years ago

zkamvar commented 6 years ago

Today, weird behavior was observed in poppr due to a genind object created from vcfR2genind(). This was due to the fact that vcfR2genind() will export the alleles from the vcfR object as numeric representations of the underlying nucleotide/sequence. Unfortunately, poppr interprets zeroes as placeholders for loci with ambiguous allelic doesage, which creates weird situations with these data.

I will try to find a way to steel poppr against this, but in the meantime, I was wondering if you would be willing to add a return.alleles = TRUE option to vcfR2genind() (and vcfR2loci()).

knausb commented 6 years ago

Hi Zhian, Let's separate this into the issue and proposed solutions. The issue is in how zeros are interpreted by various software. In the VCF specification, zero is the reference allele and should be fairly common. In poppr zeros are loci with ambiguous allelic dosage. So we need to handle this as we transfer VCF data to genind objects. Does this sound correct?

A possible solution would be to add one to each allele. This would convert the allele information from a zero based system to a one based system. I feel this may be a better solution, but I don't know of a good way to implement it. Because each genotype would have to be converted to alleles, add the one, and converted back to a genotype, this could incur a computationally intensive step. But if you're using a genind object you might not have a large dataset so I may be worrying about nothing.

Your solution appears to avoid this situation by using the alleles as opposed to the VCF genotypes. I'm a bit hesitant to implement this because I worry about performance. But I tend to worry about performance before it actually is an issue. I'll work on this and get back to you.

zkamvar commented 6 years ago

My perspective here is that it would ultimately be better for the user to return the actual alleles instead of the 0/1 representation.

A couple of reasons why I think so:

  1. It avoids this unexpected error with popper until I figure out how to fix it.
  2. It’s more consistent with the way pegas handles VCF data—they import the alleles explicitly.
  3. It makes more intuitive sense to the users. Adding the aspect of having them use a lookup table to determine what 0 or 1 (or 2...) means can lead to human error.

The only downside here is the performance cost. I would argue that it’s negligible even if the performance is cut in half for the following reasons

  1. For most users, this step would be performed once.
  2. The users should probably be converting to genlight.
  3. The benefits listed above.

Sent from my iPhone

On Sep 27, 2018, at 00:51, Brian Knaus notifications@github.com wrote:

Hi Zhian, Let's separate this into the issue and proposed solutions. The issue is in how zeros are interpreted by various software. In the VCF specification, zero is the reference allele and should be fairly common. In poppr zeros are loci with ambiguous allelic dosage. So we need to handle this as we transfer VCF data to genind objects. Does this sound correct?

A possible solution would be to add one to each allele. This would convert the allele information from a zero based system to a one based system. I feel this may be a better solution, but I don't know of a good way to implement it. Because each genotype would have to be converted to alleles, add the one, and converted back to a genotype, this could incur a computationally intensive step. But if you're using a genind object you might not have a large dataset so I may be worrying about nothing.

Your solution appears to avoid this situation by using the alleles as opposed to the VCF genotypes. I'm a bit hesitant to implement this because I worry about performance. But I tend to worry about performance before it actually is an issue. I'll work on this and get back to you.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

zkamvar commented 6 years ago

Also, if you are uncomfortable changing the default behavior, I would suggest adding it as an extra argument so that the user has the choice of whether or not to incur the performance cost.

knausb commented 6 years ago

I've made the changes at 5784d648052c95763bbb384da0ba471d920d9d61. The functions vcfR2genind() and vcfR2loci() now have the parameter return.alleles=FALSE. I think you make a good argument about requiring people to look up what the alleles are. But I've tried to maintain data as 'VCF-like' as possible. And its always easier to do nothing than something. Can you live with that?

The changes are now on master and you can give it a try if you'd like.

devtools::install_github(repo="knausb/vcfR")

Thanks!

zkamvar commented 6 years ago

Thank you for making the change! I would still argue that the default to return.alleles = TRUE would be a more sensible default for the reasons I mentioned earlier, but it's not necessarily a hill that I would die on (though I may add a PR to update the documentation with a (not run) example showing why one will want to use return.alleles = TRUE)