knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

is.biallelic not working the way you would expect it #142

Closed ckastall closed 5 years ago

ckastall commented 5 years ago

Hi, thanks for the work on vcfR!

I noticed a small error using is.biallelic: the description of the function says it queries the gt slot (as does is.polymorphic), but contrary to is.polymorphic, the function is.biallelic tests the fix slot. At the very least the description of is.biallelic in the helper should reflect clearly the fact that it does not look at the actual gt data. A better (and a lot more useful) correction would be to make is.biallelic really look at the gt table.

I'm using version 1.8.0 (fresh install from CRAN). Best

knausb commented 5 years ago

Hi @ckastall , you are correct and I now see how that could cause confusion. I've added the following to my development version.

#' Note that \strong{is_bialleleic} queries the ALT column in the fix slot to count alleles.
#' If you remove samples from the gt slot you may invalidate the information in the fix slot.
#' For example, if you remove the samples with the alternate allele you will make the position invariant and this function will provide inaccurate information.
#' So use caution if you've made many modifications to your data.

Does that make it more clear?

I agree that philosophically it would be better to check the actual genotypes. However, I'm worried that with large datasets that would come with a big cost in performance. Thanks!

ckastall commented 5 years ago

Yup, that's good with me.

Do note however that sub sampling is not the only case where this problem might arise, eg if the reference allele is one that is different from all of the mapped samples. This should be pretty rare if you map samples to a reference genome of the same species (and close lineage), but when you map samples of species A to reference genome of species B, this becomes a noticeable issue. You will have SNPs fixed in species A falsely identified as bi.allelic (actually monomorphic positions in species A) and bi-allelic SNPs in species A that appear like having 3 possible alleles in the vcf file.

I agree that checking all gt fields would be prohibitive in the function is.biallelic. Maybe an option to check for additional information could be added to read.vcfR?

Anyway, I'll close this issue, thanks for the quick reaction!