Closed fdchevalier closed 6 years ago
Hi @fdchevalier,
Thanks for your interest in vcfR! But I'm afraid I don't feel I understand your request. What sort of behavior are you looking for? As you mentioned, we already have extract.indels()
with behavior as follows.
> data("vcfR_test")
> getFIX(vcfR_test)
CHROM POS ID REF ALT QUAL FILTER
[1,] "20" "14370" "rs6054257" "G" "A" "29" "PASS"
[2,] "20" "17330" NA "T" "A" "3" "q10"
[3,] "20" "1110696" "rs6040355" "A" "G,T" "67" "PASS"
[4,] "20" "1230237" NA "T" NA "47" "PASS"
[5,] "20" "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
> vcf1 <- extract.indels(vcfR_test)
> getFIX(vcf1)
CHROM POS ID REF ALT QUAL FILTER
[1,] "20" "14370" "rs6054257" "G" "A" "29" "PASS"
[2,] "20" "17330" NA "T" "A" "3" "q10"
[3,] "20" "1110696" "rs6040355" "A" "G,T" "67" "PASS"
[4,] "20" "1230237" NA "T" NA "47" "PASS"
> vcf2 <- extract.indels(vcfR_test, return.indels = TRUE)
> getFIX(vcf2)
CHROM POS ID REF ALT QUAL FILTER
"20" "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
It seems to me that we have functionality to sort the variants into either just indels or just SNPs. What new functionality are you looking for? Also note that this provides an example where the ALT allele includes both a deletion and and insertion which seems presents a challenge to sorting insertions from deletions. Have you considered this situation?
Thanks! Brian
Hi Brian,
Sorry for not being clear enough. The extract.indels()
function extracts indels but do not return a logical vector (TRUE
/FALSE
) of rows that contains such indels (like the is.biallelic()
function would do for biallelic sites). For instance, if I generate a data.frame from the gt
slot (let's say extracting allele frequencies), then perform operations on this data.frame and finally filter out indel sites on this data.frame for plotting purpose, I need to identified the rows from the original vcfR object which correspond to indels. By the way, I cannot extract the SNP only and then pull out the allele frequencies because I need the indel information somewhere in my downstream analysis.
For example, here is the is.indel()
function that I made derived from the extract.indels()
:
is.indel <- function (x) {
isIndel <- matrix(FALSE, nrow = nrow(x), ncol = 2)
colnames(isIndel) <- c("REF", "ALT")
isIndel[, "REF"] <- nchar(x@fix[, "REF"]) > 1
checkALT <- function(x) {
x <- stats::na.omit(x)
x <- x[x != "<NON_REF>"]
if (length(x) > 0) {
max(nchar(x)) > 1
}
else {
FALSE
}
}
isIndel[, "ALT"] <- unlist(lapply(strsplit(x@fix[, "ALT"],
split = ","), checkALT))
mask <- rowSums(isIndel) > 0
return(mask)
}
If we apply this to vcfR_test data:
data("vcfR_test")
is.indel(vcfR_test)
# FALSE FALSE FALSE FALSE TRUE
Hope this is clearer.
By the way, thank you for the example with insertion and deletion at the same site, I did not think about such example.
Fred
Hi @fdchevalier,
I've added is.indel()
to vcfR. If you'd like to give it a try you can install the GitHub version as follows.
devtools::install_github(repo="knausb/vcfR")
You will need a compiler to install the GitHub version.
This will be in vcfR 1.9.0. However, we recently released on CRAN so it may be a few months before this shows up on CRAN. If (and when if you want to try it out) you're happy with this please close the issue.
Thanks! Brian
Hi Brian,
I have tried the github version and the new function works as expected. Thank you for your efficiency.
Fred
Hello everyone,
First of, thank you for this package which makes working with VCF files in R very easy.
I used it quite intensively these past days and found that one function was missing:
is.indel
. I often used theis.biallelic
function to filter tables generated from VCF files and I found surprising that a similar function did not exist for indels.Looking at the source code, it would be easy to create it because most of the code is already present in the
extract.indels
function. Thisis.indel
function could even allow selection of either insertions or deletions only.I see how to do it with minimal changes in the code and I am ready to make a PR if you deem this useful.