knausb / vcfR

Tools to work with variant call format files
249 stars 54 forks source link

is.indel function #105

Closed fdchevalier closed 6 years ago

fdchevalier commented 6 years ago

Hello everyone,

First of, thank you for this package which makes working with VCF files in R very easy.

I used it quite intensively these past days and found that one function was missing: is.indel. I often used the is.biallelic function to filter tables generated from VCF files and I found surprising that a similar function did not exist for indels.

Looking at the source code, it would be easy to create it because most of the code is already present in the extract.indels function. This is.indel function could even allow selection of either insertions or deletions only.

I see how to do it with minimal changes in the code and I am ready to make a PR if you deem this useful.

knausb commented 6 years ago

Hi @fdchevalier,

Thanks for your interest in vcfR! But I'm afraid I don't feel I understand your request. What sort of behavior are you looking for? As you mentioned, we already have extract.indels() with behavior as follows.

> data("vcfR_test")
> getFIX(vcfR_test)
     CHROM POS       ID          REF   ALT      QUAL FILTER
[1,] "20"  "14370"   "rs6054257" "G"   "A"      "29" "PASS"
[2,] "20"  "17330"   NA          "T"   "A"      "3"  "q10" 
[3,] "20"  "1110696" "rs6040355" "A"   "G,T"    "67" "PASS"
[4,] "20"  "1230237" NA          "T"   NA       "47" "PASS"
[5,] "20"  "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
> vcf1 <- extract.indels(vcfR_test)
> getFIX(vcf1)
     CHROM POS       ID          REF ALT   QUAL FILTER
[1,] "20"  "14370"   "rs6054257" "G" "A"   "29" "PASS"
[2,] "20"  "17330"   NA          "T" "A"   "3"  "q10" 
[3,] "20"  "1110696" "rs6040355" "A" "G,T" "67" "PASS"
[4,] "20"  "1230237" NA          "T" NA    "47" "PASS"
> vcf2 <- extract.indels(vcfR_test, return.indels = TRUE)
> getFIX(vcf2)
      CHROM         POS          ID         REF         ALT        QUAL      FILTER 
       "20"   "1234567" "microsat1"       "GTC"    "G,GTCT"        "50"      "PASS" 

It seems to me that we have functionality to sort the variants into either just indels or just SNPs. What new functionality are you looking for? Also note that this provides an example where the ALT allele includes both a deletion and and insertion which seems presents a challenge to sorting insertions from deletions. Have you considered this situation?

Thanks! Brian

fdchevalier commented 6 years ago

Hi Brian,

Sorry for not being clear enough. The extract.indels() function extracts indels but do not return a logical vector (TRUE/FALSE) of rows that contains such indels (like the is.biallelic() function would do for biallelic sites). For instance, if I generate a data.frame from the gt slot (let's say extracting allele frequencies), then perform operations on this data.frame and finally filter out indel sites on this data.frame for plotting purpose, I need to identified the rows from the original vcfR object which correspond to indels. By the way, I cannot extract the SNP only and then pull out the allele frequencies because I need the indel information somewhere in my downstream analysis.

For example, here is the is.indel() function that I made derived from the extract.indels():

is.indel <- function (x) {
    isIndel <- matrix(FALSE, nrow = nrow(x), ncol = 2)
    colnames(isIndel) <- c("REF", "ALT")
    isIndel[, "REF"] <- nchar(x@fix[, "REF"]) > 1
    checkALT <- function(x) {
        x <- stats::na.omit(x)
        x <- x[x != "<NON_REF>"]
        if (length(x) > 0) {
            max(nchar(x)) > 1
        }
        else {
            FALSE
        }
    }
    isIndel[, "ALT"] <- unlist(lapply(strsplit(x@fix[, "ALT"],
        split = ","), checkALT))
    mask <- rowSums(isIndel) > 0
    return(mask)
}

If we apply this to vcfR_test data:

data("vcfR_test")
is.indel(vcfR_test)
# FALSE FALSE FALSE FALSE  TRUE

Hope this is clearer.

By the way, thank you for the example with insertion and deletion at the same site, I did not think about such example.

Fred

knausb commented 6 years ago

Hi @fdchevalier,

I've added is.indel() to vcfR. If you'd like to give it a try you can install the GitHub version as follows.

devtools::install_github(repo="knausb/vcfR")

You will need a compiler to install the GitHub version.

This will be in vcfR 1.9.0. However, we recently released on CRAN so it may be a few months before this shows up on CRAN. If (and when if you want to try it out) you're happy with this please close the issue.

Thanks! Brian

fdchevalier commented 6 years ago

Hi Brian,

I have tried the github version and the new function works as expected. Thank you for your efficiency.

Fred