vcf_field_names incorrectly parses fields that contain comma in quoted strings

ripkrizbi commented 6 years ago

Hi,

the helper function vcf_field_names appears to have problems parsing the vcf header for fields that contain the comma delimiter as part of Description. Consider these examples:

##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">

Both of the above fields will be incorrectly parsed, the second one will produce a warning but the first one will not. The reason is that tidyr::separate parses on a simple "," delimiter and these fields will be incorrectly treated.

The parsing problem in turn breaks the vcfr2tidy conversion.

https://github.com/knausb/vcfR/blob/ce9e7bd2329631ef2bdb1154fe1a5cfd5db36eb9/R/vcfR_to_tidy_functions.R#L576-L584

One possible way to get around that would be to use read.table to properly ignore commas enclosed in quotes, but I have not tested this solution extensively in all cases and combinations of varying field counts.

z <- y %>%
dplyr::filter_(~stringr::str_detect(x, left_regx)) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, left_regx, "")) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, ">$", ""))

y <- read.table(text = z$x, sep = ",")

...

ripkrizbi commented 6 years ago

This seems to work as well:

y %>%
  dplyr::filter_(~stringr::str_detect(x, left_regx)) %>%
  dplyr::mutate_(x = ~stringr::str_replace(x, left_regx, "")) %>%
  dplyr::mutate_(x = ~stringr::str_replace(x, ">$", "")) %>%
  tidyr::separate_("x", 
                   into=c("i", "n", "t", "d", "s", "v"), 
                   sep = "ID=|,Number=|,Type=|,Description=|,Source=|,Version=", 
                   fill = "right",
                   remove = FALSE) %>%
  dplyr::mutate_(Tag = ~tag,
                 ID = ~i,
                 Number = ~n,
                 Type = ~t,
                 Description = ~d %>% stringr::str_replace_all("\"", ""),
                 Source = ~s,
                 Version = ~v) %>%
  dplyr::select_(~Tag, ~ID, ~Number, ~Type, ~Description, ~Source, ~Version)

knausb commented 6 years ago

Hi @ripkrizbi,

Formation of the meta-information lines in VCF 4.3 is covered in section 1.4. I do not see any indication that the inclusion of commas would be considered to be malformed. And it appears to be a reasonable request. I'll try to address this. Your suggestions for solutions is greatly appreciated! Thank you for bringing this issue to my attention.

Brian

knausb commented 6 years ago

Here is a reproducible example.


library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.6.0.9000 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****

data("vcfR_test")
# Expected behavior.
vcf_field_names(vcfR_test)
#> # A tibble: 6 x 7
#>     Tag    ID Number    Type                 Description         Source
#>   <chr> <chr>  <chr>   <chr>                       <chr>          <chr>
#> 1  INFO    NS      1 Integer Number of Samples With Data           <NA>
#> 2  INFO    DP      1 Integer                 Total Depth           <NA>
#> 3  INFO    AF      A   Float            Allele Frequency           <NA>
#> 4  INFO    AA      1  String            Ancestral Allele           <NA>
#> 5  INFO    DB      0    Flag            dbSNP membership " build 129\""
#> 6  INFO    H2      0    Flag          HapMap2 membership           <NA>
#> # ... with 1 more variables: Version <chr>

myMeta <- vcfR_test@meta
vcfR_test@meta <- c(myMeta[1:12], '##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">', myMeta[13:18])
# First example of unexpected behavior.
vcf_field_names(vcfR_test)
#> # A tibble: 7 x 7
#>     Tag    ID Number    Type                                Description
#>   <chr> <chr>  <chr>   <chr>                                      <chr>
#> 1  INFO    NS      1 Integer                Number of Samples With Data
#> 2  INFO    DP      1 Integer                                Total Depth
#> 3  INFO    AF      A   Float                           Allele Frequency
#> 4  INFO    AA      1  String                           Ancestral Allele
#> 5  INFO    DB      0    Flag                           dbSNP membership
#> 6  INFO    H2      0    Flag                         HapMap2 membership
#> 7  INFO    AF      A   Float Estimated allele frequency in the range (0
#> # ... with 2 more variables: Source <chr>, Version <chr>

vcfR_test@meta[13] <- '##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">'
# Second example of unexpected behavior.
vcf_field_names(vcfR_test)
#> Warning: Too many values at 1 locations: 7
#> # A tibble: 7 x 7
#>     Tag    ID Number    Type                 Description         Source
#>   <chr> <chr>  <chr>   <chr>                       <chr>          <chr>
#> 1  INFO    NS      1 Integer Number of Samples With Data           <NA>
#> 2  INFO    DP      1 Integer                 Total Depth           <NA>
#> 3  INFO    AF      A   Float            Allele Frequency           <NA>
#> 4  INFO    AA      1  String            Ancestral Allele           <NA>
#> 5  INFO    DB      0    Flag            dbSNP membership " build 129\""
#> 6  INFO    H2      0    Flag          HapMap2 membership           <NA>
#> 7  INFO  TYPE      A  String          The type of allele     either snp
#> # ... with 1 more variables: Version <chr>

Created on 2018-02-01 by the reprex package (v0.1.1.9000).

The first example delimits record seven on the comma leaving it incomplete. The second example throws a warning. Perhaps this because it is also delimiting on the comma and has perhaps ended up with too many items for the container? This appears to be the reported behavior.

knausb commented 6 years ago

I'm not a fan of using read.table() because of performance issues. Trying to import VCF files with read.table() was one of the inspirations for vcfR. Although, this issue is specific to the meta region, so performance may not be as critical here.

Your second suggestion seems to work for the presented example. But I'm worried that it is not flexible.

This SO post suggests we can use scan() to address this problem. I think this is very similar to the read.table() solution, but scan() should be much faster.

myString <- "TYPE,Number=A,Type=String,Description=\"The type of allele, either snp, mnp, ins, del, or complex.\""
scan(text=myString, what="character", sep=",")
[1] "TYPE"                                                                  
[2] "Number=A"                                                              
[3] "Type=String"                                                           
[4] "Description=The type of allele, either snp, mnp, ins, del, or complex."

I like this solution. Unless @ripkrizbi has any strong opinions I think this should be implemented.

ripkrizbi commented 6 years ago

Hi @knausb - many thanks for looking into it! No strong opinions here, I'm in favor of any solution that gets the job done :) scan() gives satisfactory output. The only thing I did not check is how both read.table() and scan() behave with unequal field counts in different header lines (say, one contains a "Source" field while others do not etc.). K

knausb commented 6 years ago

You are at least two steps ahead of me on this :) You are correct that scan() will have problems with unequal field counts. And review of our documentation has reminded me that vcf_field_names() is only intended to manage INFO or FORMAT metadata. So in my previous comment I was trying too hard to be flexible. Your comment here is actually quite elegant. I've added it to the devel branch at 2fd4dc062e9fffeee3aa37bcea398e701d862560. Output now appears as follows.

> data("vcfR_test")
> myMeta <- vcfR_test@meta
> vcfR_test@meta <- c(myMeta[1:12], '##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">', myMeta[13:18])
> vcf_field_names(vcfR_test, tag = "INFO")
# A tibble: 7 x 7
    Tag    ID Number    Type                                   Description Source Version
  <chr> <chr>  <chr>   <chr>                                         <chr>  <chr>   <chr>
1  INFO    NS      1 Integer                   Number of Samples With Data   <NA>    <NA>
2  INFO    DP      1 Integer                                   Total Depth   <NA>    <NA>
3  INFO    AF      A   Float                              Allele Frequency   <NA>    <NA>
4  INFO    AA      1  String                              Ancestral Allele   <NA>    <NA>
5  INFO    DB      0    Flag                   dbSNP membership, build 129   <NA>    <NA>
6  INFO    H2      0    Flag                            HapMap2 membership   <NA>    <NA>
7  INFO    AF      A   Float Estimated allele frequency in the range (0,1]   <NA>    <NA>

And.

> vcfR_test@meta[13] <- '##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">'
> vcf_field_names(vcfR_test)
# A tibble: 7 x 7
    Tag    ID Number    Type                                                Description Source Version
  <chr> <chr>  <chr>   <chr>                                                      <chr>  <chr>   <chr>
1  INFO    NS      1 Integer                                Number of Samples With Data   <NA>    <NA>
2  INFO    DP      1 Integer                                                Total Depth   <NA>    <NA>
3  INFO    AF      A   Float                                           Allele Frequency   <NA>    <NA>
4  INFO    AA      1  String                                           Ancestral Allele   <NA>    <NA>
5  INFO    DB      0    Flag                                dbSNP membership, build 129   <NA>    <NA>
6  INFO    H2      0    Flag                                         HapMap2 membership   <NA>    <NA>
7  INFO  TYPE      A  String The type of allele, either snp, mnp, ins, del, or complex.   <NA>    <NA>

I'll work on getting that on the master branch. But you appear to have found something else I should address first. Thank you!

ripkrizbi commented 6 years ago

Thanks Brian @knausb ! Ahem, actually I have another suggestion on this one :) This may be completely paranoid and unnecessary, but my previous regex will not work if the field order is inverted. Again, I may have looked it up in the spec before worrying about it, but...

So, first I tried playing with tidyr::extract() but it did not produce desired effect. Then I came up with the code below.

This is a bit better solution, albeit (considerably) slower. It tokenizes each record and extracts fields and values from it and then compiles the metadata. I'm sure the tokenizer can be improved a bit to boost the performance (the whole list manipulation after string extraction could be better).

Or, just disregard this entire post in case the field order is fixed or you do not like the idea :smile:

stringi is used by stringr so no new libraries needed.

Here is the complete function:

vcf_field_names <- function(x, tag = "INFO") {

  if(class(x) != "vcfR") stop("Expecting x to be a vcfR object, not a ", class(x))

  x <- x@meta
  y <- dplyr::data_frame(x = x)  # make a data frame of it for easy manipulation

  left_regx <- paste("^##", tag, "=<", sep = "")  # regex to match and replace 

  field_regx <- "(ID=[^,]*)|(Number=[^,]*)|(Type=[^,]*)|(Description=\"[^\"]*\")|(Source=[^,]*)|(Version=[^,]*)"
  tokenize_record <- function(x){
    x %>%
      stringi::stri_extract_all_regex(field_regx) %>% 
      magrittr::set_names("a") %>% 
      .$a %>% 
      tibble::tibble() %>% 
      tidyr::separate(1, into = c("key", "val"), "(?<=\\w)=") %>% # look-behind needed to avoid '=' in Desc fields
      tidyr::spread(key, val)
  }

  options("dplyr.show_progress" = FALSE)
  fieldnames <- c("ID", "Number", "Type", "Description", "Source", "Version")
  z <- y %>%
    dplyr::filter_(~stringr::str_detect(x, left_regx)) %>%
    dplyr::mutate_(x = ~stringr::str_replace(x, left_regx, "")) %>%
    dplyr::mutate_(x = ~stringr::str_replace(x, ">$", "")) %>%
    dplyr::rowwise() %>%
    dplyr::do(tokenize_record(.))
  z[setdiff(fieldnames, names(z))] <- NA 
  z$Tag <- tag
  z %>%
    dplyr::mutate_(Description = ~Description %>% stringr::str_replace_all("\"", "")) %>%
    dplyr::select_(~Tag, ~ID, ~Number, ~Type, ~Description, ~Source, ~Version)

}

knausb commented 6 years ago

@ripkrizbi your fix is on the master branch now. You can give it a try if you'd like.

ripkrizbi commented 6 years ago

Many thanks! Will do ASAP and report back!

knausb commented 6 years ago

Oops, I think we're out of order. I have not added this comment yet. Just your previous suggestions. I still need to consider your most recent suggestion. I struggle a bit with regex, but through struggling I learn. Just wanted to make sure you knew where I was at.

ripkrizbi commented 6 years ago

All clear. Two very nice sites to test and debug regexes are https://regex101.com/ and https://www.debuggex.com/

knausb commented 6 years ago

In the VCF specification, section 1.4 covers the Meta-information lines, section 1.4.2 covers the Information lines, and section 1.4.4 covers the format field. While the examples provided all have the same order of keys in the strings, there is nothing explicitly stated about the order except that the default key are first and optional keys should follow them. So I see this as an ambiguity in the specification. If we can help the user out here I think we should. However, if we're being paranoid, the specification does not appear to limit the number of optional keys. So the creator of VCF files appears free to invent new keys. This is something the existing code does not appear to address and neither does the proposed code. While I feel this issue should be addressed, I do not feel we have an adequate solution.

knausb commented 6 years ago

I've come up with my own solution, borrowing on the existing code and the proposed code. We can include tthe proposed code as vcf_field_names2() and my solution as vcf_field_names3.

vcf_field_names2 <- function(x, tag = "INFO") {

  if(class(x) != "vcfR") stop("Expecting x to be a vcfR object, not a ", class(x))

  x <- x@meta
  y <- dplyr::data_frame(x = x)  # make a data frame of it for easy manipulation

  left_regx <- paste("^##", tag, "=<", sep = "")  # regex to match and replace 

  field_regx <- "(ID=[^,]*)|(Number=[^,]*)|(Type=[^,]*)|(Description=\"[^\"]*\")|(Source=[^,]*)|(Version=[^,]*)"
  tokenize_record <- function(x){
    x %>%
      stringi::stri_extract_all_regex(field_regx) %>% 
      magrittr::set_names("a") %>% 
      .$a %>% 
      tibble::tibble() %>% 
      tidyr::separate(1, into = c("key", "val"), "(?<=\\w)=") %>% # look-behind needed to avoid '=' in Desc fields
      tidyr::spread(key, val)
  }

  options("dplyr.show_progress" = FALSE)
  fieldnames <- c("ID", "Number", "Type", "Description", "Source", "Version")
  z <- y %>%
    dplyr::filter_(~stringr::str_detect(x, left_regx)) %>%
    dplyr::mutate_(x = ~stringr::str_replace(x, left_regx, "")) %>%
    dplyr::mutate_(x = ~stringr::str_replace(x, ">$", "")) %>%
    dplyr::rowwise() %>%
    dplyr::do(tokenize_record(.))
  z[setdiff(fieldnames, names(z))] <- NA 
  z$Tag <- tag
  z %>%
    dplyr::mutate_(Description = ~Description %>% stringr::str_replace_all("\"", "")) %>%
    dplyr::select_(~Tag, ~ID, ~Number, ~Type, ~Description, ~Source, ~Version)
}

vcf_field_names3 <- function(x, tag = "INFO"){
  if(class(x) != "vcfR") stop("Expecting x to be a vcfR object, not a ", class(x))
  if( tag != 'INFO' & tag != 'FORMAT') stop("Expecting tag to either be INFO or FORMAT")

  # Subset to tag.
  x <- x@meta
  left_regx <- paste("^##", tag, "=<", sep = "")  # regex to match and replace 
  x <- x[grep(left_regx, x)]
  # Clean up the string ends.
  x <- sub(left_regx, "", x)
  x <- sub(">$", "", x)

  # Delimit on quote protected commas.
  x <- lapply(x, function(x){scan(text=x, what="character", sep=",", quiet = TRUE)})

  # Get unique keys.
  myKeys <- unique(unlist(lapply(strsplit(unlist(x), split = "="), function(x){x[1]})))
  # Omit default keys so we can make them first.
  myKeys <- grep("^ID$|^Number$|^Type$|^Description$", myKeys, invert = TRUE, value = TRUE)
  myKeys <- c("ID", "Number", "Type", "Description", myKeys)

  myReturn <- data.frame(matrix(ncol=length(myKeys) + 1, nrow=length(x)))
  colnames(myReturn) <- c("Tag", myKeys)
  myReturn[,'Tag'] <- tag
  getValue <- function(x){
    myValue <- grep(paste("^", myKeys[i], "=", sep=""), x, value = TRUE)
    if(length(myValue) == 0){
      is.na(myValue) <- TRUE
    } else {
      myValue <- sub(".*=", "", myValue)
    }
    myValue
  }

  for(i in 1:length(myKeys)){
    myReturn[,i+1] <- unlist(lapply(x, function(x){ getValue(x) }))
  }
  tibble::as_tibble(myReturn)
}

We can benchmark all three putative solutions.

> data("vcfR_example")
> vcf
***** Object of Class vcfR *****
18 samples
1 CHROMs
2,533 variants
Object size: 2.9 Mb
8.497 percent missing data
*****        *****         *****
> microbenchmark::microbenchmark(vcf_field_names(vcf), times = 100)
Unit: milliseconds
                 expr      min       lq     mean   median       uq      max neval
 vcf_field_names(vcf) 21.82124 23.06671 27.06199 23.97198 25.88153 129.0077   100
> microbenchmark::microbenchmark(vcf_field_names2(vcf), times = 100)
Unit: milliseconds
                  expr      min       lq     mean   median       uq      max neval
 vcf_field_names2(vcf) 219.2579 224.4725 232.9412 228.8107 235.8745 327.8169   100
> microbenchmark::microbenchmark(vcf_field_names3(vcf), times = 100)
Unit: milliseconds
                  expr      min       lq     mean  median       uq      max neval
 vcf_field_names3(vcf) 2.689925 2.767529 4.018665 2.80406 2.877645 100.8496   100

So I think my solution addresses previous concerns about quote protected commas, the order of keys as we as addressing non-default keys.

data("vcfR_test")
> vcf_field_names3(vcfR_test)
# A tibble: 6 x 5
    Tag    ID Number    Type                 Description
  <chr> <chr>  <chr>   <chr>                       <chr>
1  INFO    NS      1 Integer Number of Samples With Data
2  INFO    DP      1 Integer                 Total Depth
3  INFO    AF      A   Float            Allele Frequency
4  INFO    AA      1  String            Ancestral Allele
5  INFO    DB      0    Flag dbSNP membership, build 129
6  INFO    H2      0    Flag          HapMap2 membership
> vcfR_test@meta[12] <- "##INFO=<Number=0,ID=H2,Type=Flag,Description=\"HapMap2 membership\",myKey=\"My own key, with a comma\">"
> vcf_field_names3(vcfR_test)
# A tibble: 6 x 6
    Tag    ID Number    Type                 Description                    myKey
  <chr> <chr>  <chr>   <chr>                       <chr>                    <chr>
1  INFO    NS      1 Integer Number of Samples With Data                     <NA>
2  INFO    DP      1 Integer                 Total Depth                     <NA>
3  INFO    AF      A   Float            Allele Frequency                     <NA>
4  INFO    AA      1  String            Ancestral Allele                     <NA>
5  INFO    DB      0    Flag dbSNP membership, build 129                     <NA>
6  INFO    H2      0    Flag          HapMap2 membership My own key, with a comma

I'll incorporate this change into the code.

knausb commented 6 years ago

vcf_field_names() now handles keys that are out of order and multiple optional keys ba2d8f149159fd1397bcf4be9d4e9cdaaa013e76

knausb commented 6 years ago

Hi @ripkrizbi, changes are in place on the master branch if you'd like to give them a try. We were actually hoping to get a new version pushed to CRAN last week. But it was delayed by this issue. If you'd like some time to give it a test drive please let me know. Otherwise we may try to get this sent to CRAN tomorrow.

Thanks!

ripkrizbi commented 6 years ago

hi @knausb sorry for the delay in responding! The changes work for me, many thanks! You can close this issue now.

knausb commented 6 years ago

Thank you!

knausb / vcfR

vcf_field_names incorrectly parses fields that contain comma in quoted strings #90