Closed ripkrizbi closed 6 years ago
This seems to work as well:
y %>%
dplyr::filter_(~stringr::str_detect(x, left_regx)) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, left_regx, "")) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, ">$", "")) %>%
tidyr::separate_("x",
into=c("i", "n", "t", "d", "s", "v"),
sep = "ID=|,Number=|,Type=|,Description=|,Source=|,Version=",
fill = "right",
remove = FALSE) %>%
dplyr::mutate_(Tag = ~tag,
ID = ~i,
Number = ~n,
Type = ~t,
Description = ~d %>% stringr::str_replace_all("\"", ""),
Source = ~s,
Version = ~v) %>%
dplyr::select_(~Tag, ~ID, ~Number, ~Type, ~Description, ~Source, ~Version)
Hi @ripkrizbi,
Formation of the meta-information lines in VCF 4.3 is covered in section 1.4. I do not see any indication that the inclusion of commas would be considered to be malformed. And it appears to be a reasonable request. I'll try to address this. Your suggestions for solutions is greatly appreciated! Thank you for bringing this issue to my attention.
Brian
Here is a reproducible example.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.6.0.9000
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
data("vcfR_test")
# Expected behavior.
vcf_field_names(vcfR_test)
#> # A tibble: 6 x 7
#> Tag ID Number Type Description Source
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 INFO NS 1 Integer Number of Samples With Data <NA>
#> 2 INFO DP 1 Integer Total Depth <NA>
#> 3 INFO AF A Float Allele Frequency <NA>
#> 4 INFO AA 1 String Ancestral Allele <NA>
#> 5 INFO DB 0 Flag dbSNP membership " build 129\""
#> 6 INFO H2 0 Flag HapMap2 membership <NA>
#> # ... with 1 more variables: Version <chr>
myMeta <- vcfR_test@meta
vcfR_test@meta <- c(myMeta[1:12], '##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">', myMeta[13:18])
# First example of unexpected behavior.
vcf_field_names(vcfR_test)
#> # A tibble: 7 x 7
#> Tag ID Number Type Description
#> <chr> <chr> <chr> <chr> <chr>
#> 1 INFO NS 1 Integer Number of Samples With Data
#> 2 INFO DP 1 Integer Total Depth
#> 3 INFO AF A Float Allele Frequency
#> 4 INFO AA 1 String Ancestral Allele
#> 5 INFO DB 0 Flag dbSNP membership
#> 6 INFO H2 0 Flag HapMap2 membership
#> 7 INFO AF A Float Estimated allele frequency in the range (0
#> # ... with 2 more variables: Source <chr>, Version <chr>
vcfR_test@meta[13] <- '##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">'
# Second example of unexpected behavior.
vcf_field_names(vcfR_test)
#> Warning: Too many values at 1 locations: 7
#> # A tibble: 7 x 7
#> Tag ID Number Type Description Source
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 INFO NS 1 Integer Number of Samples With Data <NA>
#> 2 INFO DP 1 Integer Total Depth <NA>
#> 3 INFO AF A Float Allele Frequency <NA>
#> 4 INFO AA 1 String Ancestral Allele <NA>
#> 5 INFO DB 0 Flag dbSNP membership " build 129\""
#> 6 INFO H2 0 Flag HapMap2 membership <NA>
#> 7 INFO TYPE A String The type of allele either snp
#> # ... with 1 more variables: Version <chr>
Created on 2018-02-01 by the reprex package (v0.1.1.9000).
The first example delimits record seven on the comma leaving it incomplete. The second example throws a warning. Perhaps this because it is also delimiting on the comma and has perhaps ended up with too many items for the container? This appears to be the reported behavior.
I'm not a fan of using read.table()
because of performance issues. Trying to import VCF files with read.table()
was one of the inspirations for vcfR. Although, this issue is specific to the meta region, so performance may not be as critical here.
Your second suggestion seems to work for the presented example. But I'm worried that it is not flexible.
This SO post suggests we can use scan()
to address this problem. I think this is very similar to the read.table()
solution, but scan()
should be much faster.
myString <- "TYPE,Number=A,Type=String,Description=\"The type of allele, either snp, mnp, ins, del, or complex.\""
scan(text=myString, what="character", sep=",")
[1] "TYPE"
[2] "Number=A"
[3] "Type=String"
[4] "Description=The type of allele, either snp, mnp, ins, del, or complex."
I like this solution. Unless @ripkrizbi has any strong opinions I think this should be implemented.
Hi @knausb -
many thanks for looking into it! No strong opinions here, I'm in favor of any solution that gets the job done :)
scan()
gives satisfactory output. The only thing I did not check is how both read.table()
and scan()
behave with unequal field counts in different header lines (say, one contains a "Source" field while others do not etc.).
K
You are at least two steps ahead of me on this :) You are correct that scan()
will have problems with unequal field counts. And review of our documentation has reminded me that vcf_field_names()
is only intended to manage INFO
or FORMAT
metadata. So in my previous comment I was trying too hard to be flexible. Your comment here is actually quite elegant. I've added it to the devel branch at 2fd4dc062e9fffeee3aa37bcea398e701d862560. Output now appears as follows.
> data("vcfR_test")
> myMeta <- vcfR_test@meta
> vcfR_test@meta <- c(myMeta[1:12], '##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">', myMeta[13:18])
> vcf_field_names(vcfR_test, tag = "INFO")
# A tibble: 7 x 7
Tag ID Number Type Description Source Version
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 INFO NS 1 Integer Number of Samples With Data <NA> <NA>
2 INFO DP 1 Integer Total Depth <NA> <NA>
3 INFO AF A Float Allele Frequency <NA> <NA>
4 INFO AA 1 String Ancestral Allele <NA> <NA>
5 INFO DB 0 Flag dbSNP membership, build 129 <NA> <NA>
6 INFO H2 0 Flag HapMap2 membership <NA> <NA>
7 INFO AF A Float Estimated allele frequency in the range (0,1] <NA> <NA>
And.
> vcfR_test@meta[13] <- '##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">'
> vcf_field_names(vcfR_test)
# A tibble: 7 x 7
Tag ID Number Type Description Source Version
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 INFO NS 1 Integer Number of Samples With Data <NA> <NA>
2 INFO DP 1 Integer Total Depth <NA> <NA>
3 INFO AF A Float Allele Frequency <NA> <NA>
4 INFO AA 1 String Ancestral Allele <NA> <NA>
5 INFO DB 0 Flag dbSNP membership, build 129 <NA> <NA>
6 INFO H2 0 Flag HapMap2 membership <NA> <NA>
7 INFO TYPE A String The type of allele, either snp, mnp, ins, del, or complex. <NA> <NA>
I'll work on getting that on the master branch. But you appear to have found something else I should address first. Thank you!
Thanks Brian @knausb ! Ahem, actually I have another suggestion on this one :) This may be completely paranoid and unnecessary, but my previous regex will not work if the field order is inverted. Again, I may have looked it up in the spec before worrying about it, but...
So, first I tried playing with tidyr::extract()
but it did not produce desired effect. Then I came up with the code below.
This is a bit better solution, albeit (considerably) slower. It tokenizes each record and extracts fields and values from it and then compiles the metadata. I'm sure the tokenizer can be improved a bit to boost the performance (the whole list manipulation after string extraction could be better).
Or, just disregard this entire post in case the field order is fixed or you do not like the idea :smile:
stringi
is used by stringr
so no new libraries needed.
Here is the complete function:
vcf_field_names <- function(x, tag = "INFO") {
if(class(x) != "vcfR") stop("Expecting x to be a vcfR object, not a ", class(x))
x <- x@meta
y <- dplyr::data_frame(x = x) # make a data frame of it for easy manipulation
left_regx <- paste("^##", tag, "=<", sep = "") # regex to match and replace
field_regx <- "(ID=[^,]*)|(Number=[^,]*)|(Type=[^,]*)|(Description=\"[^\"]*\")|(Source=[^,]*)|(Version=[^,]*)"
tokenize_record <- function(x){
x %>%
stringi::stri_extract_all_regex(field_regx) %>%
magrittr::set_names("a") %>%
.$a %>%
tibble::tibble() %>%
tidyr::separate(1, into = c("key", "val"), "(?<=\\w)=") %>% # look-behind needed to avoid '=' in Desc fields
tidyr::spread(key, val)
}
options("dplyr.show_progress" = FALSE)
fieldnames <- c("ID", "Number", "Type", "Description", "Source", "Version")
z <- y %>%
dplyr::filter_(~stringr::str_detect(x, left_regx)) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, left_regx, "")) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, ">$", "")) %>%
dplyr::rowwise() %>%
dplyr::do(tokenize_record(.))
z[setdiff(fieldnames, names(z))] <- NA
z$Tag <- tag
z %>%
dplyr::mutate_(Description = ~Description %>% stringr::str_replace_all("\"", "")) %>%
dplyr::select_(~Tag, ~ID, ~Number, ~Type, ~Description, ~Source, ~Version)
}
@ripkrizbi your fix is on the master branch now. You can give it a try if you'd like.
Many thanks! Will do ASAP and report back!
Oops, I think we're out of order. I have not added this comment yet. Just your previous suggestions. I still need to consider your most recent suggestion. I struggle a bit with regex, but through struggling I learn. Just wanted to make sure you knew where I was at.
All clear. Two very nice sites to test and debug regexes are https://regex101.com/ and https://www.debuggex.com/
In the VCF specification, section 1.4 covers the Meta-information lines, section 1.4.2 covers the Information lines, and section 1.4.4 covers the format field. While the examples provided all have the same order of keys in the strings, there is nothing explicitly stated about the order except that the default key are first and optional keys should follow them. So I see this as an ambiguity in the specification. If we can help the user out here I think we should. However, if we're being paranoid, the specification does not appear to limit the number of optional keys. So the creator of VCF files appears free to invent new keys. This is something the existing code does not appear to address and neither does the proposed code. While I feel this issue should be addressed, I do not feel we have an adequate solution.
I've come up with my own solution, borrowing on the existing code and the proposed code. We can include tthe proposed code as vcf_field_names2()
and my solution as vcf_field_names3
.
vcf_field_names2 <- function(x, tag = "INFO") {
if(class(x) != "vcfR") stop("Expecting x to be a vcfR object, not a ", class(x))
x <- x@meta
y <- dplyr::data_frame(x = x) # make a data frame of it for easy manipulation
left_regx <- paste("^##", tag, "=<", sep = "") # regex to match and replace
field_regx <- "(ID=[^,]*)|(Number=[^,]*)|(Type=[^,]*)|(Description=\"[^\"]*\")|(Source=[^,]*)|(Version=[^,]*)"
tokenize_record <- function(x){
x %>%
stringi::stri_extract_all_regex(field_regx) %>%
magrittr::set_names("a") %>%
.$a %>%
tibble::tibble() %>%
tidyr::separate(1, into = c("key", "val"), "(?<=\\w)=") %>% # look-behind needed to avoid '=' in Desc fields
tidyr::spread(key, val)
}
options("dplyr.show_progress" = FALSE)
fieldnames <- c("ID", "Number", "Type", "Description", "Source", "Version")
z <- y %>%
dplyr::filter_(~stringr::str_detect(x, left_regx)) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, left_regx, "")) %>%
dplyr::mutate_(x = ~stringr::str_replace(x, ">$", "")) %>%
dplyr::rowwise() %>%
dplyr::do(tokenize_record(.))
z[setdiff(fieldnames, names(z))] <- NA
z$Tag <- tag
z %>%
dplyr::mutate_(Description = ~Description %>% stringr::str_replace_all("\"", "")) %>%
dplyr::select_(~Tag, ~ID, ~Number, ~Type, ~Description, ~Source, ~Version)
}
vcf_field_names3 <- function(x, tag = "INFO"){
if(class(x) != "vcfR") stop("Expecting x to be a vcfR object, not a ", class(x))
if( tag != 'INFO' & tag != 'FORMAT') stop("Expecting tag to either be INFO or FORMAT")
# Subset to tag.
x <- x@meta
left_regx <- paste("^##", tag, "=<", sep = "") # regex to match and replace
x <- x[grep(left_regx, x)]
# Clean up the string ends.
x <- sub(left_regx, "", x)
x <- sub(">$", "", x)
# Delimit on quote protected commas.
x <- lapply(x, function(x){scan(text=x, what="character", sep=",", quiet = TRUE)})
# Get unique keys.
myKeys <- unique(unlist(lapply(strsplit(unlist(x), split = "="), function(x){x[1]})))
# Omit default keys so we can make them first.
myKeys <- grep("^ID$|^Number$|^Type$|^Description$", myKeys, invert = TRUE, value = TRUE)
myKeys <- c("ID", "Number", "Type", "Description", myKeys)
myReturn <- data.frame(matrix(ncol=length(myKeys) + 1, nrow=length(x)))
colnames(myReturn) <- c("Tag", myKeys)
myReturn[,'Tag'] <- tag
getValue <- function(x){
myValue <- grep(paste("^", myKeys[i], "=", sep=""), x, value = TRUE)
if(length(myValue) == 0){
is.na(myValue) <- TRUE
} else {
myValue <- sub(".*=", "", myValue)
}
myValue
}
for(i in 1:length(myKeys)){
myReturn[,i+1] <- unlist(lapply(x, function(x){ getValue(x) }))
}
tibble::as_tibble(myReturn)
}
We can benchmark all three putative solutions.
> data("vcfR_example")
> vcf
***** Object of Class vcfR *****
18 samples
1 CHROMs
2,533 variants
Object size: 2.9 Mb
8.497 percent missing data
***** ***** *****
> microbenchmark::microbenchmark(vcf_field_names(vcf), times = 100)
Unit: milliseconds
expr min lq mean median uq max neval
vcf_field_names(vcf) 21.82124 23.06671 27.06199 23.97198 25.88153 129.0077 100
> microbenchmark::microbenchmark(vcf_field_names2(vcf), times = 100)
Unit: milliseconds
expr min lq mean median uq max neval
vcf_field_names2(vcf) 219.2579 224.4725 232.9412 228.8107 235.8745 327.8169 100
> microbenchmark::microbenchmark(vcf_field_names3(vcf), times = 100)
Unit: milliseconds
expr min lq mean median uq max neval
vcf_field_names3(vcf) 2.689925 2.767529 4.018665 2.80406 2.877645 100.8496 100
So I think my solution addresses previous concerns about quote protected commas, the order of keys as we as addressing non-default keys.
data("vcfR_test")
> vcf_field_names3(vcfR_test)
# A tibble: 6 x 5
Tag ID Number Type Description
<chr> <chr> <chr> <chr> <chr>
1 INFO NS 1 Integer Number of Samples With Data
2 INFO DP 1 Integer Total Depth
3 INFO AF A Float Allele Frequency
4 INFO AA 1 String Ancestral Allele
5 INFO DB 0 Flag dbSNP membership, build 129
6 INFO H2 0 Flag HapMap2 membership
> vcfR_test@meta[12] <- "##INFO=<Number=0,ID=H2,Type=Flag,Description=\"HapMap2 membership\",myKey=\"My own key, with a comma\">"
> vcf_field_names3(vcfR_test)
# A tibble: 6 x 6
Tag ID Number Type Description myKey
<chr> <chr> <chr> <chr> <chr> <chr>
1 INFO NS 1 Integer Number of Samples With Data <NA>
2 INFO DP 1 Integer Total Depth <NA>
3 INFO AF A Float Allele Frequency <NA>
4 INFO AA 1 String Ancestral Allele <NA>
5 INFO DB 0 Flag dbSNP membership, build 129 <NA>
6 INFO H2 0 Flag HapMap2 membership My own key, with a comma
I'll incorporate this change into the code.
vcf_field_names() now handles keys that are out of order and multiple optional keys ba2d8f149159fd1397bcf4be9d4e9cdaaa013e76
Hi @ripkrizbi, changes are in place on the master branch if you'd like to give them a try. We were actually hoping to get a new version pushed to CRAN last week. But it was delayed by this issue. If you'd like some time to give it a test drive please let me know. Otherwise we may try to get this sent to CRAN tomorrow.
Thanks!
hi @knausb sorry for the delay in responding! The changes work for me, many thanks! You can close this issue now.
Thank you!
Hi,
the helper function
vcf_field_names
appears to have problems parsing the vcf header for fields that contain the comma delimiter as part of Description. Consider these examples:Both of the above fields will be incorrectly parsed, the second one will produce a warning but the first one will not. The reason is that tidyr::separate parses on a simple "," delimiter and these fields will be incorrectly treated.
The parsing problem in turn breaks the vcfr2tidy conversion.
https://github.com/knausb/vcfR/blob/ce9e7bd2329631ef2bdb1154fe1a5cfd5db36eb9/R/vcfR_to_tidy_functions.R#L576-L584
One possible way to get around that would be to use
read.table
to properly ignore commas enclosed in quotes, but I have not tested this solution extensively in all cases and combinations of varying field counts.