Closed kdm9 closed 3 years ago
Ah dear, I was worried that might come back to bite me. It looks though like just putting quotes around the -r
string will fix it? That's easy enough to do...
...until some clever bugger uses both |
and '
in a contig name :smile:
but yes, that would certainly fix the immediate issue. Better yet, wrap the region in the R equivalent of python's shlex.quote()
.
Out of interest, the following seems to be working for me as a replacement window function minter that uses windowlickr
windowlickr_windowfun <- function (file, size=20000, samples=NULL, ...) {
if (is.null(samples)) { samples = windowlickr:::bcf_getSamples(file) }
windows = windowlickr:::bcf_getWindows(file, windowsize=size, slide=size)
pos.fn <- function(n, ...) {
windows[n,] %>%
dplyr::select(chrom=contig, start, end=stop)
}
win.fn <- function (n,...) {
if (n > nrow(windows)) stop(paste0("No such window: ", n))
region = windows[n, "region"]
ret = windowlickr:::readBCFQuery_(file, region, samples)
nsnp = length(ret$POS)
if (nsnp < 1) {
return(NULL)
}
GT = matrix(unlist(ret$GT, recursive = F), nrow=nsnp, byrow = T)
return(GT)
}
attr(win.fn,"max.n") <- nrow(windows)
attr(win.fn,"region") <- pos.fn
attr(win.fn,"samples") <- samples
class(win.fn) <- c("winfun", "function")
return(win.fn)
}
...until some clever bugger uses both | and ' in a contig name smile
Right, well...
A short google didn't turn up the R function to do the quoting, do you happen to know what it is?
A short google didn't turn up the R function to do the quoting, do you happen to know what it is?
Perhaps shQuote
? I'm more a python guy so i've not used system() and friends in R much before, i'm not sure if that function fully does all we need.
NB: any solution like shquote should be applied to all other bits of the command line too, namely sample names and file names (at least)
I"ve switched #21 over to using shQuote
and got github CI to automatically run the tests! Assuming it passes the tests (did on my computer) I'll merge it in, and perhaps you could verify it fixes your problem?
Hi Peter,
Thanks for another great tool. I'm having issues running lostruct with a new dataset that has contig names along the lines of
${numeric_id}|${assembly}
, e.g.32351|arrow
. This breaks the logic used to extract variant data as the pipe is unescaped in the call tosystem()
, and therefore the bcftools query command gets piped into the (thankfully non-existant)arrow
command (i.e.bcftools query -f '[ %GT]\n' -r 000000F|arrow:280707-300706 -s KA17BG005,Cala2,...
).While "then don't use | in your contig names" is probably valid advice, I'm not able to change it in this dataset without breaking a lot of other people's work, so I'll have to come up with a more robust workaround.
I know of the python implementation, whose use of cyvcf2 should fix this, though my impression was that it's not quite ready for production use. Is that correct?
As it happens, I have an R library called WindowLickR that does exactly this type of V/BCF query using an Rcpp version of the logic you have here, calling htslib rather than
system("bcftools ...")
. For now I will hack something together myself using windowlickr to provide a "window function" that takesn
and returns the genotypes as your code would. I'm happy to try fixing the issue for everyone by upstreaming this solution using windowlicker, if a dependency on that is acceptable to you.Best, Kevin