petrelharp / local_pca

Methods for examining PCA locally along the genome.
71 stars 13 forks source link

Issues with | in contig names #20

Closed kdm9 closed 3 years ago

kdm9 commented 3 years ago

Hi Peter,

Thanks for another great tool. I'm having issues running lostruct with a new dataset that has contig names along the lines of ${numeric_id}|${assembly}, e.g. 32351|arrow. This breaks the logic used to extract variant data as the pipe is unescaped in the call to system(), and therefore the bcftools query command gets piped into the (thankfully non-existant) arrow command (i.e. bcftools query -f '[ %GT]\n' -r 000000F|arrow:280707-300706 -s KA17BG005,Cala2,...).

While "then don't use | in your contig names" is probably valid advice, I'm not able to change it in this dataset without breaking a lot of other people's work, so I'll have to come up with a more robust workaround.

I know of the python implementation, whose use of cyvcf2 should fix this, though my impression was that it's not quite ready for production use. Is that correct?

As it happens, I have an R library called WindowLickR that does exactly this type of V/BCF query using an Rcpp version of the logic you have here, calling htslib rather than system("bcftools ..."). For now I will hack something together myself using windowlickr to provide a "window function" that takes n and returns the genotypes as your code would. I'm happy to try fixing the issue for everyone by upstreaming this solution using windowlicker, if a dependency on that is acceptable to you.

Best, Kevin

petrelharp commented 3 years ago

Ah dear, I was worried that might come back to bite me. It looks though like just putting quotes around the -r string will fix it? That's easy enough to do...

kdm9 commented 3 years ago

...until some clever bugger uses both | and ' in a contig name :smile:

but yes, that would certainly fix the immediate issue. Better yet, wrap the region in the R equivalent of python's shlex.quote().

kdm9 commented 3 years ago

Out of interest, the following seems to be working for me as a replacement window function minter that uses windowlickr

windowlickr_windowfun <- function (file, size=20000, samples=NULL, ...) {
    if (is.null(samples)) { samples = windowlickr:::bcf_getSamples(file) }
    windows = windowlickr:::bcf_getWindows(file, windowsize=size, slide=size)
    pos.fn <- function(n, ...) {
        windows[n,] %>%
            dplyr::select(chrom=contig, start, end=stop)
    }
    win.fn <- function (n,...) {
        if (n > nrow(windows)) stop(paste0("No such window: ", n))
        region = windows[n, "region"]
        ret = windowlickr:::readBCFQuery_(file, region, samples)
        nsnp = length(ret$POS)
        if (nsnp < 1) {
          return(NULL)
        }
        GT = matrix(unlist(ret$GT, recursive = F), nrow=nsnp, byrow = T)
        return(GT)
    }
    attr(win.fn,"max.n") <- nrow(windows)
    attr(win.fn,"region") <- pos.fn
    attr(win.fn,"samples") <- samples
    class(win.fn) <- c("winfun", "function")
    return(win.fn)
}
petrelharp commented 3 years ago

...until some clever bugger uses both | and ' in a contig name smile

Right, well...

A short google didn't turn up the R function to do the quoting, do you happen to know what it is?

kdm9 commented 3 years ago

A short google didn't turn up the R function to do the quoting, do you happen to know what it is?

Perhaps shQuote? I'm more a python guy so i've not used system() and friends in R much before, i'm not sure if that function fully does all we need.

kdm9 commented 3 years ago

NB: any solution like shquote should be applied to all other bits of the command line too, namely sample names and file names (at least)

petrelharp commented 3 years ago

I"ve switched #21 over to using shQuote and got github CI to automatically run the tests! Assuming it passes the tests (did on my computer) I'll merge it in, and perhaps you could verify it fixes your problem?