morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License

Can we change the logic in `get_manta_sv()`? #188

Closed Kdreval closed 1 year ago

Kdreval commented 1 year ago

Currently, `get_manta_sv()` calls `get_gambl_metadata()` internally to determine which samples are missing from the flatfile, even when `these_samples_metadata` is supplied as an argument. As a result, even if only one sample is missing from the flatfile, we collect and import bedpe files for the entire GAMBL, and the function then subsets the resulting data frame to that one sample. This makes the processing time unnecessarily long. Here is a small example for a random sample missing from the flatfile:

setwd("~/GAMBLR/")

library(GAMBLR)
library(dplyr)

sample_meta <- get_gambl_metadata() %>%
    filter(sample_id == "01-16433_tumorC")

start.time <- Sys.time()
get_manta_sv(these_samples_metadata = sample_meta)
end.time <- Sys.time()
end.time - start.time

Time difference of 3.023975 mins

This is a best-case scenario: during peak hours the same call can take ~4 min for one sample.

The fix, I think, is to replace the line here with:

all_meta <- these_samples_metadata
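
To illustrate the idea, here is a minimal sketch of how the missing samples could be determined from the supplied metadata alone. The helper name `find_missing_samples()` and the toy data are hypothetical and not part of GAMBLR, and I am assuming the flatfile identifies samples in a `tumour_sample_id` column:

```r
# Hypothetical helper (not part of GAMBLR): returns the sample IDs present in
# the supplied metadata that have no rows in the flatfile-derived SV table.
find_missing_samples <- function(these_samples_metadata, flatfile_sv) {
  # Proposed change: use the caller's metadata directly instead of calling
  # get_gambl_metadata() for the whole cohort.
  all_meta <- these_samples_metadata
  # Assumption: the flatfile keys samples by `tumour_sample_id`.
  setdiff(all_meta$sample_id, unique(flatfile_sv$tumour_sample_id))
}

# Toy data, made up for illustration only
flatfile_sv <- data.frame(
  tumour_sample_id = c("sample_A", "sample_A", "sample_B"),
  stringsAsFactors = FALSE
)
sample_meta <- data.frame(
  sample_id = c("sample_B", "sample_C"),
  stringsAsFactors = FALSE
)

missing_samples <- find_missing_samples(sample_meta, flatfile_sv)
print(missing_samples)
```

With this logic, only the bedpe files for the samples in `missing_samples` would need to be collected and imported, rather than those for every sample in GAMBL.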

mattssca commented 1 year ago

At first glance, I like this solution, and I can definitely understand the issue you are describing. I can make the necessary adjustments on my branch, run some tests, and push to my active PR. Thanks, Kostia!