knausb / vcfR

Tools to work with variant call format files

Improving run time performance #140

Open arumds opened 5 years ago

arumds commented 5 years ago

I am processing a VCF file to split the columns up to and including the INFO field (columns 1-9) into tab-separated columns. The example input has 250K variants.

To do this I am using the code below:

###STEP1: Reading the VCF file took approx. ~4 min (196 sec)####
system.time(pop_vcf <- vcfR::read.vcfR("pop.vcf.gz"))
   user  system elapsed 
196.416   5.611 202.040

###STEP2: Extracting all the key/value pairs in INFO  field #####
my_INFOs <- grep("INFO", queryMETA(pop_vcf), value = TRUE)
my_INFOs <- sub("INFO=ID=", "", my_INFOs)

### STEP3a: Using lapply to extract the values for all the keys in the INFO column. This took ~5 min (302 sec) ####

system.time(my_INFOm <- matrix(unlist(lapply(my_INFOs, function(x){ extract.info(pop_vcf, element = x) })),
                   ncol = length(my_INFOs), byrow = FALSE))

 user  system elapsed 
302.210   0.092 302.336 

### STEP3b: The same was implemented with mclapply (from the parallel package) and this took ~5 min (316 sec) ####
library(parallel)
system.time(my_INFOm <- matrix(unlist(mclapply(my_INFOs, function(x){ extract.info(pop_vcf, element = x) })),
                   ncol = length(my_INFOs), byrow = FALSE))

   user  system elapsed 
316.789   1.800 170.253

###STEP4: The INFO fields are combined with columns 1-8 to get the final output####
colnames(my_INFOm) <- as.character(my_INFOs)
popcount_out <- cbind(getFIX(pop_vcf), my_INFOm)
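
For reference, a minimal sketch of the final write step (the output file name is just a placeholder), since the goal is a tab-separated file:

### Write the combined fixed columns + INFO matrix as tab-separated text ###
write.table(popcount_out, file = "pop_info.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE)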

The input dataset with 250K variants takes ~10 min to execute STEP1 + STEP3a (or STEP1 + STEP3b), and the run time increases with the number of variants in the input VCF file. Is it possible to reduce the runtime with some parallel processing methods in R so that it scales better with the size of the input VCF? I am not familiar with parallel processing methods; in the above code I have used lapply and mclapply for STEP3, which do not show a significant improvement.

Any suggestions to improve performance while reading the file in STEP1 and in STEP3a/3b? Since this is a performance issue rather than a bug in the package that needs a reproducible example, I am not able to share any input file.

knausb commented 5 years ago

Hi Mehar,

A reproducible example is very important. In the example below I process > 300K variants in seconds, while you report that it takes you 10 minutes. I suspect the issue is that you have more samples than I realize, but because you've provided no data I don't know, and that means I can't help you. It doesn't need to be your data; that's why I've provided example data. You could try to modify the code below to reproduce the behaviour you're experiencing.

Also, you are doing things I feel are unnecessary. One of the performance features of vcfR is that it only does things on demand. If your goal is to parse an entire file, you might want to try VariantAnnotation.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
library(pinfsc50)
library(microbenchmark)

vcf <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
my_vcf <- read.vcfR(vcf, verbose = FALSE)
my_vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 22,031 variants
#> Object size: 22.4 Mb
#> 7.929 percent missing data
#> *****        *****         *****
# "Add" more variants.
my_vcf@fix[, "CHROM"] <- 1
for(i in 1:4){
  my_vcf2 <- my_vcf
  my_vcf2@fix[, "CHROM"] <- i + 1
  my_vcf <- rbind2(my_vcf, my_vcf2)  
}
my_vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 5 CHROMs
#> 352,496 variants
#> Object size: 90.5 Mb
#> 7.929 percent missing data
#> *****        *****         *****
write.vcf(x = my_vcf, file = "big_data.vcf.gz")
res <- microbenchmark( my_vcf <- read.vcfR("big_data.vcf.gz", verbose = FALSE),
                       times = 10, unit = "s")
my_vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 5 CHROMs
#> 352,496 variants
#> Object size: 90.5 Mb
#> 7.929 percent missing data
#> *****        *****         *****
print(res)
#> Unit: seconds
#>                                                     expr      min       lq
#>  my_vcf <- read.vcfR("big_data.vcf.gz", verbose = FALSE) 7.574603 7.594612
#>      mean   median       uq      max neval
#>  7.873175 7.634449 7.660916 10.08351    10

Created on 2019-07-09 by the reprex package (v0.3.0)
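
As a rough sketch of the VariantAnnotation route I mentioned above (assuming the Bioconductor package is installed; the file name and genome tag are placeholders), every INFO key can be parsed in a single pass:

library(VariantAnnotation)

# readVcf() parses the whole file, including all INFO keys, in one pass
vcf <- readVcf("pop.vcf.gz", genome = "hg19")

# info() returns a DataFrame with one column per INFO key;
# the fixed columns are available via rowRanges() / fixed()
info_df <- info(vcf)
head(info_df)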

arumds commented 5 years ago

Hi Brian,

You are correct that my input file has 2500 samples, and the issue could be the sample size. Here is the big dataset I am trying to read; it contains the variant calls for chr1 from the 1000G Phase3 data.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

library(vcfR)
library(microbenchmark)

> microbenchmark(popcount_vcfinput <- read.vcfR("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"), times = 10, unit = "s")
Scanning file to determine attributes.
File attributes:
  meta lines: 252
  header_line: 253
  variant count: 6468094
  column count: 2513
Meta line 252 read in.
All meta lines processed.
Error: cannot allocate vector of size 121.1 Gb

I couldn't even read the file. Could you suggest how to make use of vcfR to process such big data?

knausb commented 5 years ago

Hi Mehar, that was the critical information I needed. The issue is that file is too large for you to read in.

memuse::howbig(nrow = 6468094, ncol = 2513)
#> 121.104 GiB

Created on 2019-07-10 by the reprex package (v0.3.0)

For context, my workstation only has 32 GB of RAM, so even if the language could theoretically create a data structure that large, my system doesn't have enough physical RAM. Also, it's been my experience that R doesn't perform very well if you try to make data structures that are over 1 GB in RAM.

R is somewhat unusual as a language in that it was designed to read all of your data into RAM at once. Other languages tend to read in a chunk of your data, perform something, dump the results to an outfile, and read in the next chunk. vcfR::read.vcfR() has nrows and skip parameters, so you could read in parts of the file. Also keep in mind that because R is a high-level, interpreted language, it doesn't perform very well with really large datasets. I think you need to put some critical thought into what you're trying to do. Perhaps using something like vcftools may be a more efficient option?
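
A minimal sketch of that chunked approach (illustrative only: it assumes skip counts variant rows, and it uses the AF INFO key as an example of keeping only what you need from each chunk; the cols argument may also help you avoid reading all ~2,500 sample columns when only the fixed/INFO columns are needed):

library(vcfR)

big_vcf    <- "ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
chunk_size <- 1e5
n_variants <- 6468094  # reported by the attribute scan above
starts     <- seq(0, n_variants - 1, by = chunk_size)

# Read one chunk at a time, extract what is needed, then let the chunk be
# garbage collected so memory use stays bounded.
af_list <- lapply(starts, function(s) {
  chunk <- read.vcfR(big_vcf, nrows = chunk_size, skip = s, verbose = FALSE)
  extract.info(chunk, element = "AF")
})
af <- unlist(af_list)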