knausb / vcfR

Tools to work with variant call format files

s3read_using lasts forever reading a VCF file #167

Closed HediaTnani closed 4 years ago

HediaTnani commented 4 years ago

Hi,

Is there a way to speed up reading a VCF file from the 3kricegenome public S3 bucket?

Could this process be parallelized?

And how about reading many VCFs at the same time?

Thanks a lot.

load packages

library(aws.s3)
library(vcfR)

s3read_using(FUN = read.vcfR, bucket = '3kricegenome', object = "9311/B183.snp.vcf.gz")

session info for your system

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Hardware: i7, 8 GB RAM

knausb commented 4 years ago

Hi @HediaTnani ,

It sounds to me like you're trying to read in 3,000 rice genomes, which is a tremendous amount of data. I feel that one of R's strengths is how interactive it is, but R was not really designed for high performance; it's a slow language. I've improved this a lot in vcfR by using Rcpp, but it will never compete with a compiled-code solution. If you have a really large amount of data, as you appear to have, I would suggest using vcfR to prototype your analysis on a subset of the data, then using a compiled language to process the entire dataset. The software VCFtools is a great option.
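For example, here's a rough sketch of prototyping on a subset. It uses read.vcfR's nrows argument to limit how many variants are parsed, and the bucket and object names are the ones from your example; the 10,000-row cutoff is just an arbitrary choice for prototyping.

library(aws.s3)
library(vcfR)

# Parse only the first 10,000 variants of the VCF body for prototyping;
# s3read_using downloads the object to a temporary file and applies FUN to it.
vcf_sub <- s3read_using(
  FUN = function(f) read.vcfR(f, nrows = 1e4),
  bucket = "3kricegenome",
  object = "9311/B183.snp.vcf.gz"
)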

https://vcftools.github.io/documentation.html

Note that they offer both compiled modules and interpreted (Perl) modules. The compiled modules should perform much better than an interpreted language such as Perl or R.

Good luck! Brian