Closed: HediaTnani closed this issue 4 years ago.
Hi @HediaTnani ,
It sounds to me like you're trying to read in 3,000 rice genomes, which is a tremendous amount of data. One of R's strengths is how interactive it is, but R was not designed for high performance; it's a slow language. I've improved this a lot in vcfR by using Rcpp, but it will never compete with a compiled-code solution. With a dataset as large as yours, I would suggest using vcfR to prototype your analysis on a subset of the data, then using a compiled language to process the entire dataset. The software VCFtools is a great option.
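As a sketch of the prototype-on-a-subset idea: `read.vcfR()` has an `nrows` argument that limits how many variant rows are parsed. The file path below is illustrative (it assumes the VCF has already been downloaded locally from S3, which is not shown in this thread):

```r
library(vcfR)

# Parse only the first 10,000 variant rows for fast, memory-light prototyping.
# "B183.snp.vcf.gz" is assumed to be a locally downloaded copy of the S3 object.
vcf_subset <- read.vcfR("B183.snp.vcf.gz", nrows = 10000, verbose = FALSE)

# Develop and test the analysis on vcf_subset, then scale up with compiled tools.
```

Once the workflow behaves as expected on the subset, the full dataset can be handed off to a compiled tool such as VCFtools.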
https://vcftools.github.io/documentation.html
Note that they offer both compiled modules and interpreted (Perl) modules. The compiled modules should perform much better than an interpreted language such as Perl or R.
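A minimal sketch of driving VCFtools' compiled binary from an R session with `system2()`. This assumes `vcftools` is installed and on the PATH, and the file names are illustrative, not taken from the bucket:

```r
# Build the argument vector for VCFtools' compiled binary.
# --gzvcf reads gzip-compressed input; --recode writes the remaining
# sites to a new VCF. File names here are illustrative assumptions.
args <- c("--gzvcf", "B183.snp.vcf.gz",
          "--maf", "0.05",        # keep sites with minor allele frequency >= 0.05
          "--recode",
          "--out", "B183.filtered")

# Only attempt the call if the binary is actually available.
if (nzchar(Sys.which("vcftools"))) {
  system2("vcftools", args)
}
```

The same command works directly from the shell, which is the more usual way to run VCFtools over a large dataset.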
Good luck! Brian
Hi,
Is there a way to speed up reading a VCF file from the 3kricegenome public S3 bucket?
Could this process be parallelized?
And how about reading many VCFs at the same time?
Thanks a lot.
```r
# Load packages
library(aws.s3)
library(vcfR)

# Read the VCF directly from the public S3 bucket
s3read_using(FUN = read.vcfR, bucket = '3kricegenome', object = "9311/B183.snp.vcf.gz")
```
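The parallelization question above can be sketched with base R's `parallel` package. This is a hedged example, assuming the VCFs have already been downloaded locally (the file names are illustrative); note that `mclapply()` forks, so it parallelizes on Linux/macOS but not on Windows:

```r
library(parallel)
library(vcfR)

# Illustrative list of locally downloaded VCFs (not actual bucket paths).
vcf_files <- c("B183.snp.vcf.gz", "B184.snp.vcf.gz", "B185.snp.vcf.gz")

# Read several VCFs at once, one forked worker per file.
# Keep mc.cores modest: each worker holds a full vcfR object in memory,
# which matters on an 8 GB machine.
vcf_list <- mclapply(vcf_files,
                     function(f) read.vcfR(f, verbose = FALSE),
                     mc.cores = 2)
```

This parallelizes reading across files; it does not speed up parsing of a single large VCF, and the S3 download itself remains network-bound.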
session info for your system

```
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Hardware: i7, 8 GB RAM
```