Open JimWhiting91 opened 6 years ago
I'm working with a large WGS VCF with ~10,000,000 SNPs across 111 individuals, and I'm having some issues processing it with vcfR. I can read in the file as a whole VCF, although this can take up to an hour; however, vcfR2genlight() is very slow, as it runs on a single core and does not appear to be registering the n.cores argument (64 cores available).
As an alternative, I've attempted to chunk the VCF and to read in and convert the chunks through mclapply(), which speeds things up substantially and completes. However, I then run into issues combining these genlight objects (cbind is very slow, data.table comes back with errors, and there are general problems with combining S4-type objects).
Do you have a recommended workflow for working with large VCFs, or can you please advise on why the n.cores argument is being ignored by vcfR2genlight()? Many thanks for any advice/help that you can provide.
Jim
Hi Jim, thanks for reporting this! It's been my experience that once R is using over 1 GB of memory, its performance starts to suffer. But we have some workarounds.
First off, are you using Windows? You mentioned mclapply(), which makes me think you're using a Unix flavor, but we should validate that. In fact, it would help if you could report the results of the following:
sessionInfo()
Perhaps some context would help too. What are your analytical goals? Do you really need to analyze all of those variants? Many popgen questions can be addressed with a representative subsample. Is it reasonable to assume different chromosomes are segregating differently in your populations? Would it be reasonable to break this up into a per-chromosome analysis?
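For example, here is a minimal sketch of thinning to a random subsample, assuming a vcfR object named myVcf (a hypothetical name) and an arbitrary target of 100,000 variants; vcfR objects can be subset with the [ operator, where rows index variants:

```r
library(vcfR)

# Randomly retain 100,000 of the variants in 'myVcf' (hypothetical object).
# sort() keeps the retained variants in their original genomic order.
set.seed(42)
keep <- sort(sample(nrow(myVcf@fix), size = 1e5))
myVcf_sub <- myVcf[keep, ]
```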
Thanks! Brian
Hi Brian, thanks for getting back to me.
So the aim is to run some DAPC analyses in adegenet, with a couple of goals: first, to do some population structure analyses to complement some additional methods, and second, to investigate polygenic selection (specifically, to identify groups of variants that separate phenotype clusters). The first of these could probably be investigated using a subset of the data as you suggest; however, I'd feel uneasy about removing variants prior to analyses of polygenic selection if it can be avoided. Session info is pasted below. Thanks again for your assistance.
Jim
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux release 6.8 (Carbon)

Matrix products: default
BLAS: /cm/shared/apps/lapack/gcc/64/3.7.0/lib64/libblas.so.3.7.0
LAPACK: /cm/shared/apps/lapack/gcc/64/3.7.0/lib64/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

library(parallel)
detectCores()
[1] 64
Hi Jim, thanks for getting back to me with that info. The function mclapply() uses forking, a Unix process model that does not exist on Windows, and adegenet relies on mclapply() for its parallel processing. Because you've validated that you're using Unix, we can scratch that off the list.
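For reference, here is a toy sketch of the distinction (the squaring task is purely illustrative):

```r
library(parallel)

# mclapply() forks the current R process, so mc.cores > 1 only has an
# effect on Unix-alikes; on Windows, mc.cores must be 1. parLapply()
# with a PSOCK cluster is the portable alternative.
res <- mclapply(1:8, function(i) i^2, mc.cores = 4)
```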
In adegenet, multi-threading has been implemented, but it has not been in vcfR. So it could be that n.cores was successfully set, but you did not get through the vcfR steps before you gave up. I see two potential bottlenecks in what vcfR2genlight() does. First, we need to extract the genotypes from the VCF data. This can be accomplished with something like this:
gt <- extract.gt(myVcfRobject)
Hopefully you can get through that step. If you can, then the next bottleneck may be the transpose. VCF data has samples in columns and variants in rows; adegenet expects the opposite, so we use the transpose function to rotate the matrix:
gt2 <- t(gt)
If you can get here, you may be home free. The functions find.clusters() and dapc() will work on a matrix, so if the bottleneck comes from the genlight object, we may have a workaround. I suspect the bottleneck is in the above code, though. Why don't you give it a try and let me know if you're successful? If it works, you can either try working with the matrix or I can help you through converting it to a genlight object.
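If you do end up working from the matrix, here is a sketch of what that might look like. It assumes gt2 is the transposed extract.gt() matrix with biallelic, unphased genotypes; the recoding scheme and the n.pca/n.clust values are illustrative assumptions, not recommendations:

```r
library(adegenet)

# extract.gt() returns character genotypes ("0/0", "0/1", ...), so recode
# them to numeric allele counts before clustering. Unmatched entries
# (e.g. missing genotypes) become NA.
gt_num <- matrix(
  c("0/0" = 0, "0/1" = 1, "1/0" = 1, "1/1" = 2)[gt2],
  nrow = nrow(gt2), dimnames = dimnames(gt2)
)
gt_num[is.na(gt_num)] <- 0  # crude missing-data fill; impute properly in practice

# Both functions accept a plain matrix; supplying n.pca and n.clust/n.da
# up front avoids the interactive prompts.
grp <- find.clusters(gt_num, n.pca = 100, n.clust = 4)
dp  <- dapc(gt_num, grp$grp, n.pca = 100, n.da = 2)
```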
Also, R does come with a performance cost. We use interfaces with C to improve its performance, but every time we convert between a C object and an R object there is a performance cost. Applications such as fastStructure or admixture might perform better for no other reason than that they don't use R. I'd love to help you find an R solution, but when performance is really the issue, sometimes it's best to avoid R.
Let me know if the above works. Brian
Hi Brian,
Thanks for your advice. I was able to use the above to extract genotypes across vcfR chunks and split the jobs across all cores, which made it substantially quicker. I then converted the matrices to data.tables and saved these into a list output from mclapply(). This list was "cbinded" using data.table:
setDT(unlist(list_of_genotype_dt, recursive = FALSE), check.names = TRUE)[]
which was substantially quicker than trying to cbind genlight objects. By converting back to a matrix, I can now proceed as you suggest. This workflow splits a job that previously ran for many hours before being killed into many jobs that can be read in, transformed, and combined in 19 minutes.
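For reference, here is a sketch of the full workflow; the chunk file names and the choice to transpose within each job are illustrative assumptions:

```r
library(vcfR)
library(parallel)
library(data.table)

# Hypothetical per-chromosome/per-region chunk files
chunk_files <- list.files("vcf_chunks", pattern = "\\.vcf\\.gz$", full.names = TRUE)

# Read each chunk, extract genotypes, and transpose so samples are in rows
# and variants in columns, one forked job per chunk.
list_of_genotype_dt <- mclapply(chunk_files, function(f) {
  chunk <- read.vcfR(f, verbose = FALSE)
  as.data.table(t(extract.gt(chunk)))  # as.data.table() drops the sample row names
}, mc.cores = detectCores())

# Fast column-wise bind of the per-chunk tables (the idiom above), then back
# to a matrix; sample order is identical across chunks, so row names can be
# reattached from any one chunk afterwards.
gt2 <- as.matrix(setDT(unlist(list_of_genotype_dt, recursive = FALSE), check.names = TRUE)[])
```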
Many thanks again for your help, Jim
Wow! That's a huge improvement. Thank you for raising this issue, and thank you for documenting your solution. The one downside is that because it uses mclapply(), it will not work for Windows users. In the past I've considered learning RcppParallel, which should work on all platforms supported by CRAN, but I haven't gotten around to learning it. Based on your report, it sounds like a large improvement in performance is possible. I've added this to my vcfR wish list so that I'll hopefully give it greater priority.
Thanks! Brian