vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0

Calculating RDA on large datasets #527

Open TonyKess opened 2 years ago

TonyKess commented 2 years ago

Hello,

We're using the RDA function to carry out genome scans for signals of adaptation similar to this paper - we are beginning to run into speed problems with very large datasets (e.g. 1e+7 x 1000 matrices). Are there any solutions for speeding up computation of the RDA for very large datasets? We have looked into parallelizing across subsets of the data, but I was curious if there were other methods available. Any advice appreciated!

jarioksa commented 2 years ago

I think memory may be a bigger issue than speed: time has no limit, but memory does.
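
For the matrix size mentioned above, a quick back-of-the-envelope check (assuming a dense numeric matrix of 8-byte doubles) shows the scale of the problem:

```r
## Memory needed just to hold a dense 1e7 x 1000 double matrix,
## before any copies made during the analysis.
1e7 * 1000 * 8 / 1024^3   # about 74.5 GiB
```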

There is no special handling of large data sets in vegan::rda. However, most of the time is spent in matrix algebra that is handled by the external BLAS/LAPACK libraries used by R, and many implementations of these basic linear algebra subroutines (BLAS) are parallelized and vectorized. R comes with a simple "reference BLAS" that is slow, and using an optimized BLAS (and LAPACK, but the keystone is BLAS) can give you a huge speed-up.

So start by checking your BLAS: sessionInfo() tells you what kind of BLAS and LAPACK you have. If both of these point to your R installation, you should look into getting something better. Good alternatives are Intel MKL (Math Kernel Library), OpenBLAS and, on macOS, the Accelerate framework (if Accelerate is in use, the BLAS entry may be missing from sessionInfo()). For instance, on my M2 MacBook the Accelerate BLAS is 160 times faster in some BLAS routines than the reference BLAS on the same computer (and both are fast compared to Intel PCs). I don't think we need to develop a parallel RDA: parallelization (and SIMD vectorization) should be handled in the BLAS. But don't forget the memory (pun intended): if memory is exhausted, everything gets very slow.
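
For example, a minimal way to check, plus a rough timing of a BLAS-bound matrix product (the matrix size here is arbitrary; rerun the timing after switching BLAS to see the difference):

```r
## Which BLAS/LAPACK is this R linked against? (paths reported since R 3.4)
si <- sessionInfo()
si$BLAS
si$LAPACK

## Rough benchmark of BLAS-bound matrix algebra: crossprod() is a single
## BLAS call (dsyrk), so its timing mostly reflects the BLAS in use.
set.seed(42)
x <- matrix(rnorm(5000 * 1000), nrow = 5000, ncol = 1000)
system.time(crossprod(x))
```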

Another issue is that there are no safeguards in the simple statistics for 1e7 observations. Things like sums, means and variances can become unreliable with such a huge number of observations. I don't know, because the code was never developed or tested for such cases. It may be OK, or it may not be.
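
A toy illustration of the kind of issue meant here (not vegan code): a naive one-pass variance subtracts two nearly equal large numbers, and with many observations offset far from zero it can drift badly, while the centred two-pass computation in var() stays accurate:

```r
## Naive one-pass variance vs. stats::var() on 1e7 values with a large offset.
## The naive formula can be off by more than the true variance (even negative);
## how far off depends on the platform, but var() stays close to 1.
n <- 1e7
x <- rnorm(n, mean = 1e8, sd = 1)
naive <- sum(x^2) / n - mean(x)^2
c(naive = naive, var = var(x))
```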

jarioksa commented 2 years ago

@TonyKess : I had a look at your profile. If RDA can help get halibut to the fishmongers, I hope you can make RDA work. Halibut is my favourite!

TonyKess commented 2 years ago

Thanks for this advice! We are checking out BLAS now, and looking into building some checks on internal stats for when we are using really large datasets. We've used RDA successfully on Halibut, but have some other tasty species to use it on now too!