hmsc-r / HMSC


Computation time: is there a way to reduce running time, e.g., GPU? #128

Open · yihsiu opened this issue 2 years ago

yihsiu commented 2 years ago

Dear HMSC team,

I've been running a multi-species Hmsc model (69 species, ~2100 observations, with 2 chains, thin = 10, and 10000 samples) and found the computation time unexpectedly long (>7 days and still running). In addition to parallel computing, I was wondering whether there are other suggestions for speeding up the computations; for instance, is it currently possible to run Hmsc on a GPU?

Any suggestions on speeding up the computations will be appreciated.

Thank you very much.

Best wishes,
Yi-Hsiu

yihsiu commented 2 years ago

Ah, sorry, I just broke the loop and re-ran a trial model with fewer samples and verbose set to 1. I then found that the total number of iterations is actually the sum of samples × thinning and burn-in × thinning, not the set samples value. The original sample size is probably too large for just testing. I have changed my MCMC settings to smaller values but am still quite interested in knowing whether there are other ways to speed up the computations.
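In code, the arithmetic looks roughly like this (a back-of-envelope sketch with illustrative numbers, assuming the total per chain is transient plus samples × thin, as the verbose output suggests):

```r
# Illustrative arithmetic only: total Gibbs iterations per chain in
# sampleMcmc(), assuming transient is given in raw iterations.
samples   <- 10000
thin      <- 10
transient <- round(0.5 * samples * thin)  # 50000
transient + samples * thin                # 150000 iterations per chain
```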

Thank you.

ovaskain commented 2 years ago

Hi Yihsiu,

Speeding up computations (including the possibility of using a GPU) is one of the main targets for Hmsc development. However, it is difficult to predict when a GPU-friendly version might be available. It is always recommended to first run just a couple of samples so you can estimate the running times and avoid waiting an unknown length of time. I usually set samples = 250, nChains = 5, transient = round(0.5 * samples * thin), and then loop over thin = c(1, 10, 100, 1000, …). In this way I get preliminary results (thin = 1) fast and can move on with the analyses while the more definitive MCMC is still running. I then examine the MCMC convergence of the largest thin fitted thus far to see whether I can stop the runs or an even longer chain is still needed.
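A minimal sketch of this looping strategy (assuming `m` is an unfitted Hmsc model object defined elsewhere; the file names are illustrative):

```r
library(Hmsc)

samples <- 250
nChains <- 5

for (thin in c(1, 10, 100, 1000)) {
  transient <- round(0.5 * samples * thin)
  fit <- sampleMcmc(m, samples = samples, thin = thin,
                    transient = transient, nChains = nChains,
                    nParallel = nChains)
  # Save each fit so the preliminary (small-thin) results can be
  # inspected while the longer runs are still going.
  saveRDS(fit, file = paste0("fit_thin_", thin, ".rds"))
}
```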

Best,

Otso


jarioksa commented 2 years ago

Some Hmsc models can run very long: seven days is not much in that game, and seven weeks is normal. The main lever for speeding up computations lies outside Hmsc and even outside R: the implementation of the LAPACK linear algebra library and, even more crucially, the implementation of the Basic Linear Algebra Subprograms (BLAS). We have seen 100-fold timing differences between BLAS implementations. What you can use depends on your hardware and your operating system. On Linux, OpenBLAS on multicore systems (say, 40 cores) can be really fast. On macOS, using Apple's native accelerated BLAS gives a great speed-up; in particular, Apple silicon (M1), with a dedicated co-processor for BLAS (the "Neural Engine"), gives a huge speed-up. On Windows you just must trust Microsoft, but there are reports that builds made with Intel tools can be really fast.

We are investigating the use of GPUs with people at the national supercomputer service and some GPU manufacturers, but it seems to boil down to the same issue: how to get a GPU-enabled BLAS. In principle such libraries exist, but they are usually not available for your computing platform. The computation time is spent in BLAS, and that is what needs to be sped up; we can do very little in Hmsc itself, as the time-consuming parts are system services.
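If you are unsure which BLAS/LAPACK your R session is linked against, a quick check (these sessionInfo fields are available in R >= 3.4):

```r
# Paths of the BLAS and LAPACK libraries used by the current R session;
# a generic reference path usually means the slow default BLAS.
si <- sessionInfo()
si$BLAS
si$LAPACK
```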

jarioksa commented 2 years ago

Like Otso wrote, you should not jump into the final analysis at once; you really should first run shorter runs to see what values of thin and transient (burn-in) you need. The coda package provides tools for analysing these issues. Otso's suggestion of starting with thin=1 gives you a first idea, but for most models I would expect thin=10 to be too low (and usually I'd say that samples=10000 is an exaggeration, though typical for ecologists). If you have many cores and your BLAS is not multicore (if BLAS is multicore, each chain gets a smaller number of cores for BLAS and calculations are not faster), I'd suggest using more chains, allowing fewer samples per chain for the same total number of posterior samples.
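A sketch of such a convergence check with coda, assuming `fit` is a model fitted by sampleMcmc() with at least two chains:

```r
library(Hmsc)
library(coda)

post <- convertToCodaObject(fit)

# Effective sample sizes for the species niche parameters (Beta)...
effectiveSize(post$Beta)

# ...and potential scale reduction factors; psrf values close to 1
# indicate the chains have converged.
gelman.diag(post$Beta, multivariate = FALSE)$psrf
```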

yihsiu commented 2 years ago

Dear Otso & Jari,

Thank you both very much for the replies, and sorry for the late response; it took me some time to arrange a better machine with Linux (Ubuntu) and OpenBLAS installed for this project. I have given it a few tries on the new computer and the result is incredible: the same script that used to take several days now finishes in about 10 hours.

I am not familiar with Linux and am still trying to figure out why there is such a difference. As you mentioned that OpenBLAS on multicore systems can much improve computation speed on Linux, I was wondering: does OpenBLAS improve speed through a logic similar to parallel computing (so the more cores, the better the performance)? Or does it simply improve the efficiency of matrix calculations, without necessarily depending on the number of cores?

And, since I am currently testing how many iterations are required for this model, I was also wondering whether there is currently a built-in function in Hmsc for updating a fitted model with more iterations when we find the model has not properly converged?

Thank you both again for the help!

Best wishes,
Yi-Hsiu

jarioksa commented 2 years ago

Basically you can speed up calculations in two ways: with parallelization and with vectorization. In parallelization you run several large tasks simultaneously on different CPUs; in matrix algebra this means dividing matrices into slices, making the calculations for each slice on a different CPU, and collecting the results together. In vectorization you perform several simple computations during the same clock cycle ("simultaneously") within one CPU. This is called "Single Instruction, Multiple Data" or SIMD (and is also a type of parallelization, but within one CPU). A matrix is made of vectors, and when a vector is contiguous in computer memory you can, say, multiply eight elements during one clock cycle instead of performing one multiplication per cycle.

BLAS can be made to use both, and honestly I don't know exactly what OpenBLAS does. It certainly uses parallelization over CPUs: I have monitored the CPU usage of processes on Linux and seen values much over 100% per thread, up to 1000%, meaning that a single parallel thread was using 10 CPUs. I don't know about its vectorization. GPUs are typically SIMD processors, but I don't know of a publicly available GPU-enabled BLAS. It seems that the new Mac M1 ARM processor has a separate co-processor for matrix algebra ("Neural Engine"), which is probably based on SIMD, but it is still unable to use its 8 (or more) GPU cores for the same task.
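A quick way to see BLAS parallelization in action is to time a BLAS-heavy operation and watch CPU usage (e.g., with top) while it runs; a small sketch:

```r
# Matrix multiplication and Cholesky decomposition are both handled by
# BLAS/LAPACK; with a multi-threaded BLAS, `top` will show this single
# R process using several hundred percent CPU.
n <- 4000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- A %*% t(A))              # matrix multiplication
system.time(L <- chol(B + n * diag(n)))   # Cholesky of a positive-definite matrix
```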

Then about continuing analyses. Currently we have a c() function that can add new chains to old models. If you have two (or more) fitted models, you can combine them so that the chains of the second are added to the first. For this, the models must be identically defined (also with the same samples, thin, transient, etc.). See the help of c.Hmsc in the current GitHub version. This means you have to re-run the transient, which adds some overhead, but you can get more samples by adding chains to old models. We have discussed implementing a function to continue chains from an old model, but writing such a function needs much care. Brave people have tried doing it manually, but it really takes much care and manual work to do correctly.
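A minimal sketch of that pattern, assuming `fit1` and `fit2` come from sampleMcmc() calls with identical model definitions and MCMC settings:

```r
# Combine the chains of two identically defined runs into one model;
# posterior summaries then draw on all chains together.
fit_combined <- c(fit1, fit2)

# The combined object carries the chains of both runs
# (assuming the fitted posteriors are stored in postList).
length(fit_combined$postList)
```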

admahood commented 1 year ago

This has been an extremely helpful thread. Thanks!

jonpeake commented 1 year ago

For those who may be monkeying around with different BLAS implementations: I'm running a spatial MCMC model on a 21k-observation, 90-species dataset with Intel's MKL BLAS. Single-threaded MKL is considerably faster than the default R BLAS, but I have noticed virtually no speed increase when increasing the number of MKL threads. From my digging, the bottleneck appears to be the sheer number of individual linear algebra computations occurring within a single MCMC step: combine that with the time it takes the processor to split, load, and offload data and results from each core, and in some cases a large number of threads actually decreased speed compared to the single-threaded implementation. Jari or Otso may have more insight, but I would guess that a multi-threaded BLAS is only really useful for much larger datasets. For these "smaller" datasets it is better to increase the number of parallel chains (while potentially decreasing the number of samples per chain to reduce sampling time) rather than the number of BLAS threads.

As a side note, the c() function is particularly useful in this regard, especially if you have access to an HPC cluster. The ability to run several simultaneous multi-chain sampleMcmc procedures on separate nodes and then combine the chains into one "super-chained" model is a game changer.

jarioksa commented 1 year ago

There is an overhead from forking (or launching a socket) and combining results, but that occurs only at the start and end of a chain: each chain runs in a separate process, and there is no splitting and combining after a BLAS call within a chain. What happens within a single BLAS call is another issue that we cannot control from Hmsc: the number of chains is the ceiling for Hmsc parallel processes, and BLAS threading is controlled by the operating system independently of the settings in Hmsc calls. One common problem is that with N parallel processes (chains) you need N times the memory, and excessive memory demands can throttle the process, so you just idle while data are swapped in and out of memory. In some cases there may be a race for a shared resource; I don't know how different BLASes are implemented and whether some could race if several processes call the library simultaneously. This can certainly happen on Mac M1/M2 computers, which have dedicated fast BLAS hardware ("Neural Engine") that can be exhausted when too many processes try to share its limited resources (16 cores). On an eight-core Mac I normally use four or five forked processes, or fewer for large datasets and large models (such as spatial models) that need a lot of memory.

jonpeake commented 1 year ago

@jarioksa, I was referring more to a multi-threaded BLAS, where each individual linear algebra task (e.g., Cholesky decomposition, matrix multiplication) is split among multiple threads. Several BLAS configurations have an option for using a single thread or up to the number of physical cores on the CPU. You can use this in combination with forking of chains to use essentially the entirety of a given computing configuration (i.e., optimizing the amount of RAM and CPU used), but what I have found is that the single-threaded version of a BLAS (i.e., setting the number of MKL or OpenBLAS threads to 1) is just as fast, and sometimes faster, than the multi-threaded version of the same BLAS (i.e., setting the number of threads to the number of CPU cores).
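For anyone wanting to reproduce this comparison, one way to pin the BLAS thread count from within R is the RhpcBLASctl package (an assumption on my part; environment variables such as MKL_NUM_THREADS or OPENBLAS_NUM_THREADS, set before R starts, also work):

```r
library(RhpcBLASctl)
library(Hmsc)

# Pin BLAS to a single thread per process...
blas_set_num_threads(1)

# ...and spend the cores on parallel chains instead. `m` is assumed to
# be an unfitted Hmsc model object; the MCMC settings are illustrative.
fit <- sampleMcmc(m, samples = 250, thin = 10,
                  transient = round(0.5 * 250 * 10),
                  nChains = 12, nParallel = 12)
```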

For example, when testing the timing of my model on my university's HPC cluster, using R built with Intel MKL, I found that setting the number of MKL threads to 1 or to 12 did not change the time per MCMC sample; each sample took approximately 1.5 minutes regardless of the number of threads used to compute the posteriors. So I decided to split my samples across 12 chains instead (the amount of memory on that particular partition allowed it).

But yes, as far as forked processes go, I've also found memory to be the limiting factor. That's where running several multi-chain models across nodes in an HPC cluster (forked chains within a single node, but independent runs of the same model across nodes) can help alleviate this bottleneck while still increasing the sample size to a suitable effective level.

To clarify, this was more an informational comment for others who may be interested in implementing a different BLAS in their environment. I completely understand that it's impossible to control what the BLAS is doing from within Hmsc; it was more a comment on the behavior I've witnessed regarding the efficiency of running these models.

jarioksa commented 1 year ago

Actually, I had a similar experience on our local 40-CPU Linux machine: timing was nearly independent of the number of parallel processes in Hmsc. Once, when I was the only user, I found that sampleMcmc was using 4000% of CPU when running a single process – and it was still using 4000% in total when running four parallel processes. Total timing was nearly equal, and running the chains in parallel did not help. Most of its time sampleMcmc spends in matrix multiplication, Cholesky decomposition, and other BLAS calls; from the perspective of computing resources, all the rest – the thousands of lines that make up Hmsc – is nothing but glue between BLAS calls. Things change when you limit (at the OS level) the number of cores a single process can use: if you allow 4 CPUs per process, you can use 16 for four parallel processes. Limiting processes is regarded as socially responsible, and we usually do so, although I think the OS would take care of a democratic allocation of CPUs among multiple users (but I don't want to get a computer ASBO).