aryaamgen opened 2 years ago
Any speedup with “n=2” and “chains=1”?
Nope. It takes about the same time whether n = 1, 2, or 4, and I get that error for n=8.
Actually, I just realized the worker nodes I was using only had 4GB of RAM each. Could that be a problem? If I recall correctly, my master node had 4 cores, so maybe that's why the error is happening when I go from n=4 to n=8. Is there a good way to check if my MPI is working? Maybe trying MPI with non-Torsten Stan? I can also see if Metrum has documentation on using MPI on Metworx. I just want to rule out whether it's a Torsten-specific thing or not.
Let me know if you need more information about the setup and environment.
Allow me to first test the example on my local build. There have been a few occasions where the same error was seen but went away when I switched to a different MPI build.
Here's what I found.
After using a freshly installed MPI library (MPICH), I was able to build and run the model with number of processes = 1, 2, 4, and got a speedup of 1.6x (n=2) and 2.7x (n=4). I was running it using cmdstan:
mpiexec -n 4 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.data.R random seed=3289
I have no access to metworx so wasn't able to test n=8.
Since you didn't see any speedup, I suspect the running script was not working properly. How was cmdstanr getting information such as the addresses of the worker nodes? Have you tried to run it on the master node?
How was cmdstanr getting information such as the address of the worker nodes?
Not sure. What do you mean? I also got the same error using cmdstan in the terminal. I'll try running on a master node and report back.
Is it also possible Metworx has an old incompatible version of MPICH? I noticed the instructions here mention to set CXXFLAGS += -isystem /usr/local/mpich3/include if on Metworx, but on my Metworx machine I don't see an include directory under /usr/local/mpich3. I contacted Metworx support about it and I'm waiting to hear back.
What do you mean? I also got the same error using cmdstan in the terminal.
When running mpiexec, whether inside cmdstanr or cmdstan, one needs to supply information on how worker nodes can be located. This can be done using the hostfile option of Open MPI, or through a submission script. This may not be easy to set up for users who are not familiar with cluster submission.
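For concreteness, here is a minimal sketch of such a hostfile and how it would be passed to mpiexec. The node addresses are hypothetical placeholders, not real cluster nodes:

```shell
# Write a toy hostfile; on a real cluster these would be the worker
# nodes' addresses (the names below are hypothetical).
cat > hostfile <<'EOF'
ip-10-0-0-1.ec2.internal
ip-10-0-0-2.ec2.internal
EOF

# MPICH and Open MPI take the host list with different flags:
#   MPICH:    mpiexec -f hostfile -n 4 ./twocpt_population sample ...
#   Open MPI: mpiexec --hostfile hostfile -n 4 ./twocpt_population sample ...
echo "hostfile lists $(wc -l < hostfile) worker nodes"
```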
@billgillespie maybe you and someone in metworx team can help @aryaamgen ? We have some scripts to generate hostfile and metrum's aws specialist may be able to automate that even more.
Is it also possible Metworx has an old incompatible version of MPICH?
Yes, metworx can provide previous snapshots. But at this point I'm not sure the library is the cause here (well, except that Open MPI's error message is not very helpful). We can take this path if the issue is reproduced on the master node and submission to worker nodes is fixed.
Ok, I tried running again on a 4 core master node and did get about a 2x speedup going from 1 to 2 nodes, but then no speedup at 4. I was also able to try 7 nodes; that worked but was actually slower (I assume because communication with the worker nodes is slow). But for some reason 8 nodes still gives that same error.
When running mpiexec, whether inside cmdstanr or cmdstan, one needs to supply information on how worker nodes can be located. This can be done using the hostfile option of Open MPI, or through a submission script. This may not be easy to set up for users who are not familiar with cluster submission.
Ah ok, I see. Yeah, I did not know about hostfiles at all. So I guess Metworx is somehow generating hostfiles automatically? I wonder if there's a way to see and edit those.
4 core master node
Is that 4 physical cores or 4 vCPUs? You can check that at metworx UI (master node information). If it's vCPU then what's provisioned is actually 4 threads, which would not scale at all in current MPI implementation.
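One way to check this from the shell as well, as a sketch; it assumes a Linux image where lscpu is available (as on standard Metworx instances):

```shell
# Logical CPUs (vCPUs / threads) vs. physical cores on a Linux box.
logical=$(nproc)
# 'lscpu -p=Core,Socket' prints one "core,socket" line per logical CPU;
# counting the unique pairs gives the number of physical cores.
physical=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
echo "logical CPUs:   $logical"
echo "physical cores: $physical"
# logical > physical means SMT/hyperthreading is on, and MPI processes
# beyond the physical count will contend for the same cores.
```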
Ah ok, interesting. Yes, it's 4 vCPUs. Does that mean that for a single multicore machine I'd be better off using reduce_sum() instead of MPI?
So far I've been using a 96 vCPU machine on Metworx and just wrapping my calls to pmx_solve_twocpt_rk45 in a reduce_sum call with a for loop so it's parallelized that way. The motivation for switching to Torsten's group integrator is that:
- I could keep the solutions in transformed parameters instead of having to re-solve them again in generated quantities
- It would allow me to scale past the 96 vCPU limit by moving to a cluster and adding nodes
Yes it's 4 vCPUs.
Then the performance will almost surely degrade when n=4.
Does that mean that for a single multicore machine I'd be better off using reduce_sum() instead of MPI?
Probably but it also depends on specific model.
It would allow me to scale past the 96 vCPU limit moving to a cluster and adding nodes
This is the exact motivation of using MPI.
I confirmed that n >= 8 fails on Metworx using the default MPICH setup. It happened with 16 and 32 vCPU instances, so it's not due to limits on physical cores. I need to learn a bit more about MPICH to get around it---or switch to OpenMPI.
I'm almost sure this is the same problem we've run into previously: metworx's vanilla installation fails but a manually built MPICH doesn't. I just tested on my local machine, which has only 4 cores, and n=8, 16, 32 all run well. I'll see if this behavior can be reproduced on metworx later today.
Interesting. Should I try to manually build MPICH myself on the Metworx cluster or would that be too difficult if I don't have much experience with cluster submission? Did you follow these directions?
I was also able to confirm what @billgillespie saw. The following is probably the simplest workaround (on a master-node-only Metworx instance; we can work out the worker-node details after clearing this).
sudo apt update
sudo apt -y install openmpi-bin
which installs openmpi's version of mpiexec. This can be confirmed by running
mpiexec --version
The output should be
mpiexec (OpenRTE) 2.1.1
Now
mpiexec -n 8 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.init.R
should not crash (neither should n > 8).
Let me know if it works on your end.
Thanks Yi! I tried the above directions, installing openmpi on an instance with 8 vCPUs and no worker nodes, just the master. I no longer get an error with n=8. However, I don't get any speedup going from 1 to 2, 2 to 4, or 4 to 8. Actually, at 8 it takes twice as long.
By the way, should I still have the following in the make/local file?
CXXFLAGS += -isystem /usr/local/include
CXXFLAGS += -isystem /usr/local/mpich3/include
However, I don't get any speedup going from 1 to 2 or 2 to 4 or 4 to 8
This is expected, because your binary twocpt_population was compiled using the original MPICH library. But now we can confirm that the cause is the original mpiexec.
There are two paths we can follow from this point: we can help you install a fresh MPI library, or we can wait for the metworx team's fix. If you are comfortable following the MPICH installation guide I'd say go for it. If your team runs multiple instances of metworx, it's probably a good idea to wait for the patch.
I installed it following the install guide here, although I skipped steps 9 and 10 since I'm just running on a single machine. I'm now able to see a speedup going from n=1 to n=2 and to n=4. But I get the error again when I go up to n=8.
You need to ensure you are using the newly built MPICH instead of the system version. I just did the following:

1. Install MPICH to a dedicated location, e.g. /data/mpich-install, using the prefix= option during configure (see manual).
2. Point make/local to the new build:
TORSTEN_MPI=1
TBB_CXX_TYPE=gcc
CXX=/data/mpich-install/bin/mpicxx
CC=/data/mpich-install/bin/mpicc
CXXFLAGS += -isystem /data/mpich-install/include
3. Rebuild the model:
make clean-all && make -j4 ../example-models/twocpt_population/twocpt_population
4. Run with the new mpiexec:
/data/mpich-install/bin/mpiexec -n 8 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.init.R
Now I am able to run n=1,2,4 with speedup and n>8 without any issue.
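One pitfall worth guarding against here: after a from-source install, a bare mpiexec on the PATH may still resolve to the system binary. A small sanity-check sketch (the prefix defaults to the /data/mpich-install path used above; MPICH_PREFIX is a hypothetical override variable):

```shell
# Check whether a locally built MPI would shadow the system mpiexec.
# Pure path logic, so it is safe to run even with nothing installed.
PREFIX="${MPICH_PREFIX:-/data/mpich-install}"
local_mpiexec="$PREFIX/bin/mpiexec"
system_mpiexec="$(command -v mpiexec || true)"

if [ -x "$local_mpiexec" ]; then
  echo "local build found: $local_mpiexec"
else
  echo "no local build at $local_mpiexec"
fi
if [ -n "$system_mpiexec" ] && [ "$system_mpiexec" != "$local_mpiexec" ]; then
  echo "note: a bare 'mpiexec' resolves to $system_mpiexec"
fi
```

If the last line prints, invoke the new binary by its full path (as in step 4 above) rather than relying on the PATH.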
Thanks @yizhang-yiz . That worked. I'm now able to run with n>8 and I see the speedup on the master node. Not sure how to set up the worker nodes to be able to communicate with the master. Do you know how to do that on Metworx?
Let me write out the procedure and test it on metworx before posting here.
@yizhang-yiz any update on this? Was starting to try to get together some stuff for ACoP and would be nice to get a cluster working on Metworx.
Sorry for the delay. I was only able to look into this on and off. Let me take another look this weekend.
You can follow the steps below:

1. Make sure the number of worker nodes N in the cluster is fixed.
2. Generate a hostfile from Grid Engine's node list: qconf -sel > hostfile
3. Run, e.g.: /data/mpich-install/bin/mpiexec -n 32 -bind-to core -f hostfile -l ./twocpt_population sample ...
Here -bind-to core asks mpiexec to run one process per "physical" core instead of per vCPU/thread, -f hostfile makes it use the nodes specified in the hostfile, and -l helps debugging by tagging each output line with the rank of the process that produced it. The number of prescribed processes should be no more than the total number of cores in the cluster, i.e. in the above example $N \times \text{core-per-node} \ge 32$.
In general, as the number of cluster nodes increases, communication latency and other cluster overhead catch up and the speedup tapers off. Also, a small number of big nodes (nodes with more cores each) is likely to be faster than a large number of small nodes. It may take several iterations to find the optimal configuration.
Here's another approach to using Torsten with MPI on a Metworx cluster. The attached example uses future/furrr with Grid Engine to distribute the chains over the required number of compute nodes, while Torsten's MPI capabilities deal with solving the ODEs in parallel within each compute node.
In this case you don't need to pre-specify a fixed size cluster. This approach takes advantage of Metworx's auto-scaling capabilities. I.e., it will automatically launch the required number of compute nodes and then shut them down when they're no longer needed.
The attached zip file unzips into a directory named mpi_example1 containing an R script (mpi_example1.R), a Stan model (mpi_example1.stan), a data file (analysis3.csv) and a few other files. The subdirectory results contains some diagnostic plots generated by the example. Comments at the beginning of mpi_example1.R provide some instructions. Let me know if you need more information.
The attached plot ppkexpo4_mpi_run_time_summary001.pdf shows how the run time decreases as the number of cores per chain increases on a Metworx instance with an 8 vCPU / 21 GB master node and 32 vCPU / 128 GB compute nodes.
One kludgy bit about the example is that it requires at least one compute node to be available before it will work. If no compute node is running, I launch one using the qtouch function defined in mpi_example1.R. You'll have to wait a few minutes for the node to come up. At some point I need to work out an automatic way to do this.
Thanks @yizhang-yiz!
I was able to create the hostfile and run, but the run basically pauses once warmup is done and I have to ctrl-C out. I'm running the following command (note that I limited the tree depth to remove that confounder and to make testing faster):
/home/apourzan/mpich-install/bin/mpiexec -n 4 -bind-to core -f hostfile -l ./twocpt_population sample num_chains=1 num_samples=500 num_warmup=500 algorithm=hmc engine=nuts max_depth=1 data file=twocpt_population.data.R init=twocpt_population.init.R output refresh=10
Then once I ctrl-C when it hangs at 50% after warmup, I get the following errors for each worker:
[proxy:0:0@ip-10-242-140-114] HYD_pmcd_pmip_control_cmd_cb (/home/apourzan/mpich-4.0.2/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:0@ip-10-242-140-114] HYDT_dmxu_poll_wait_for_event (/home/apourzan/mpich-4.0.2/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@ip-10-242-140-114] main (/home/apourzan/mpich-4.0.2/src/pm/hydra/pm/pmiserv/pmip.c:169): demux engine error waiting for event
The main node has 2 vCPUs and there are 4 worker nodes with 2 vCPUs each. Any idea what could be going on?
What happens if you use “num_chains=4” instead of 1?
edit: I meant to ask what if “mpiexec -n 1” is used with “num_chains=1”, does it still hang?
Here's another approach to using Torsten with MPI on a Metworx cluster. The attached example uses future/furrr with Grid Engine to distribute the chains over the required number of compute nodes while Torsten's MPI capabilities deal with solving the ODEs in parallel within each compute node.
Thanks @billgillespie! I'll try to work through this. So this method essentially sends each chain to a worker node? Since the maximum number of vCPUs per node is 96, does that limit the method to 96/2 cores per chain? The auto-scaling is attractive though. Does that mean I could have a small master node with, say, only 8 vCPUs, and the 96 vCPU workers would be spun up as needed, or does the master also need to have that many vCPUs?
What happens if you use “num_chains=4” instead of 1?
edit: I meant to ask what if “mpiexec -n 1” is used with “num_chains=1”, does it still hang?
mpiexec -n 1 works. Any ideas?
So this method essentially send each chain to a worker node?
That depends on the cores requested per chain. For example if I specify 16 cores per chain and use 16 core (32 vCPU) compute nodes, then each chain gets its own compute node. On the other hand if I use 32 core nodes, then 2 chains run on each compute node.
Since the maximum number of vCPUs per node is 96 does that limit the method to 96/2 cores per chain?
Yes.
Does that mean I could have a small master node with say only 8 vCPU then the 96 vCPU workers will be spun up as needed or does the master also need to have that many vCPU's?
The former. In fact the master node just needs enough RAM to handle what the compute nodes return. A 2 vCPU master node is fine as long as it has enough RAM---which it probably doesn't with the available Metworx instances.
Not exactly the same run but
/data/mpich-install/bin/mpiexec -n 4 -f hostfile -bind-to core -l ./twocpt_population sample num_samples=10 num_warmup=500 num_chains=1 data file=./twocpt_population.data.R init=./twocpt_population.init.R random seed=1234 output refresh=10
did not hang. Could you try the above run? Also, please check that the hostfile contains the cluster nodes' addresses. In my case, I have
$ cat hostfile
ip-10-31-18-160.ec2.internal
ip-10-31-19-18.ec2.internal
ip-10-31-28-213.ec2.internal
ip-10-31-28-85.ec2.internal
Thanks @yizhang-yiz! I didn't change anything, but I booted up a new cluster and it's no longer hanging. My main node has 2 vCPUs and 16GB RAM and I have 3 worker nodes with 2 vCPUs / 16GB RAM each. I get about a 2x speedup going from -n 1 to -n 4.
By the way should the main node be in the hostfile? For me it is not. I only have the 3 worker nodes in there.
The hostfile should only contain the nodes you'd like to use for the Stan run, in most cases the slave nodes. If you have 3 slave nodes, a -n 4 run will distribute the population unevenly. With a population of 8, nodes 1, 2, 3 will be working on subjects (1,2,3), (4,5,6), (7,8), respectively.
In general, the following numbers affect population model scaling:

- n_subject: number of subjects in the _group solver run
- n_cores: number of (physical) cores in the computing cluster
- n_proc: number of MPI processes, i.e., the argument to -n

Assuming subjects have similar computing cost, we usually want n_proc $\le$ n_cores, and n_subject divisible by n_proc. In my hostfile I have 4 one-core worker nodes, and -n 4 implies n_proc=4, thus 8 subjects are distributed to the 4 nodes evenly, with each node working on 2 subjects. Naturally this rule of thumb becomes irrelevant as n_subject/n_proc $\rightarrow \infty$ and each subject's cost becomes minuscule compared to the total cost.
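As a sanity check of the arithmetic, here is a small sketch of the uneven split described earlier (8 subjects over 3 nodes). It assumes a standard block distribution: floor(n/p) subjects per process, with the first n mod p ranks taking one extra, which reproduces the (1,2,3), (4,5,6), (7,8) example:

```shell
# Sketch: block distribution of n_subject subjects over n_proc
# processes -- base = floor(n/p) each, with the first (n mod p) ranks
# taking one extra subject. Assumes MPICH-style contiguous blocks.
n_subject=8
n_proc=3
base=$((n_subject / n_proc))
extra=$((n_subject % n_proc))
rank=0
while [ "$rank" -lt "$n_proc" ]; do
  if [ "$rank" -lt "$extra" ]; then cnt=$((base + 1)); else cnt=$base; fi
  echo "rank $rank handles $cnt subjects"
  rank=$((rank + 1))
done
```

For n_subject=8 and n_proc=3 this prints 3, 3, 2 subjects for ranks 0, 1, 2, matching the example; with n_proc=4 the split becomes an even 2 each.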
Thanks again @yizhang-yiz for bearing with me. This has been really informative. A couple last questions:
The hostfile should only contain the nodes you'd like to use for the Stan run, in most cases the slave nodes. If you have 3 slave nodes, a -n 4 run will distribute the population unevenly.
Oh, so I can have my master node have fewer vCPUs since I'm only using the slave nodes for the Stan run? Also, can I manually scale the number of slave nodes up and down as long as I recreate the hostfile with qconf -sel > hostfile?
So just to make sure I have the rule right regarding n_proc <= n_cores: say I have 4 worker nodes with 2 physical cores each (4 vCPUs each). In that case n_cores = 4*2 = 8, right? So would I then set -n 8? If I kept it at -n 4, would that not fully utilize both of the physical cores on each worker node?
Oh, so I can have my master node have fewer vCPUs since I'm only using the slave nodes for the Stan run?
yes
Also can I manually scale up and down the number of slave nodes as long as I recreate the host file with qconf -sel > hostfile?
yes
In that case n_cores = 4*2 = 8 right?
yes
So would I then set -n 8? If I kept it at -n 4 would that not fully utilize both of the physical cores on each worker node?
Setting -n 8 is not necessarily more efficient than -n 4; it also depends on the model. But setting n > 8 will almost surely degrade performance. In addition, 4 worker nodes with 2 cores each are possibly less efficient than a single worker node with 8 cores. For PKPD models, I'd say "fewer large nodes" is likely better than "more small nodes".
Thanks again @yizhang-yiz!
@yizhang-yiz I've been playing with this more and I had a question. I'm now able to run with cmdstanr using the following command:
fit <-
model$sample_mpi(mpi_args = list("n" = 8, "bind-to" = "core", "f" = "hostfile"),
output_dir = file.dir,
data = file.path(file.dir, "twocpt_population.data.R"),
init = file.path(file.dir, "twocpt_population.init.R"),
seed = 123,
chains = 2,
iter_warmup = 50,
iter_sampling = 50,
refresh = 10,
max_treedepth = 1)
I notice that with chains = 2 my chains just run sequentially, since sample_mpi() has no parallel_chains argument. Is there a workaround to get the chains to run in parallel? Say I'm running with 4 worker nodes that each have 4 cores. In that case I should have 16 cores total, so shouldn't I be able to run 2 chains in parallel that each split the work amongst 8 cores?
@yizhang-yiz I thought about this, and one possible workaround is to just take @billgillespie's approach: do a separate sample_mpi() call for each chain, then unite the CSVs with fit <- as_cmdstan_fit(here(resultsDir, paste0(modelName, 1:nChains, "-1.csv"))) like he did. In theory you should still be able to assign multiple worker nodes to each chain by using a different hostfile for each. I'm going to try this out. Although I'm not sure how to also tie that in with autoscaling on Metworx; I still have to take a closer look at @billgillespie's approach to see how that works.
Unfortunately Bill's approach is probably the best here. Note that the sample_mpi function is not designed for within-chain MPI use like we've discussed. Instead, the function and its "chains=" argument are for the experimental cross-chain warmup algorithm, in which all the chains communicate with each other during warmup (hence no parallel_chains argument, as the chains must be run in parallel), and the function gets confused when used with the TORSTEN_MPI switch.
When you use sample_mpi(..., chains=1), the function is just a wrapper around what you did in cmdstan with the mpiexec command. I suppose one can use purrr::map + cmdstanr::sample_mpi to slightly improve ergonomics, but one must pay attention to the hostfile passed to each MPI run, as using the same hostfile in multiple MPI runs doesn't necessarily distribute the runs evenly across the available nodes. In our context of a metworx slave cluster with fixed IPs, a hacky solution is to pass different hostfiles to different sample_mpi calls.
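A sketch of the hostfile bookkeeping that implies: split one cluster-wide hostfile into disjoint per-chain hostfiles, one per parallel run. The node names are hypothetical, and the per-chain mpiexec commands are shown only as comments:

```shell
# Split one cluster-wide hostfile into disjoint per-chain hostfiles so
# that each mpiexec/sample_mpi run gets its own nodes. The node names
# below are hypothetical placeholders.
cat > hostfile <<'EOF'
ip-10-0-0-1.ec2.internal
ip-10-0-0-2.ec2.internal
ip-10-0-0-3.ec2.internal
ip-10-0-0-4.ec2.internal
EOF
n_chains=2
nodes_per_chain=$(( $(wc -l < hostfile) / n_chains ))
# POSIX 'split -l' writes hostfile_aa, hostfile_ab, ... one per chain.
split -l "$nodes_per_chain" hostfile hostfile_

# Each chain would then use its own file, e.g.:
#   mpiexec -n 2 -f hostfile_aa ./twocpt_population sample ... &
#   mpiexec -n 2 -f hostfile_ab ./twocpt_population sample ... &
#   wait
ls hostfile_*
```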
I'm currently running Torsten v0.90.0 on Metrum's Metworx platform on a cluster, but I'm not sure if MPI is working. I've added the following to Torsten/cmdstan/make/local.
I'm able to compile and run the twocpt_population.stan example using mod$sample_mpi() in cmdstanr, but I don't get any speedup from increasing n in the mpi_args = list("n" = 1) argument. Furthermore, when I set n=8 I get the following error. Any ideas what could be going wrong? @yizhang-yiz