aryaamgen opened 2 years ago
Any speedup with “n=2” and “chains=1”?
Nope. It takes about the same time whether n = 1, 2, or 4, and I get that error for n=8.
Actually, I just realized the worker nodes I was using only had 4GB of RAM each. Could that be a problem? If I recall correctly, my master node had 4 cores, so maybe that's why the error is happening when I go from n=4 to n=8. Is there a good way to check if my MPI is working? Maybe trying MPI with non-Torsten Stan? I can also see if Metrum has documentation on using MPI on Metworx. I just want to rule out whether it's a Torsten-specific thing or not.
Let me know if you need more information about the setup and environment.
Allow me to first test the example on my local build. There have been a few occasions where the same error was seen but went away when I switched to a different MPI build.
Here's what I found.
After using a freshly installed MPI library (MPICH), I was able to build and run the model with number of processes = 1, 2, 4, and got a speedup of 1.6x (n=2) and 2.7x (n=4). I was running it using cmdstan:
mpiexec -n 4 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.data.R random seed=3289
I have no access to metworx so wasn't able to test n=8.
Since you didn't see any speedup, I suspect the running script was not working properly. How was cmdstanr getting information such as the addresses of the worker nodes? Have you tried to run it on the master node?
How was cmdstanr getting information such as the address of the worker nodes?
Not sure. What do you mean? I also got the same error using cmdstan in the terminal. I'll try running on a master node and report back.
Is it also possible Metworx has an old incompatible version of MPICH? I noticed the instructions here mention to set CXXFLAGS += -isystem /usr/local/mpich3/include if on Metworx, but on my Metworx machine I don't see an include directory under /usr/local/mpich3. I contacted Metworx support about it and I'm waiting to hear back.
What do you mean? I also got the same error using cmdstan in the terminal.
When running mpiexec, whether inside cmdstanr or cmdstan, one needs to supply information on how worker nodes can be located. This can be done using the hostfile option of Open MPI, or through a submission script. This may not be easy to set up for users who are not familiar with cluster submission.
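For concreteness, here is a minimal sketch of such a hostfile and how it would be passed to mpiexec. The node addresses are hypothetical placeholders, not real cluster nodes:

```shell
# Write a toy hostfile; on a real cluster these would be the worker
# nodes' addresses (the names below are hypothetical).
cat > hostfile <<'EOF'
ip-10-0-0-1.ec2.internal
ip-10-0-0-2.ec2.internal
EOF

# MPICH and Open MPI take the host list with different flags:
#   MPICH:    mpiexec -f hostfile -n 4 ./twocpt_population sample ...
#   Open MPI: mpiexec --hostfile hostfile -n 4 ./twocpt_population sample ...
echo "hostfile lists $(wc -l < hostfile) worker nodes"
```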
@billgillespie maybe you and someone in metworx team can help @aryaamgen ? We have some scripts to generate hostfile and metrum's aws specialist may be able to automate that even more.
Is it also possible Metworx has an old incompatible version of MPICH?
Yes, metworx can provide previous snapshots. But at this point I'm not sure the library is the cause here (well, except that Open MPI's error message is not very helpful). We can take this path if the issue is reproduced on the master node and submission to worker nodes is fixed.
Ok, I tried running again on a 4 core master node and did get about a 2x speedup going from 1 to 2 nodes, but then no speedup at 4. I was also able to try 7 nodes; that worked but was actually slower (I assume because communication with the worker nodes is slow). But for some reason 8 nodes still gives that same error.
When running mpiexec, whether inside cmdstanr or cmdstan, one needs to supply information on how worker nodes can be located. This can be done using the hostfile option of Open MPI, or through a submission script. This may not be easy to set up for users who are not familiar with cluster submission.
Ah ok, I see. Yeah, I did not know about hostfiles at all. So I guess Metworx is somehow generating hostfiles automatically? I wonder if there's a way to see and edit those.
4 core master node
Is that 4 physical cores or 4 vCPUs? You can check that at metworx UI (master node information). If it's vCPU then what's provisioned is actually 4 threads, which would not scale at all in current MPI implementation.
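One way to check this from the shell as well, as a sketch; it assumes a Linux image where lscpu is available (as on standard Metworx instances):

```shell
# Logical CPUs (vCPUs / threads) vs. physical cores on a Linux box.
logical=$(nproc)
# 'lscpu -p=Core,Socket' prints one "core,socket" line per logical CPU;
# counting the unique pairs gives the number of physical cores.
physical=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
echo "logical CPUs:   $logical"
echo "physical cores: $physical"
# logical > physical means SMT/hyperthreading is on, and MPI processes
# beyond the physical count will contend for the same cores.
```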
Ah ok, interesting. Yes, it's 4 vCPUs. Does that mean that for a single multicore machine I'd be better off using reduce_sum() instead of MPI?
So far I've been using a 96 vCPU machine on Metworx and just wrapping my calls to pmx_solve_twocpt_rk45 in a reduce_sum call with a for loop so it's parallelized that way. The motivation for switching to Torsten's group integrator is that:
- I could keep the solutions in transformed parameters instead of having to re-solve them again in generated quantities
- It would allow me to scale past the 96 vCPU limit by moving to a cluster and adding nodes
Yes it's 4 vCPUs.
Then the performance will almost surely degrade when n=4.
Does that mean that for a single multicore machine I'd be better off using reduce_sum() instead of MPI?
Probably but it also depends on specific model.
It would allow me to scale past the 96 vCPU limit moving to a cluster and adding nodes
This is the exact motivation of using MPI.
I confirmed that n >= 8 fails on Metworx using the default MPICH setup. It happened with 16 and 32 vCPU instances, so it's not due to limits on physical cores. I need to learn a bit more about MPICH to get around it---or switch to OpenMPI.
I'm almost sure this is the same problem we've run into previously: metworx's vanilla installation fails but a manually built MPICH doesn't. I just tested on my local machine, which has only 4 cores, and n=8, 16, 32 all run well. I'll see if this behavior can be reproduced on metworx later today.
Interesting. Should I try to manually build MPICH myself on the Metworx cluster or would that be too difficult if I don't have much experience with cluster submission? Did you follow these directions?
I was also able to confirm what @billgillespie saw. The following is probably the simplest workaround (on a master-node-only Metworx instance; we can work out the worker-node details after clearing this).
sudo apt update
sudo apt -y install openmpi-bin
which installs openmpi's version of mpiexec. This can be confirmed by running
mpiexec --version
The output should be
mpiexec (OpenRTE) 2.1.1
Now
mpiexec -n 8 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.init.R
should not crash (neither should n > 8).
Let me know if it works on your end.
Thanks Yi! I tried the above directions, installing openmpi on an instance with 8 vCPUs and no worker nodes, just the master. I no longer get an error with n=8. However, I don't get any speedup going from 1 to 2, 2 to 4, or 4 to 8. Actually, at 8 it takes twice as long.
By the way, should I still have the following in the make/local file?
CXXFLAGS += -isystem /usr/local/include
CXXFLAGS += -isystem /usr/local/mpich3/include
However, I don't get any speedup going from 1 to 2 or 2 to 4 or 4 to 8
This is expected, because your binary twocpt_population was compiled using the original MPICH library. But now we can confirm that the cause is the original mpiexec.
There are two paths we can follow from this point: we can help you install a fresh MPI library, or we can wait for the metworx team's fix. If you are comfortable following the MPICH installation guide I'd say go for it. If your team runs multiple instances of metworx, it's probably a good idea to wait for the patch.
I installed it following the install guide here, although I skipped steps 9 and 10 since I'm just running on a single machine. I'm now able to see a speedup going from n=1 to n=2 and to n=4. But I get the error again when I go up to n=8.
You need to ensure you are using the newly built MPICH instead of the system version. I just did the following:

1. Install MPICH to a dedicated location, e.g. /data/mpich-install, using the prefix= option during configure (see manual).
2. Point make/local to the new build:
TORSTEN_MPI=1
TBB_CXX_TYPE=gcc
CXX=/data/mpich-install/bin/mpicxx
CC=/data/mpich-install/bin/mpicc
CXXFLAGS += -isystem /data/mpich-install/include
3. Rebuild the model:
make clean-all && make -j4 ../example-models/twocpt_population/twocpt_population
4. Run with the new mpiexec:
/data/mpich-install/bin/mpiexec -n 8 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.init.R
Now I am able to run n=1,2,4 with speedup and n>8 without any issue.
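One pitfall worth guarding against here: after a from-source install, a bare mpiexec on the PATH may still resolve to the system binary. A small sanity-check sketch (the prefix defaults to the /data/mpich-install path used above; MPICH_PREFIX is a hypothetical override variable):

```shell
# Check whether a locally built MPI would shadow the system mpiexec.
# Pure path logic, so it is safe to run even with nothing installed.
PREFIX="${MPICH_PREFIX:-/data/mpich-install}"
local_mpiexec="$PREFIX/bin/mpiexec"
system_mpiexec="$(command -v mpiexec || true)"

if [ -x "$local_mpiexec" ]; then
  echo "local build found: $local_mpiexec"
else
  echo "no local build at $local_mpiexec"
fi
if [ -n "$system_mpiexec" ] && [ "$system_mpiexec" != "$local_mpiexec" ]; then
  echo "note: a bare 'mpiexec' resolves to $system_mpiexec"
fi
```

If the last line prints, invoke the new binary by its full path (as in step 4 above) rather than relying on the PATH.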
Thanks @yizhang-yiz . That worked. I'm now able to run with n>8 and I see the speedup on the master node. Not sure how to set up the worker nodes to be able to communicate with the master. Do you know how to do that on Metworx?
Let me write out the procedure and test it on metworx before posting here.
@yizhang-yiz any update on this? Was starting to try to get together some stuff for ACoP and would be nice to get a cluster working on Metworx.
Sorry for the delay. I was only able to look into this on and off. Let me take another look this weekend.
You can follow the steps below:

1. Make sure the number of worker nodes N in the cluster is fixed.
2. Generate a hostfile from Grid Engine's node list: qconf -sel > hostfile
3. Run, e.g.: /data/mpich-install/bin/mpiexec -n 32 -bind-to core -f hostfile -l ./twocpt_population sample ...
Here -bind-to core asks mpiexec to run one process per "physical" core instead of per vCPU/thread, -f hostfile makes it use the nodes specified in the hostfile, and -l helps debugging by tagging each output line with the rank of the process that produced it. The number of prescribed processes should be no more than the total number of cores in the cluster, i.e. in the above example $N \times \text{core-per-node} \ge 32$.
In general, as the number of cluster nodes increases, communication latency and other cluster overhead catch up and the speedup tapers off. Also, a small number of big nodes (nodes with more cores each) is likely to be faster than a large number of small nodes. It may take several iterations to find the optimal configuration.
Here's another approach to using Torsten with MPI on a Metworx cluster. The attached example uses future/furrr with Grid Engine to distribute the chains over the required number of compute nodes, while Torsten's MPI capabilities deal with solving the ODEs in parallel within each compute node.
In this case you don't need to pre-specify a fixed size cluster. This approach takes advantage of Metworx's auto-scaling capabilities. I.e., it will automatically launch the required number of compute nodes and then shut them down when they're no longer needed.
The attached zip file unzips into a directory named mpi_example1 containing an R script (mpi_example1.R), a Stan model (mpi_example1.stan), a data file (analysis3.csv) and a few other files. The subdirectory results contains some diagnostic plots generated by the example. Comments at the beginning of mpi_example1.R provide some instructions. Let me know if you need more information.
The attached plot ppkexpo4_mpi_run_time_summary001.pdf shows how the run time decreases as the number of cores per chain increases on a Metworx instance with an 8 vCPU / 21 GB master node and 32 vCPU / 128 GB compute nodes.
One kludgy bit about the example is that it requires at least one compute node to be available before it will work. If no compute node is running, I launch one using the qtouch function defined in mpi_example1.R. You'll have to wait a few minutes for the node to come up. At some point I need to work out an automatic way to do this.
Thanks @yizhang-yiz!
I was able to create the hostfile and run, but the run basically pauses once warmup is done and I have to ctrl-C out. I'm running the following command (note that I limited the tree depth to remove that confounder and to make testing faster):
/home/apourzan/mpich-install/bin/mpiexec -n 4 -bind-to core -f hostfile -l ./twocpt_population sample num_chains=1 num_samples=500 num_warmup=500 algorithm=hmc engine=nuts max_depth=1 data file=twocpt_population.data.R init=twocpt_population.init.R output refresh=10
Then once I ctrl-C when it hangs at 50% after warmup, I get the following errors for each worker:
[proxy:0:0@ip-10-242-140-114] HYD_pmcd_pmip_control_cmd_cb (/home/apourzan/mpich-4.0.2/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:0@ip-10-242-140-114] HYDT_dmxu_poll_wait_for_event (/home/apourzan/mpich-4.0.2/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@ip-10-242-140-114] main (/home/apourzan/mpich-4.0.2/src/pm/hydra/pm/pmiserv/pmip.c:169): demux engine error waiting for event
The main node has 2 vCPUs and there are 4 worker nodes with 2 vCPUs each. Any idea what could be going on?
What happens if you use “num_chains=4” instead of 1?
edit: I meant to ask what if “mpiexec -n 1” is used with “num_chains=1”, does it still hang?
Here's another approach to using Torsten with MPI on a Metworx cluster. The attached example uses future/furrr with Grid Engine to distribute the chains over the required number of compute nodes while Torsten's MPI capabilities deal with solving the ODEs in parallel within each compute node.
Thanks @billgillespie! I'll try to work through this. So this method essentially sends each chain to a worker node? Since the maximum number of vCPUs per node is 96, does that limit the method to 96/2 cores per chain? The auto-scaling is attractive though. Does that mean I could have a small master node with, say, only 8 vCPUs, and the 96 vCPU workers would be spun up as needed, or does the master also need to have that many vCPUs?
What happens if you use “num_chains=4” instead of 1?
edit: I meant to ask what if “mpiexec -n 1” is used with “num_chains=1”, does it still hang?
mpiexec -n 1 works. Any ideas?
So this method essentially send each chain to a worker node?
That depends on the cores requested per chain. For example if I specify 16 cores per chain and use 16 core (32 vCPU) compute nodes, then each chain gets its own compute node. On the other hand if I use 32 core nodes, then 2 chains run on each compute node.
Since the maximum number of vCPUs per node is 96 does that limit the method to 96/2 cores per chain?
Yes.
Does that mean I could have a small master node with say only 8 vCPU then the 96 vCPU workers will be spun up as needed or does the master also need to have that many vCPU's?
The former. In fact the master node just needs enough RAM to handle what the compute nodes return. A 2 vCPU master node is fine as long as it has enough RAM---which it probably doesn't with the available Metworx instances.
Not exactly the same run but
/data/mpich-install/bin/mpiexec -n 4 -f hostfile -bind-to core -l ./twocpt_population sample num_samples=10 num_warmup=500 num_chains=1 data file=./twocpt_population.data.R init=./twocpt_population.init.R random seed=1234 output refresh=10
did not hang. Could you try the above run? Also, please check that the hostfile contains the cluster nodes' addresses. In my case, I have
$ cat hostfile
ip-10-31-18-160.ec2.internal
ip-10-31-19-18.ec2.internal
ip-10-31-28-213.ec2.internal
ip-10-31-28-85.ec2.internal
Thanks @yizhang-yiz! I didn't change anything, but I booted up a new cluster and it's no longer hanging. My main node has 2 vCPUs and 16GB RAM and I have 3 worker nodes with 2 vCPUs / 16GB RAM each. I get about a 2x speedup going from -n 1 to -n 4.
By the way should the main node be in the hostfile? For me it is not. I only have the 3 worker nodes in there.
The hostfile should only contain the nodes you'd like to use for the Stan run, in most cases the slave nodes. If you have 3 slave nodes, a -n 4 run will distribute the population unevenly. With a population of 8, nodes 1, 2, 3 will be working on subjects (1,2,3), (4,5,6), (7,8), respectively.
In general, the following numbers affect population model scaling:

- n_subject: number of subjects in the _group solver run
- n_cores: number of (physical) cores in the computing cluster
- n_proc: number of MPI processes, i.e., the argument to -n

Assuming subjects have similar computing cost, we usually want n_proc $\le$ n_cores, and n_subject divisible by n_proc. In my hostfile I have 4 one-core worker nodes, and -n 4 implies n_proc=4, thus 8 subjects are distributed to the 4 nodes evenly, with each node working on 2 subjects. Naturally this rule of thumb becomes irrelevant as n_subject/n_proc $\rightarrow \infty$ and each subject's cost becomes minuscule compared to the total cost.
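As a sanity check of the arithmetic, here is a small sketch of the uneven split described earlier (8 subjects over 3 nodes). It assumes a standard block distribution: floor(n/p) subjects per process, with the first n mod p ranks taking one extra, which reproduces the (1,2,3), (4,5,6), (7,8) example:

```shell
# Sketch: block distribution of n_subject subjects over n_proc
# processes -- base = floor(n/p) each, with the first (n mod p) ranks
# taking one extra subject. Assumes MPICH-style contiguous blocks.
n_subject=8
n_proc=3
base=$((n_subject / n_proc))
extra=$((n_subject % n_proc))
rank=0
while [ "$rank" -lt "$n_proc" ]; do
  if [ "$rank" -lt "$extra" ]; then cnt=$((base + 1)); else cnt=$base; fi
  echo "rank $rank handles $cnt subjects"
  rank=$((rank + 1))
done
```

For n_subject=8 and n_proc=3 this prints 3, 3, 2 subjects for ranks 0, 1, 2, matching the example; with n_proc=4 the split becomes an even 2 each.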
Thanks again @yizhang-yiz for bearing with me. This has been really informative. A couple last questions:
The hostfile should only contain the nodes you'd like to use for the Stan run, in most cases the slave nodes. If you have 3 slave nodes, a -n 4 run will distribute the population unevenly.
Oh, so I can have my master node have fewer vCPUs since I'm only using the slave nodes for the Stan run? Also, can I manually scale the number of slave nodes up and down as long as I recreate the hostfile with qconf -sel > hostfile?
So just to make sure I have the rule right regarding n_proc <= n_cores: say I have 4 worker nodes with 2 physical cores each (4 vCPUs each). In that case n_cores = 4*2 = 8, right? So would I then set -n 8? If I kept it at -n 4, would that not fully utilize both of the physical cores on each worker node?
Oh, so I can have my master node have fewer vCPUs since I'm only using the slave nodes for the Stan run?
yes
Also can I manually scale up and down the number of slave nodes as long as I recreate the host file with qconf -sel > hostfile?
yes
In that case n_cores = 4*2 = 8 right?
yes
So would I then set -n 8? If I kept it at -n 4 would that not fully utilize both of the physical cores on each worker node?
Setting -n 8 is not necessarily more efficient than -n 4; it also depends on the model. But setting n > 8 will almost surely degrade performance. In addition, 4 worker nodes with 2 cores each are possibly less efficient than a single worker node with 8 cores. For PKPD models, I'd say "fewer large nodes" is likely better than "more small nodes".
Thanks again @yizhang-yiz!
@yizhang-yiz I've been playing with this more and I had a question. I'm now able to run with cmdstanr using the following command:
fit <-
model$sample_mpi(mpi_args = list("n" = 8, "bind-to" = "core", "f" = "hostfile"),
output_dir = file.dir,
data = file.path(file.dir, "twocpt_population.data.R"),
init = file.path(file.dir, "twocpt_population.init.R"),
seed = 123,
chains = 2,
iter_warmup = 50,
iter_sampling = 50,
refresh = 10,
max_treedepth = 1)
I notice that with chains = 2 my chains just run sequentially, since sample_mpi() has no parallel_chains argument. Is there a workaround to get the chains to run in parallel? Say I'm running with 4 worker nodes that each have 4 cores. In that case I should have 16 cores total, so shouldn't I be able to run 2 chains in parallel that each split the work amongst 8 cores?
@yizhang-yiz I thought about this, and one possible workaround is to just take @billgillespie's approach: do a separate sample_mpi() call for each chain, then unite the CSVs with fit <- as_cmdstan_fit(here(resultsDir, paste0(modelName, 1:nChains, "-1.csv"))) like he did. In theory you should still be able to assign multiple worker nodes to each chain by using a different hostfile for each. I'm going to try this out. Although I'm not sure how to also tie that in with autoscaling on Metworx; I still have to take a closer look at @billgillespie's approach to see how that works.
Unfortunately Bill's approach is probably the best here. Note that the sample_mpi function is not designed for within-chain MPI use like we've discussed. Instead, the function and its "chains=" argument are for the experimental cross-chain warmup algorithm, in which all the chains communicate with each other during warmup (hence no parallel_chains argument, as the chains must be run in parallel), and the function gets confused when used with the TORSTEN_MPI switch.
When you use sample_mpi(..., chains=1), the function is just a wrapper around what you did in cmdstan with the mpiexec command. I suppose one can use purrr::map + cmdstanr::sample_mpi to slightly improve ergonomics, but one must pay attention to the hostfile passed to each MPI run, as using the same hostfile in multiple MPI runs doesn't necessarily distribute the runs evenly across the available nodes. In our context of a metworx slave cluster with fixed IPs, a hacky solution is to pass different hostfiles to different sample_mpi calls.
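A sketch of the hostfile bookkeeping that implies: split one cluster-wide hostfile into disjoint per-chain hostfiles, one per parallel run. The node names are hypothetical, and the per-chain mpiexec commands are shown only as comments:

```shell
# Split one cluster-wide hostfile into disjoint per-chain hostfiles so
# that each mpiexec/sample_mpi run gets its own nodes. The node names
# below are hypothetical placeholders.
cat > hostfile <<'EOF'
ip-10-0-0-1.ec2.internal
ip-10-0-0-2.ec2.internal
ip-10-0-0-3.ec2.internal
ip-10-0-0-4.ec2.internal
EOF
n_chains=2
nodes_per_chain=$(( $(wc -l < hostfile) / n_chains ))
# POSIX 'split -l' writes hostfile_aa, hostfile_ab, ... one per chain.
split -l "$nodes_per_chain" hostfile hostfile_

# Each chain would then use its own file, e.g.:
#   mpiexec -n 2 -f hostfile_aa ./twocpt_population sample ... &
#   mpiexec -n 2 -f hostfile_ab ./twocpt_population sample ... &
#   wait
ls hostfile_*
```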
I'm currently running Torsten v0.90.0 on Metrum's Metworx platform on a cluster, but I'm not sure if MPI is working. I've added the following to Torsten/cmdstan/make/local.
I'm able to compile and run the twocpt_population.stan example using mod$sample_mpi() in cmdstanr, but I don't get any speedup from increasing n in the mpi_args = list("n" = 1) argument. Furthermore, when I set n=8 I get the following error. Any ideas what could be going wrong? @yizhang-yiz