MPI Support - Githubissues

cjosey commented 7 years ago

This is early in development, so this one might be a week or two before I even consider merging it.

This PR replaces the concurrent futures stuff with domain decomposition via MPI. The burnable cells are split equally over processes, as are the non-burnable ones. The only intra-rank communication is:

Initial conditions / materials to burn / nuclides in system
XML fragments
power
relative error in ATS

On the note of XML fragments, I construct the XML files separately on each node, then MPI send the string, then stream each to disk. It's a bit hacky, but it helps with RAM too (I think).

This setup basically leaves the user with the option of single threaded or MPI (and MPI must be installed anyways). I believe this is fine, due to Python having no shared memory parallelism. There is no performance benefit to not using MPI under any circumstances.

In addition, I cease storing the matrices used in the EL part of the algorithm, instead constructing them whenever they are needed. This does make the algorithm slower (and more optimization will need to be focused on it), but it massively reduces RAM usage. I don't yet have exact numbers, but it looks between 10 and 15x reduction. A 72000 cell, 3000 nuclide simulation was using only 3 GB by the time OpenMC started, so such a problem should be around 5 GB total if I've done this right.

I do want to warn that this is only the second time I've written an MPI program. The first time was a pretty terrible spectral elements code.

Currently broken:

C1 continuous interpolation (which I am pretty sure the interpolation was useless anyways and might be removed soon)
HDF5 1.10 support (I have no idea why, might just be my compiler/OpenMPI combo)

Permanently broken:

PHDF5 does not support compression / filters

This minimally impacts PR #29, mostly since I deleted a few blocks of code that got changed.

To test, h5py has to be compiled with PHDF5 support. HDF5 1.8.18 works so far.

Finally, all tests passed, run in single threaded or MPI mode. More tests will have to be run before I'm sure.

cjosey commented 7 years ago

This PR will be delayed a bit longer due to the complexities of inter-program MPI. Currently, at least with OpenMPI, I cannot directly run nested MPI tasks (via os.call).

The correct way to get around this is to use MPI_Comm_spawn. Here, I can directly spawn OpenMC as a child process and (if OpenMC were to support it) directly communicate via MPI. Through strategic setting of the MPI_Info class, I can also set it up so that opendeplete can run with different MPI settings than OpenMC (for example, using --npernode 1). I cannot find any documentation for which values in MPI_Info MPICH supports.

This does create an awkward usability problem though. My python script should be using all available slots on a cluster. If I do so, spawning OpenMC will fail unless I use mpiexec --oversubscribe. This may conflict with some PBS/Slurm implementations.

I guess this is fine as long as it is documented, but thoughts?

cjosey commented 7 years ago

I just replaced everything with MPI_Comm_spawn. From a usability standpoint, this is a bit of a mess, as the recommended command for executing the code is now (assuming OpenMPI):

mpiexec --bind-to none --oversubscribe --map-by ppr:<threads per node>:node

This assumes OpenMC is compiled with both OpenMP and MPI support. It will fail without oversubscribe and performance will be abysmal without --bind-to none as OpenMP will not work.

As long as the MPI ranks does not exceed 2, the test code still gives the exact same results. Beyond that, numerical error in reaction rates becomes an issue. I will address this by rounding off reaction rates to 8 digits, similar to number density.

I'm going to add the following features before I close this PR:

Update tests with roundoffs of reaction rates.
Remove interpolation (current analysis indicates that this form of interpolation is useless).
Possibly replace the current sequential IO materials.xml write out with MPI-IO. I have little experience with this though.
Rework the specification of the object that a user passes OpenDeplete.

On the last note, I am planning on accepting as an initial condition a geometry in which cells are filled and not materials. Volume specified will be the sum volume of all cells. My code will automatically break this into distribcells. This is not very flexible, but it has the smallest memory footprint without undue complexity. Once this is done, the maximum simulation size will greatly increase, to the point where the lack of domain decomposition in OpenMC will be the limiter.

Concerns:

I am not happy checking if tallies.out is written to see if OpenMC closed. As far as I can tell, there is no way to see if a spawned MPI task terminated without said task communicating through the Intercomm handed to it.

Future PR plans:

Rework how reaction rates are stored to allow for easy computation of power.
Add metastable branching ratios.

paulromano commented 7 years ago

So I'm just learning about MPI process spawning now, inspired by this PR. One thing I'm wondering -- shouldn't we just be able to spawn one child process per python process? For example, if you had two nodes, each with an eight-core processor, we could start up opendeplete with

mpiexec -n 4 python ...

Spawn one OpenMC process per python process (which would then internally use 8 OpenMP threads), and then between OpenMC runs, the python process could distribute tasks (Bateman solves) using concurrent.futures or whatever.

cjosey commented 7 years ago

I think I could get that to work. During the initialization, I split off a new communicator that only communicates with 1 thread per node, and terminate all the other ones. Then spawning OpenMC wouldn't require oversubscription.

The only thing is that I don't know how to get concurrent futures to consistently use as little RAM as possible, as it's not shared memory. I'm still not 100% sure what gets allocated and what gets passed whenever I use it.

paulromano commented 7 years ago

After digging a little bit, I learned a few things that I thought I'd share. So concurrent.futures.ProcessPoolExecutor uses the multiprocessing standard-library module under the hood. By default, multiprocessing relies on the fork system call to create new processes. This means that the new process will have a complete copy of the virtual memory space of the originating process. However, in Unix-based OSs, the way it is implemented is by using copy-on-write, which means that no extra physical memory is used until one of processes actually writes to an area of memory. Interestingly it's also possible to change the method in which processes are started; namely one can specify that they be started by spawning entirely new python interpreters that don't inherent resources or through something called a fork server (see here).

cjosey commented 7 years ago

Hmm, with copy-on-write, so long as I avoid accidentally writing there won't be an issue. In fact, deduplicating the Python standard library, OpenMC api, initial conditions, and bookkeeping objects should actually substantially reduce memory consumption (on the SMR on Falcon problem the savings could be up to 60%).

The only downside to this change would be an uptick in the complexity of the integrators, which is problematic as the next step in my research was to implement 20 of them and test them. However, my policy has always been to minimize RAM usage, so I'll take a crack at it.

Do you happen to know how to "close" MPI ranks such that they are freed for use in OpenMC?

cjosey commented 7 years ago

Unfortunately, it does not look like it works (but it did give me some time to do some nice refactoring of the integrator code).

The issue is that unless the results components (x, rates_array, operator) are global in scope, I need to pass them to the function executor.map runs on. Unfortunately, this is not very practical. See the following examples: 1, 2, 3, 4

The first one uses 809 MB of RAM. The second one fails as it can't pickle a lambda (though I have seen that this may not necessarily be the case, but it is extremely inconsistent). The third one I ran out of RAM running. The fourth one used 832 MB of RAM, but ran slower than 1.

paulromano commented 7 years ago

Not sure what you mean by closing MPI ranks -- can you elaborate?

cjosey commented 7 years ago

What I mean is that if I run mpiexec --map-by ppr:2:node python3 <code>, Python3 gets all available resources (2 ranks per node). Now say I want to free one of those ranks per node such that OpenMC can get it without oversubscription. How would I do that?

paulromano commented 7 years ago

Ah, ok, got it. Yeah, I don't know how you'd do that without oversubscription. Oversubscription is not really a performance issue since on the Python side we just wait for OpenMC to finish, but I guess it can cause issues with batch schedulers. Is that what you've experienced?

cjosey commented 7 years ago

Well, in my experience, all the clusters I've run on (Falcon / Kilkenny) were ok with oversubscription, but that may partially be due to always requesting whole nodes. I just thought it'd be nice to reduce the user-facing complexity. I'm still going to make some effort to try to convert to single python rank per node, but without using globals it does not look promising.

cjosey commented 7 years ago

I'm going to merge it for the time being, as I cannot see any efficient approach to the problem but I'm otherwise happy with this PR. If I do figure something out, it will be a relatively minor PR later on.

mit-crpg / opendeplete

MPI Support #30