mit-crpg / opendeplete

A depletion framework for OpenMC
MIT License

In-memory coupling for depletion #39

Closed. paulromano closed this 6 years ago

paulromano commented 7 years ago

Here's the start of my work on in-memory depletion coupling using the Python bindings to OpenMC's C API. I'd still like to look deeper into memory use issues, namely:

  1. I'm using a multiprocessing.Pool to parallelize the burnup solve (functionally equivalent to what we had before with concurrent.futures). I'm pretty sure that the memory use from this is not too bad, but I want to do some more investigation to present a complete story.
  2. To update number densities in OpenMC at each iteration, the values for each process's set of materials are all-gathered. There may be a better way to do that.

Some notes on the implementation:

As I said, I'm still planning on doing more work to address memory concerns, but I just wanted to start this PR in case others want to take a look.

cjosey commented 7 years ago

I managed a short window to examine the code. First, this does a great job of cleaning things up. There were a lot of immensely complicated, horrible blobs that have now been removed. I'll miss the MPI-parallel materials.xml writer (I was proud of that).

There are a few things I'm concerned about.

I know it's not finalized/optimized, but:

number_list = comm.allgather(self.number)

My understanding is that the allgather operation serializes the object and then sends the serialized data, whereas the capitalized functions (Allgather) use MPI native types (and only work on numpy types) and should outperform them. Now, we're rarely communicating anything of significance (I think the other critical communication is power, which is 1 float per rank) except for this single block. So I wonder how this impacts performance.
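For reference, a minimal sketch of the two communication styles in mpi4py (array shape and names are made up; this assumes every rank's array has the same shape and dtype):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Stand-in for self.number on this rank.
local_number = np.random.rand(100, 20)

# Lowercase allgather: pickles the Python object on every rank and gathers
# the pickled copies; simple, but adds serialization overhead and
# temporary buffers.
number_list = comm.allgather(local_number)

# Uppercase Allgather: ships the raw buffer using native MPI datatypes,
# no pickling. Needs a contiguous array and a preallocated receive buffer.
recvbuf = np.empty((comm.size,) + local_number.shape, dtype=local_number.dtype)
comm.Allgather(local_number, recvbuf)
```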

My suggestion on this front would be to instead keep a list that basically maps "material ID" to "rank who owns it". Then, on communicating materials, loop on materials IDs, Bcast the slice (self.number.get_atom_density[mat, :]) with a root of the rank that owns it. If I coded this correctly (big if there), the nuclide order should be the same on all nodes, so the nuclides list can be made before the loop and reused. I'm not sure if this will be faster (how much overhead does MPI have?), but it'd use less memory.
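A rough sketch of that owner-rank broadcast idea (all names here are hypothetical; it assumes each rank can produce a contiguous 1-D density slice for the materials it owns and that the nuclide order is identical on all ranks):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def broadcast_densities(owner_of, local_slices, n_nuclides):
    """owner_of: dict mapping material ID -> owning rank.
    local_slices: dict mapping material ID -> 1-D density array,
    present only on the owning rank."""
    gathered = {}
    for mat in sorted(owner_of):
        root = owner_of[mat]
        if comm.rank == root:
            buf = np.ascontiguousarray(local_slices[mat], dtype=np.float64)
        else:
            buf = np.empty(n_nuclides, dtype=np.float64)
        # One Bcast per material, rooted at the owning rank; only a single
        # material's slice is in flight at a time, keeping peak memory small.
        comm.Bcast(buf, root=root)
        gathered[mat] = buf
    return gathered
```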

You mentioned that it works with any version of MPI or HDF5, parallel or not. To my understanding, if the mpio driver is used (which it is whenever MPI is used), then the HDF5 library has to be parallel HDF5 (PHDF5). Is that not the case?
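As a concrete check (file name is arbitrary): h5py only exposes the 'mpio' driver when it was built against a parallel HDF5 library, so a sketch like this fails on a serial build, which is the constraint in question.

```python
from mpi4py import MPI
import h5py

# Requires h5py built against parallel HDF5 (PHDF5); otherwise the
# 'mpio' driver is simply not available.
with h5py.File('check_mpio.h5', 'w', driver='mpio', comm=MPI.COMM_WORLD) as f:
    f.create_dataset('dummy', shape=(10,), dtype='f8')
```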

The final one is the reference solution for the test bench changed. I believe, in analyzing the code, that the reason it changed is the order of materials or nuclides has changed. At some point I'll try to get the old version of the code to duplicate this result, just to form an unbroken chain.

I still (as always) have concerns about memory, but I'll wait until you investigate that yourself.

cjosey commented 7 years ago

Alright, let's start the review.

I did some analysis of memory utilization. Here is a script I used to automatically monitor the memory usage of OpenDeplete: I make sure the command line contains the arbitrary string "zdrrdf", and every 0.1 seconds it records the PSS memory utilization of all components of the run. I did this for both current OpenDeplete and PR 39, on the exact same dataset and on the sample problem in scripts. For the original code, out of fairness, I calculated memory both with and without OpenMC, since PR 39 runs it as a library. This was all run with 1 MPI rank on a 4-core machine. The plot is shown below:
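A minimal sketch of that kind of monitor, assuming Linux and the psutil package (this is not the original script; the marker string and output file follow the description above):

```python
import time
import psutil

MARKER = 'zdrrdf'

def total_pss():
    """Sum PSS over all processes whose command line contains MARKER."""
    pss = 0
    for proc in psutil.process_iter(['cmdline']):
        try:
            if MARKER in ' '.join(proc.info['cmdline'] or []):
                pss += proc.memory_full_info().pss
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return pss

if __name__ == '__main__':
    with open('memory_log.txt', 'w') as log:
        while True:
            log.write('{:.2f} {}\n'.format(time.time(), total_pss()))
            log.flush()
            time.sleep(0.1)
```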

[plot plt1: PSS memory usage over time, current develop vs. PR 39]

There are some really odd artifacts. First, current develop takes far more RAM. It also appears to have some sort of memory leak causing growth of the Python memory utilization. PR 39 also has an apparent memory leak, but only later in the simulation and only during the matrix exponentials. I genuinely do not know if this is real, as it only appeared once. PR 39 on average uses substantially less RAM. Actual time comparisons are invalid, as I was using the computer for other things while I waited.

To further investigate this, I increased the number of regions in the model by a factor of 20 (and decreased neutrons by a factor of 10). I got bored waiting, so I terminated these after 5 minutes.

[plot plt2: memory usage over time with 20x more regions]

This indicates that the memory leak is problem size independent. The initial jump is, to my knowledge, due to the fact that so many nuclides are added in the first time step. However, it is concerning that it happens in the second simulation and not the first. I ran into a number of memory bugs in my private branch, so I'll see if these are the same.

The final test I ran was to force PR 39 to fork into more processes, to see whether memory utilization grew substantially with more workers. I did this by adjusting the Pool() calls in cecm.py. I also doubled the problem size to extrapolate the overhead. The resulting plot shows that memory usage grows with the number of processes; the growth is sometimes, but not always, linear.

[plot plt4: memory usage vs. number of processes]

So, overall, I see three issues with memory.

  1. Memory usage in the worker pool grows with both problem size and the number of processes.
  2. (For multi-node) the entire problem must exist on the head node during the XML write.
  3. (For multi-node) the entire problem is allgathered during _update_materials.

Once these are fixed, PR 39 should substantially outperform current develop.

Now, to comment on the actual code.

paulromano commented 6 years ago

@cjosey I've been working on expanding the C API in OpenMC; I actually have a version of this branch that handles all the tally data in memory (i.e., a tallies.xml is never written) based on those updates. I may try to get materials handled entirely in memory as well. As I make those changes, I'll also try to address the memory issues you've raised here. My first order of business is to get another PR into OpenMC that includes the capabilities we'll need here. I'll let you know when this is ready for another look.

paulromano commented 6 years ago

@cjosey This is ready for your review once again. It relies on mit-crpg/openmc#920, so I wouldn't advise merging this until that PR is merged. I think most of the issues we discussed have been resolved:

Regarding the two other memory issues you raised:

cjosey commented 6 years ago

I'll be able to thoroughly investigate this next week once I have a computer again.

There is one bug I do know exists and forgot to mention. Since the initial OpenMC materials are made using self.geometry and the initial tallies are made using self.number, if these two are discrepant (as they are when self.settings.dilute_initial is non-zero and the initial condition does not contain some nuclides), the simulation will crash because the tallies ask for non-existent nuclides.

As for the last two issues, I sort of disagree. In the first case, while OpenMC needs the entire expanded geometry, it is not necessary that OpenDeplete hold it as well, which is what doubles the memory. However, this is a discussion that has played out quite a bit.

For the second one, I strongly disagree. When we do the starmap, we take a tuple of rates. Let us assume that the memory has yet to be duplicated and the tuple contains pointers to numpy arrays. This is passed to a starmap. Now, each process should only touch the objects in the tuple that it needs, inducing a copy-on-write. That would cause a doubling of RAM, but not a linear increase with the number of processes. It appears that the entire tuple is being touched and copied. So, while the behavior should be not good but passable (a doubling), we're instead getting a linear increase with the number of processes.
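To make the two mechanisms under discussion concrete, here's a toy contrast (illustrative only, assuming the default fork start method on Linux): arguments passed through starmap are pickled and shipped to each worker, giving every process its own copy, whereas data inherited as a module-level global stays copy-on-write until a worker writes to it.

```python
import numpy as np
from multiprocessing import Pool

# Stand-in for the reaction-rate data (inherited by workers at fork time).
rates = np.random.rand(500, 200, 6)

def work_pickled(chunk, index):
    # 'chunk' arrived through starmap: it was pickled in the parent and
    # unpickled here, so this worker holds a private copy.
    return chunk.sum() + index

def work_global(index):
    # 'rates' is inherited through fork; its pages remain shared
    # (copy-on-write) as long as the worker only reads them.
    return rates[index].sum()

if __name__ == '__main__':
    with Pool(4) as pool:
        via_args = pool.starmap(work_pickled,
                                [(rates[i], i) for i in range(rates.shape[0])])
        via_globals = pool.map(work_global, range(rates.shape[0]))
```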

On a simple test problem (SMR ECP, ENDF/B-VII.1 full chain, on Falcon with 16 nodes, 24 threads per node, assuming no overhead or duplication, predictor algorithm), I computed the ideal minimum memory utilization. For the current code, that value is 1.8 GB per node. For the new code, it's 7.5 GB. If we got a mere doubling, it would be 2.1 GB. The quadrupling is a very hefty price to pay for simplification.

Regardless, I will have to test this on Falcon when I can actually get the chance. Odds are that with all the issues and additional overhead the current code has, it may still lose on memory to the new branch. However, it will be completely unusable on the Xeon Phi.

paulromano commented 6 years ago

I've noticed that bug too (re: dilute_initial). I'll look into getting that fixed.

I don't think I fully appreciated what you meant when you said the memory grows with number of processes. You're saying that (possibly due to the use of starmap), it grows more than it needs to, and with your example I don't have any reason to doubt what you're saying. I'll look into some of the other methods for multiprocessing.Pool to see if there's a better way to pass the rates between processes.

cjosey commented 6 years ago

It may simply be that once the pointer is de-referenced, it sees the entire numpy object and copy-on-writes that instead of only the slice it was handed. Unfortunately, I simply do not know enough about the low-level architecture to know for sure what is happening.

paulromano commented 6 years ago

One thing I just found that makes me wonder whether it has an effect -- apparently the default chunksize is len(materials) / (len(pool)*4) (see here). Can you try adding a chunksize=1 argument to starmap and running on Falcon to see if that fixes the problem?
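For concreteness, a toy sketch of what that argument change looks like (solve_one and the inputs are placeholders, not the OpenDeplete functions):

```python
import numpy as np
from multiprocessing import Pool

def solve_one(rate_matrix, n0):
    # Placeholder for the per-material burnup solve.
    return rate_matrix @ n0

if __name__ == '__main__':
    rate_matrices = [np.eye(4) for _ in range(100)]
    initial_vectors = [np.ones(4) for _ in range(100)]
    with Pool() as pool:
        # chunksize=1 ships one (matrix, vector) tuple per task instead of
        # batching roughly len(iterable)/(4*nproc) tuples into each pickled chunk.
        results = pool.starmap(solve_one,
                               zip(rate_matrices, initial_vectors),
                               chunksize=1)
```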

paulromano commented 6 years ago

@cjosey How are you handling C0 in the SMR model? I'm running into a problem because the model uses elemental carbon but our ENDF/B-VII.1 chain file has individual carbon isotopes. Did you make a modification somewhere?

cjosey commented 6 years ago

I think C0 is ignored completely; nuclides outside of the decay chain are not considered. It should remap it to isotopic carbon, but that's not a problem that can be solved correctly until ENDF/B-VIII. As it stands, I would say that the initial condition is ill-posed.

I did an extensive amount of research on how multiprocessing should work. To my understanding, arguments are pickled and unpickled unless they're shared ctypes, and globals only have their GC-related pages duplicated. It appears that things are only duplicated when passed, so adjusting the chunk size does indeed benefit us. I also found that my script has a resolution issue: as the number of processes increases, the time to read memory increases, usually exceeding my 0.1 s time step. I made a series of modifications to improve the reading time.

A plot of memory with time (approximately, dt skews in the peaks) is shown below.

[plot: memory usage over time for the multiprocessing tests]

I see three issues. The first is that there seems to be a memory problem between time steps that is so large as to swamp out any benefit that tweaking the mapping will gain. I suspect that something we're doing is confusing the garbage collector badly. Second, if one only compares the first step, globals provide quite a sizable benefit, at the cost of making the code harder to trust.

It's not worth investigating until that memory growth is isolated, but it may be best to use shared ctypes in the long run as this eliminates the issues that globals pose. Of course, the implementation complexity of shared ctypes may be greater than simply using MPI.
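For what it's worth, a minimal sketch of the shared-ctypes approach (shapes and names are made up; assumes the fork start method so the RawArray is inherited cheaply):

```python
import ctypes
import numpy as np
from multiprocessing import Pool
from multiprocessing.sharedctypes import RawArray

N_MAT, N_NUC = 1000, 200
shared = RawArray(ctypes.c_double, N_MAT * N_NUC)

def init_worker(buf):
    # Wrap the shared buffer with numpy in each worker; no data is copied.
    global number
    number = np.frombuffer(buf, dtype=np.float64).reshape(N_MAT, N_NUC)

def total_density(mat_index):
    # Reads go straight to the shared memory block, so there is no
    # per-process duplication of the big array.
    return number[mat_index].sum()

if __name__ == '__main__':
    np.frombuffer(shared, dtype=np.float64)[:] = 1.0
    with Pool(4, initializer=init_worker, initargs=(shared,)) as pool:
        totals = pool.map(total_density, range(N_MAT))
```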

Also worth mentioning: I disabled the actual CRAM part of the code (I form the matrix A but then return a copy of the initial vector), as the matrix inversion requires a ton of scratch space (20 MB/thread for the small decay chain). I need to consider a different solver. Sadly, UMFPACK is one of the fastest exact sparse solvers, so it is hard to replace.
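For context, a skeleton of where that scratch space comes from in a CRAM-style solve (the poles theta and residues alpha are the published CRAM coefficients, not reproduced here; SciPy's SuperLU is used in place of UMFPACK, and every factorization allocates its own workspace):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as sla

def cram_like_expm(A, n0, dt, theta, alpha, alpha0):
    """Apply exp(A*dt) to n0 via a rational approximation.

    theta, alpha, alpha0 are the (complex) CRAM poles/residues; A is the
    sparse burnup matrix. Each splu() call performs a full sparse LU
    factorization, which is the per-thread scratch space mentioned above."""
    n = np.asarray(n0, dtype=complex)
    result = alpha0 * n
    identity = sp.identity(A.shape[0], format='csc')
    for t, a in zip(theta, alpha):
        lu = sla.splu(sp.csc_matrix(A * dt - t * identity))
        result += 2.0 * (a * lu.solve(n)).real
    return result.real
```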

paulromano commented 6 years ago

@cjosey I've updated this branch to fix the bug related to dilute_initial. I've also changed OpenMCSettings.power to accept a value in Watts as discussed in #40. Let me know if there are any outstanding issues I need to address before we can merge this. Namely, do you want me to adopt chunksize=1 for the multiprocessing Pool?

cjosey commented 6 years ago

So, I finally had an opportunity to test this on Falcon with a 1600 region problem and a simple decay chain.

Here are my observations:

  1. At the end of the second OpenMC run, between the matrix exponential and the start of transport, the simulation pauses for a minute or two. This might be the write-to-disk phase where the HDF5 objects are made, but I don't know. If it is, that's a bit worrying, since the output for this problem is a mere 20 MB per time step.
  2. Switching to chunksize=1 does not reduce memory much if at all.
  3. The memory difference between running with and without MPI is roughly zero.
  4. For this test problem, the in-memory coupling has vastly superior performance. The "unpack" phase goes from 221 seconds to 6.
  5. The full test suite segfaults and produces errors. Not sure why.
  6. The README is out of date: HDF5 needs to be compiled as a shared library (I think), and in general the explanation is now wrong.

I would run the test suite on my local install instead of Falcon to make sure, but another bug appeared: the update function of scipy.sparse.dok_matrix was removed without deprecation. My Python setup on Falcon is older and still has this function.

Here is a plot of memory usage for all three tests (time isn't to scale for "old", as the number of processes slowed down the memory-usage scraping):

[plot: memory usage for all three tests]

So, while I feel that this modification adds some significant drawbacks (the allgather for the initial XML write, and those absolutely massive matrix exponential spikes), for the time being I consider it alright, as the old code wasn't any better. Once it passes all tests, I'll merge it.

paulromano commented 6 years ago

@cjosey That's annoying re: scipy.sparse.dok_matrix.update being removed. It is now available as an _update method, but that method does not exist in previous versions, so I've made a hack whereby dict.update is called directly, taking advantage of the fact that dok_matrix is actually a subclass of dict.
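A sketch of that workaround (the helper name is made up; it relies on dok_matrix subclassing dict, which holds for the scipy versions in question):

```python
from scipy.sparse import dok_matrix

def dok_update(matrix, entries):
    """Insert {(row, col): value} entries into a dok_matrix.

    Calls dict.update directly so it works both on older scipy (where
    dok_matrix.update existed) and on newer scipy (where only the
    private _update remains)."""
    dict.update(matrix, entries)

m = dok_matrix((3, 3))
dok_update(m, {(0, 0): 1.0, (2, 1): 5.0})
```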

cjosey commented 6 years ago

That fix got it running, but there are still test issues.

For one MPI rank, the test fails on test_full. I thought it was due to the slightly different roundoff error introduced with the new power specification, but reverting that did not get the old results back.

For multiple MPI ranks, it appears that the deletion of test_integrator_regression fails. After showing a few errors, the test suite stalls and never completes.

paulromano commented 6 years ago

I'm not getting the same errors you're seeing. Whether I run with one or multiple ranks, test_full still passes. I'm not sure what you mean with respect to test_integrator_regression -- I haven't removed any tests in this PR.

My setup: Ubuntu 17.04, gcc 6.3, OpenMPI 2.1.1, PHDF5 1.10.1, all the latest Python packages from PyPI

cjosey commented 6 years ago

Strange. I'm running almost the same configuration (except GCC 7.2, and PHDF5 1.8.19, since 1.10 never passed the tests in parallel mode). I also made sure I was using the latest develop of OpenMC with no modifications (no optimize). I'm still getting the same result.

I'll rebuild my toolchain with PHDF5 1.10.1 this weekend and see if that's the issue. I know that files generated with 1.10 aren't 100% backwards compatible, but I should hope that it would error out instead of getting results not quite identical to the original.

One possibility is that the last change to the test results was on July 30th, but the roundoff differences from going from MeV/s to W should have been enough to change the results. Are you sure your branch and your local files are synced?

paulromano commented 6 years ago

I just did a fresh clone of the repo and I'm still seeing everything pass. I also tried running everything with MPICH and PHDF5 and still had no issues.

cjosey commented 6 years ago

Apologies for the delay. I have very little free time these days, and the issue was more complex than anticipated. As it turns out, the OpenMC tests pass in GCC 6.4, but all fail in GCC 7.2. This carries over to OpenDeplete. Changing GCC versions is the only thing I did between working and not working.

I installed GCC 6.4 and all tests pass. As such, I'm merging.

paulromano commented 6 years ago

Thanks @cjosey. I'll try to get a build with gcc 7.2 to see what's up there.

paulromano commented 6 years ago

In case you're curious, I discuss the cause of the gcc 7.2 failures and a workaround in mit-crpg/openmc#936.