mit-crpg / opendeplete

A depletion framework for OpenMC
MIT License

Improve performance of tally read-out #25

Closed · cjosey closed this 7 years ago

cjosey commented 7 years ago

This PR replaces the use of the Python API to read results out of the statepoint files with a direct HDF5 read using hyperslabs. This gives a performance improvement of roughly two orders of magnitude in this part of the code.
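For reference, a minimal sketch of what a direct hyperslab read looks like with h5py; the dataset path and result-array layout here are illustrative assumptions rather than the actual statepoint schema or the wrapper code in this PR:

import h5py

def read_tally_rows(statepoint_path, tally_id, start, stop):
    """Read a contiguous block of filter-bin rows from one tally.

    Assumes a dataset at /tallies/tally <id>/results shaped
    (filter_bins, nuclides * scores, 2); both are illustrative guesses.
    """
    with h5py.File(statepoint_path, "r") as fh:
        results = fh["tallies/tally {}/results".format(tally_id)]
        # h5py turns this slice into an HDF5 hyperslab selection, so only
        # the requested rows are read from disk instead of the whole array.
        return results[start:stop, :, :]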

In addition, I attempted to reduce the RAM usage of the code a bit, as I couldn't run @wbinventor's 2x2 model on my computer. I have moderately succeeded (to the point that I could run it), but came to the conclusion that even using concurrent futures would take more RAM than full MPI. The reason is that, to perform the concurrent futures map, objects have to be duplicated from the master process to the child processes. This doubles the RAM usage relative to MPI, where the values would be resident solely on the child nodes.

Further, RAM usage isn't even close to the theoretical optimum. For the 2x2 assembly case, opendeplete takes around 3 GB, when the reaction rates, state vectors, and matrices should take around 50 MB. That is a tremendous overhead if one wants to consider a full core.

cjosey commented 7 years ago

I've replaced total_number and number_density with a single structure, number. It is flat, much like ReactionRates, which saves a substantial amount of RAM. This changed the order of the nuclides passed to OpenMC, so I painstakingly recreated the same order in the older version to regenerate the reference results.
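As a rough illustration of the kind of flat layout described here (the class and method names are hypothetical, not the actual opendeplete structures):

import numpy as np

class FlatNumber:
    """One dense (materials x nuclides) array with dict-based index lookup,
    instead of nested per-material dictionaries of number densities."""

    def __init__(self, mat_ids, nuclides):
        self.mat_index = {m: i for i, m in enumerate(mat_ids)}
        self.nuc_index = {n: j for j, n in enumerate(nuclides)}
        self.number = np.zeros((len(mat_ids), len(nuclides)))

    def get_atoms(self, mat, nuc):
        return self.number[self.mat_index[mat], self.nuc_index[nuc]]

    def set_atoms(self, mat, nuc, value):
        self.number[self.mat_index[mat], self.nuc_index[nuc]] = value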

I believe this fixes the issue discussed after the merge of #24. I'm not 100% sure, as the SMR model takes over 20 minutes to read the XML files, even with the small chain; the materials.xml is over 500k lines. As far as I am aware, this also closes #12.

There were two duplicate nuclides in the decay chain. These have been removed.

Now to determine why the matrices aren't nearly as sparse as anticipated. They average 24 nonzero elements per row, when it should be much less than that: fission nuclides use nearly the entire row, but decay-only nuclides use only 2-3, and normal nuclides around 7. Even so, that is still a density of only 0.007.

EDIT: Fixed. It turns out that if an entry exists in the matrix dictionary, it is added to the CSR matrix even if its value is still zero. Fixing this reduces RAM usage by roughly another factor of 4.4.
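A hedged sketch of that fix, assuming the matrix is accumulated in a {(row, col): value} dictionary (the dictionary layout is a guess): explicit zeros are dropped before the CSR matrix is built, so they never become stored entries.

import scipy.sparse as sp

def dict_to_csr(entries, n):
    """Build an n x n CSR matrix, skipping entries that are exactly zero."""
    rows, cols, vals = [], [], []
    for (i, j), value in entries.items():
        if value != 0.0:  # drop explicit zeros instead of storing them
            rows.append(i)
            cols.append(j)
            vals.append(value)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))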

There are only two remaining ways to reduce RAM. The first is to switch to low-memory depletion algorithms: currently ce_cm_c1 requires three matrices in RAM at any given time, yet mathematically it can get by with just one (if I remove the c1 continuity). The second is to go full MPI, which would reduce RAM by another factor of two. The issue with that is twofold:

  1. The current initial condition setter requires the entire geometry to be created with all materials in it.
  2. There is no trivial way to parallel write an OpenMC input deck.
wbinventor commented 7 years ago

I'm running the SMR model on Falcon with these changes (with the full chain, perhaps that is just intractable for now) and will let you know what happens. Clearly there are some computational bottlenecks in OpenMC that will need to be addressed in order to run full-core depletion (esp with regards to I/O), and I'll probably start opening some related issues on that repo in due time.

wbinventor commented 7 years ago

I was about to mention that slicing is already supported in a sense. The current implementation loads in the entire results array, though, since I needed slicing for a different reason (to slice apart a big tally and perform tally arithmetic on the slices). So if you took this approach you'd need to teach the Tally.get_slice method to load only the memory needed by the slice (a rough sketch of that idea follows the quoted thread below).

On Tue, Feb 21, 2017 at 5:08 PM Colin Josey notifications@github.com wrote:

@cjosey commented on this pull request.

In opendeplete/openmc_wrapper.py (https://github.com/mit-crpg/opendeplete/pull/25#discussion_r102331154):

    self.reaction_rates[:, :, :] = 0.0
    df_tally = tally_dep.get_pandas_dataframe()

Hmm, never mind. It appears there already are slices. I'll see if these can be used for my purpose.

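A rough sketch of that "only load what the slice needs" idea, purely for illustration: the dataset path, layout, and helper below are assumptions, not the OpenMC Tally API.

import h5py
import numpy as np

def read_selected_rows(statepoint_path, tally_id, row_indices):
    """Read only selected filter-bin rows of a tally's results array."""
    with h5py.File(statepoint_path, "r") as fh:
        dset = fh["tallies/tally {}/results".format(tally_id)]
        rows = np.unique(np.asarray(row_indices))  # h5py needs sorted, unique indices
        # Fancy indexing on an h5py dataset becomes an HDF5 selection,
        # so only these rows are ever read from disk.
        return dset[rows, :, :]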

wbinventor commented 7 years ago

I ran the SMR model on Falcon and it did complete the first OpenMC / depletion run with the following runtimes:

Time to openmc:  4247.689886808395
Time to unpack:  80.47928524017334
Time to matrix:  1660.030348777771
Time to matexp:  213.72929072380066

The "materials.xml" file that was generated with the full chain was over 2 million lines long (135 MB). Unfortunately, the time limit on my job wasn't long enough to permit OpenMC to complete its second simulation in the two time step problem - I don't know how much of the nearly 24 hour allocation was devoted to reading in the inputs, and how much was devoted to processing roughly 175 million particles in the depleted problem. I've launched another run with a longer time limit and will let you know in a few days if it finishes both time steps.

cjosey commented 7 years ago

As a proxy estimate, the neutronics in the second step usually runs about 50% as fast as the first. The first transport run took roughly 4,250 seconds, so even at half speed the second should have needed only a couple of hours, which means the second XML read was taking more than 20 hours.

I'm going to investigate whether we have some O(n^2) or O(n^3) algorithm in the read-in.

wbinventor commented 7 years ago

Sounds like we need to greatly improve the XML parsing performance and/or switch to a binary format.

cjosey commented 7 years ago

Yeah. I ran callgrind overnight and found that, of the 3.231 trillion instructions, 3.004 trillion were in __openmc_fox_MOD_getchildrenbytagname. Of those, 2.997 trillion were spent on this loop:

do i = 1, size(nll)
    temp_nll(i)%this => nll(i)%this
enddo

(Lines 114-116 of that file.) The rest was in __m_dom_parse_MOD_parsefile.
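That loop reads like the copy step of a grow-by-one reallocation, which would make building the child list quadratic. A small Python illustration of that pattern (not the FoX code itself, just the suspected behavior):

def append_with_full_copy(nll, node):
    """Grow a list by one element while copying every existing entry,
    mirroring the temp_nll(i)%this => nll(i)%this loop above."""
    temp_nll = [None] * (len(nll) + 1)
    for i in range(len(nll)):   # O(n) copy on every single append
        temp_nll[i] = nll[i]
    temp_nll[-1] = node
    return temp_nll

nodes = []
for k in range(10_000):
    nodes = append_with_full_copy(nodes, k)   # ~k copies each time: O(n^2) total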

cjosey commented 7 years ago

Yeah, no worries. If I'm honest, there's a reason I'm not assigning anyone to inspect these PRs. I just want to have them up for a few days in case someone else (pretty much only @paulromano) was either working on or interested in working on the same thing. The incidental reviews, while nice, aren't necessary.

The only reason I've sat on this one for so long is that I was trying to figure out what to do about tally slicing. Either I modify the OpenMC API now, or keep this as is and fix it in a later PR. I think for the time being I would prefer to leave it as is.

I'm going to put a cap on the reporting of negative values at -1000 atoms for the time being. That is far too few atoms to matter physically, but still larger in magnitude than any negative value I've come across. I'll merge this once I get that in.
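For what it's worth, a minimal sketch of what such a reporting cap could look like (the function and names are hypothetical; only the -1000 atom threshold comes from the comment above):

import warnings

NEGATIVE_ATOM_CAP = -1000.0  # atoms; anything between this and zero is ignored

def report_negative_atoms(atoms, nuclides):
    """Warn only about negative atom counts more negative than the cap."""
    for value, nuc in zip(atoms, nuclides):
        if value < NEGATIVE_ATOM_CAP:
            warnings.warn("Negative atom count for {}: {:.3g}".format(nuc, value))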

wbinventor commented 7 years ago

I'm in no rush to use this btw - I have plenty of other things to do and the depletion of my benchmarks can wait while you decide on a path forward. Part of me is interested in seeing out-of-core tally slicing implemented in the PyAPI, but perhaps that's just because it sounds cool :-)