openmc-dev / openmc

OpenMC Monte Carlo Code
https://docs.openmc.org

Tally module refactor #214

Open paulromano opened 10 years ago

paulromano commented 10 years ago

Now that a decent fraction of the code has been refactored or rewritten, I think it's time we focus on the tally module. There are numerous problems with it:

Now that the physics module has been split into physics and tracking, we can create fake particles when tallying to get post-collision particle attributes even when no scattering/fission event happens. This allows us to truly implement a collision estimator, and it also allows us to use certain filters/scores with a tracklength estimator that we couldn't before.
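A minimal sketch of the fake-particle idea (assuming the physics/tracking split above; all routine names here are hypothetical):

type(Particle) :: fake

fake = p                      ! copy the pre-collision state
call sample_collision(fake)   ! run only the collision physics on the copy
call score_event(fake)        ! tally with post-collision attributes
! p is untouched and continues tracking as before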

paulromano commented 10 years ago

As @nhorelik points out, there is no natural way to tally with lattice structures. This refactor should also address #215.

paulromano commented 10 years ago

It's also not possible to use multiple mesh filters. A refactor should have some generalization for meshes (particularly of different types).

nhorelik commented 10 years ago

I would also like to be able to tally responses with microscopic cross sections that weren't loaded for the materials - e.g., tallying a U235 reaction rate in instrument tubes whose material is only air. This might require some special cross section loading routines; in many cases, for instance, we only need a certain MT number and not the whole set.

@bhermanmit and I were discussing that since it looks like there are a lot of different things that various people want to see come out of this refactor - and more than one person seems to be ready to work on it currently - it would probably be helpful to have a more unified understanding of what the overarching picture will look like. By this I mean some high-level and intermediate-level flow charts demonstrating the desired flow, what each subroutine should be trying to accomplish, etc. Bryan just shared a ShareLaTeX document with us for this purpose.

paulromano commented 10 years ago

Good idea guys!

nelsonag commented 10 years ago

That would be good. Can I get in on the ShareLaTeX doc? I have an account under my umich email address.

smharper commented 9 years ago

Hey, everyone. I'm working on adding perturbations, and that means a lot of interaction with the tally module. Can you guys add me to the ShareLatex doc? I don't want to duplicate work or diverge from your efforts.

paulromano commented 9 years ago

@smharper You should touch base with @bhermanmit as he has been doing all the development on the tally refactor.

bhermanmit commented 9 years ago

@smharper - the tally refactor effort has kind of slowed and needs help from other developers. Besides refactoring the tallies, the branch also refactors where saved variables are located to avoid dependency issues. The framework that I have developed seems promising, and I have implemented a few scores and filters as well as tracklength, analog, and collision estimators. The following tasks still need to be finished:

smharper commented 8 years ago

I want to make the following three changes to tallies. Let me know if any of you have thoughts on these.

Functional expansion elements

First, I want to add functional expansion elements. These will handle our PN and YN expansions now; they might also be appropriate for @nelsonag's NDPP tallies, and I hope that they will handle @mellis13's spatial expansions in the future. This means that rather than a user specifying a scatter-P3 score, they would specify a P3 functional expansion element and import it into a tally with a scatter score.

<!-- Old: -->
<tally id="1" scores="scatter-Y3 flux-Y3" />
<!-- New: -->
<functional_expansion id="1" type="spherical_harmonics" order="3" />
<tally id="1" scores="scatter flux" expansion="1" />

These are the advantages that I think functional expansion elements will give over just more scores:

- We'll be able to shorten the current list of scores, and we'll keep it from growing huge as we add more expansion types. (Imagine what it would look like if you had to add scores for all permutations of x-order, y-order, z-order for a spatial Legendre expansion, for example.)
- The same expansion can be used for multiple tallies. That means a user will only have to change one element, the expansion, in order to adjust the order for all of their tallies.
- It will allow us to simplify the tally.F90 logic (expansion logic would be placed inside type-bound methods for functional_expansion types rather than score_general).
- It will simplify input_xml.F90 because the current logic for reading the score-FN style is pretty bulky.

Filter elements

Second, I would also like to separate filters from tally elements.

<!-- Old: -->
<tally id="1" scores="flux total" >
  <filter type="energy" bins="0.0 6.25E-7 20.0" />
</tally>
<!-- New: -->
<filter id="1" type="energy" bins="0.0 6.25E-7 20.0" />
<tally id="1" scores="flux total" filters="1" />

One score per tally

Finally, I want to move to tally elements that only have one score. I imagine that most of the reason we have many scores per tally now is so users can reuse filter elements. But that won't be necessary if I make the above change of separating filters from tally elements. Furthermore, if I also implement the functional expansion element, it will reduce the number of scores you can lump together anyway (for example, one could no longer specify nu-fission scatter-p3 in the same tally).

<!-- Old: -->
<tally id="1" scores="flux total" />
<!-- New: -->
<tally id="1" score="flux" />
<tally id="2" score="total" />

The tallies.xml of the future

Put it all together and we have something that looks like

<filter id="1" type="energy" bins="0.0 6.25E-7 20.0" />
<filter id="2" type="energyout" bins="0.0 6.25E-7 20.0" />
<filter id="3" type="distribcell" bins="11" />

<functional_expansion id="1" type="spherical_harmonics" order="3" />

<tally id="1" score="flux" filters="1 3" />
<tally id="2" score="total" filters="1 3" />
<tally id="3" score="scatter" filters="1 2 3" expansion="1" />

Thoughts?

wbinventor commented 8 years ago

You propose some pretty cool ideas @smharper! The functional expansions are not something that I personally make much use of right now, but I can certainly see the value of separating out the functional order into its own entity in the input. The idea of separate and "shared" filter elements is one that I have long thought would simplify the inputs greatly, especially for some of us who use lots of tallies.

I think you make some good points about the scores, but I'm less sold on this account. My intuition tells me that by aggregating the data that is scored to - i.e., with nuclides and scores as the innermost dimensions for tally results - the cache efficiency will be better than splitting the arrays up for separate tallies. In addition, the issue I tried to address with mergeable tallies in the Python API in #383 would be exacerbated by splitting scores across tallies, unless you have plans for a clever data structure to enable much better than O(N) lookups for tallies at runtime.

nelsonag commented 8 years ago

@smharper this is looking real nice. I agree with @wbinventor's suggestion about not requiring one score per tally, even though your idea does result in much cleaner code.

Two more thoughts:

Thanks!

smharper commented 8 years ago

@wbinventor, I was operating under the assumption that the mergeable tally speedup you saw was due to more efficiency in the filter indexing (as in, you removed redundant filter indexings, which I also hope to do with the shared filters, btw), but I see now that there might be a cache issue on the score dimension of the arrays. After I implement the filter change, I would like to run some tests to see if the cache effect is really present and significant.

@nelsonag, I like the idea of builtin filter structures. We could make them a separate element type like

<easy_filter id="1" type="energy" name="helios47" />

or maybe just overload the bins element

<filter id="1" type="energy" bins="helios47" />

I'm not sure what you mean by "response function". Can you give me an example of something that is a response function but not an expansion?

nelsonag commented 8 years ago

I'm not sure what you mean by "response function". Can you give me an example of something that is a response function but not an expansion?

Sure, by response function I mean some general function which takes particle attributes as input and produces some output. The functional expansions you mention fall under this category. However, NDPP tallies are technically not functional expansions, at least of the data known by the Monte Carlo simulation at tally time, but are response functions.
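To make the distinction concrete, here is a minimal sketch of a general response-function hook (the interface is my invention, not an existing one):

abstract interface
  ! Any mapping from particle attributes at tally time to a scoring weight.
  ! An expansion-coefficient evaluator is one special case; an NDPP lookup
  ! would be another.
  function response_fn(p) result(r)
    import :: Particle
    class(Particle), intent(in) :: p
    real(8) :: r
  end function response_fn
end interface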

wbinventor commented 8 years ago

@smharper the cache efficiency is related to your idea of single-score tallies. The impetus for mergeable tallies is that, at each event, the tally system currently searches each and every tally to see if it is relevant to the event (e.g., contains a filter for the same cell as the particle's current location). Hence, the time it takes to tally each event scales with the total number of tallies in the simulation. By aggregating or merging the scores into fewer tallies, the overall tally time can be made many times faster for a simulation with lots of tallied quantities, since the total search space is greatly reduced.
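In pseudocode, the current behavior as described (names hypothetical):

do i_tally = 1, n_tallies
  ! Every event pays a match test against every tally, so tallying cost
  ! scales linearly with n_tallies no matter how few actually match.
  if (.not. tally_matches_event(tallies(i_tally), p)) cycle
  call score_to_tally(tallies(i_tally), p)
end do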

On a separate note, I like @nelsonag's idea of "simplified tallies." But I think that the focus of a tally refactor should be the development of efficient object-oriented and polymorphic data structures to avoid the explosive use of "if-else" and/or switch-case blocks to accommodate new scores and filter types. IMHO the goal of the tally refactor should not be the implementation of new, cool filter types but to create the platform necessary for developers to easily "plug-n-play" new filters/scores into the tally system.

wbinventor commented 8 years ago

@smharper I thought a little bit about what you alluded to in your last comment about the possibility of searching each filter only once, rather than searching each filter in each tally. I think this would (largely if not entirely) mitigate the mergeable tally issue. The cache efficiency with scores is likely less of a big deal, though it might become increasingly important on many-core machines. @paulromano and @jtramm would know more given their multi-core performance analysis of OpenMC. It would also be worthwhile to hear @nhorelik's thoughts on your proposed ideas and how they would interact with the different approaches he has taken to tallying massive (TB+) datasets with domain decomposition - i.e., tally "scrambling" and on-the-fly tally allocation.

paulromano commented 8 years ago

For starters, I like the functional expansion concept, and agree it would greatly simplify things. Having unique filters which are associated with tallies also seems like a good way to go. It will certainly require a bit of restructuring of the code, but that's what we're talking about anyway :) As far as functional expansion versus response, I have a little trouble seeing how a response in general is different from a score. @nelsonag maybe you can elaborate on this topic some more.

I am very hesitant about the idea of a single score for each tally however. Having multiple scores in a single tally is definitely beneficial for cache re-use since accesses are linear in memory. Such a memory access pattern is only possible because multiple scores are being scored to in the same array. As far as multi-core performance and how this would affect it, I'd say it largely depends on the problem, what tallies are being used, and how many cores you're using, but in general splitting scores over multiple tallies means that you end up with more dirty cache lines, and consequently more misses due to cache line invalidation. One of the main target problems I always have in mind is a hypothetical depletion simulation where you would want to know ~6 scores in ~300 nuclides in millions of fuel regions. @smharper you ought to mention how you see <nuclide> fitting in since you didn't mention it in your original proposal.
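For scale, some rough arithmetic on that target problem (taking 2 million regions and two 8-byte accumulators per bin as assumptions): 6 scores × 300 nuclides × 2×10^6 regions ≈ 3.6×10^9 tally bins, or roughly 58 GB of results memory before any additional filters. At that size, memory layout and access patterns dominate performance.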

smharper commented 8 years ago

@paulromano, I was planning to initially leave the <nuclide> input syntax and the basic location of the nuclide loop/data where it is now. But in the long term, I am considering putting that functionality into an expansion (maybe it's an abuse of the term "functional expansion"... but delta functions are still functions).

Since we are talking about things like cache efficiency, I suppose it will be beneficial to share the scoring algorithm that I had in mind. Let me iron out one kind of confusing piece now. I imagine that we will (eventually) want to allow multiple expansions per tally. This means we'll have to score all of the combinations of all of the coefficients in all of the expansions. Here's one algorithm that could do that (although in practice I would use something with more vector operations): [REDACTED see comment below]

Let's stuff that snippet into a type bound method like tally % get_expansion_mask(p). Using that, here's a rough-out of how I would write the scoring loop:

loop over all filters
  filter % matching_bin = filter % get_scoring_bin(p)
end filter loop

loop over tallies
  flux = tally % flux_estimator % get_flux()

  if flux == 0
    cycle tally loop
  end if

  filter_index = 1
  loop over tally % filters
    if filter % matching_bin == NO_BIN_FOUND
      cycle tally loop
    end if
    filter_index += filter % stride * filter % matching_bin
  end tally % filters loop

  mask = tally % get_expansion_mask(p)

  loop over tally % nuclides
    score = tally % get_score(p, nuclide)
    tally % results % value += score * mask
  end tally % nuclides loop

Some things to note:

I suppose when I envisioned this, I had really nasty expansions in mind (combinations of 3D spatial and angular expansions, maybe with NDPP tacked on), and I was worried about their memory efficiency over reaction type scores.

Since there are legitimate concerns about the cache efficiency of reaction-type scoring, I think I'll make one slight modification to the picture above: tally % get_score will return an array with one entry per reaction type. Then score * mask can be a col_vec * row_vec. That will also mean getting rid of the flux == 0 idea and replacing it with something inside that nuclide loop.
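For concreteness, that col_vec * row_vec step could look like the following (a sketch; score, mask, block_scores, and the dimensions n_rxn and n_comb are assumed names):

! Outer product of the per-reaction scores with the expansion mask:
! score(n_rxn) and mask(n_comb) combine into an (n_rxn, n_comb) block.
block_scores = matmul(reshape(score, [n_rxn, 1]), &
                      reshape(mask,  [1, n_comb]))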

nelsonag commented 8 years ago

@paulromano:

As far as functional expansion versus response, I have a little trouble seeing how a response in general is different from a score. @nelsonag maybe you can elaborate on this topic some more.

It isn't, really: a score is a response function; however, a score is also a functional expansion. So my point (slightly self-serving due to the long-term goal of including NDPP tallies, though I bring it up because I hope there are other examples, perhaps KDEs) is merely that 'functional expansions' might not be the correct terminology for how we may end up using this in practice. I differentiate between functional expansions and the NDPP tallies because technically the NDPP data is not a functional expansion of what is happening in the MC simulation, i.e., it is not a functional expansion of the behavior of the as-simulated neutrons, but is instead a prediction of a functional expansion, if that makes ANY sense.
Either way, you are right, the "response function" terminology isn't very useful either since that is technically what a score is.

smharper commented 8 years ago

It looks like I was a little hasty when I typed out that expansion algorithm. Here's one that will actually work:

mask(1 : n_combinations) = 1
loop over expansions in tally
  stride = expansion % stride           ! distance before this expansion's sequence repeats
  n_bins = expansion % max_order + 1    ! orders 0 through max_order
  chunk = stride / n_bins               ! consecutive repeats of each order
  for i = 1, n_combinations
    order = ((i - 1) modulo stride) / chunk  ! integer (floor) division
    mask(i) = mask(i) * expansion % get_coeff(p, order)
  end for
end expansion loop

nelsonag commented 8 years ago

I was thinking about the functional expansion vs. response function distinction too generally. In this case it will be applied to amend any type of score by working on any of the 6 (7 if we get way advanced) dimensions that a score may or may not depend on. The same can't be done with a response function. So I rescind this line of thinking :-) sorry about that.

smharper commented 8 years ago

Upon further thought, I would like to retract the idea of functional_expansions and replace them with scoring_functions which can also handle reaction scores.

<scoring_function id="1" type="reaction" reactions="scatter nu-scatter" />
<scoring_function id="2" type="reaction" reactions="nu-fission" />
<scoring_function id="3" type="spherical_harmonics" order=3 />
<scoring_function id="4" type="scatter_legendre" order=3 />
<scoring_function id="5" type="ndpp_energyout" bins="0.0 6.25E-7 20.0" />

<tally id="1" scoring_functions="3 4 5 1" />
<tally id="2" scoring_functions="5 2" nuclides="U-235 Pu-238" />

The cool thing about this is that you can pick which order you want your data to be stored in and looped over, without worrying about whether it is a reaction type, angular expansion, set of delayed groups, or whatever.

paulromano commented 8 years ago

@smharper That's an interesting idea. Can you sketch out pseudocode for how the scoring would work similar to what you showed before? And presumably there will be some restrictions on what combinations of scoring functions are acceptable?

smharper commented 8 years ago

Here's an abstract base class for ScoringFunction:

type, abstract :: ScoringFunction
  integer :: n_bins
  real(8), allocatable :: score(:)
  contains
  procedure(compute_score_sf_), deferred :: compute_score
end type ScoringFunction

abstract interface
  subroutine compute_score_sf_(this, p, i_nuclide)
    import :: ScoringFunction, Particle
    class(ScoringFunction), intent(inout) :: this
    class(Particle),        intent(in)    :: p
    integer,                intent(in)    :: i_nuclide
  end subroutine compute_score_sf_
end interface

So now that you have an idea how it would work, here are two examples of possible subclasses:

type, extends(ScoringFunction) :: ReactionSF
  integer, allocatable :: reactions(:)
  contains
  procedure :: compute_score => compute_score_rxn
end type 

subroutine compute_score_rxn(this, p, i_nuclide)
  class(ReactionSF), intent(inout) :: this
  class(Particle),   intent(in)    :: p
  integer,          intent(in)    :: i_nuclide

  logical :: material_contains_nuclide
  real(8) :: atom_density
  integer :: i_bin

  this % score = ZERO

  material_contains_nuclide = ! Insert logic to see if i_nuclide is present
  if (.not. material_contains_nuclide) return

  atom_density = ! Atom density calculation goes here

  do i_bin = 1, this % n_bins
    select case (this % reactions(i_bin))

    case (RXN_FLUX)
      this % score(i_bin) = ONE

    case (RXN_TOTAL)
      if (i_nuclide > 0) then
        this % score(i_bin) = micro_xs(i_nuclide) % total * atom_density
      else
        this % score(i_bin) = material_xs % total
      end if

    case (RXN_SCATTER)
    ...

    ! This is where most of our big select case(score_bin) block goes

    end select
  end do
end subroutine compute_score_rxn

type, extends(ScoringFunction) :: SphericalHarmonicsSF
  contains
  procedure :: compute_score => compute_score_pn
end type

subroutine compute_score_pn(this, p, i_nuclide)
  class(SphericalHarmonicsSF), intent(inout) :: this
  class(Particle),            intent(in)    :: p
  integer,                    intent(in)    :: i_nuclide

  integer :: order

  do order = 0, this % n_bins - 1
    this % score(order + 1) = calc_pn(order, p % mu)
  end do
end subroutine compute_score_pn

Each tally will have an array of ScoringFunctions like those. Notice that every ScoringFunction's score is an array (in general). To build the entire score for one tally event, we need to get all the combinations of bin indexes for these different scores. For example, if we have three scoring functions with 2, 3, and 2 bins, respectively, then the entire score array will have 12 bins like this:

score( 1) = scorefuns(1, 1, 1)
score( 2) = scorefuns(2, 1, 1)
score( 3) = scorefuns(1, 2, 1)
score( 4) = scorefuns(2, 2, 1)
score( 5) = scorefuns(1, 3, 1)
score( 6) = scorefuns(2, 3, 1)
score( 7) = scorefuns(1, 1, 2)
score( 8) = scorefuns(2, 1, 2)
score( 9) = scorefuns(1, 2, 2)
score(10) = scorefuns(2, 2, 2)
score(11) = scorefuns(1, 3, 2)
score(12) = scorefuns(2, 3, 2)

To make that pattern, the tally will assign a chunk_size and stride to each of its scoring functions. chunk_size refers to how many times each scoring function's bins are repeated consecutively. So in the above example, the scoring functions have a chunk size of 1, 2, and 6, respectively. stride refers to the distance in the score array before a particular scoring function repeats its sequence. In the example, the strides are 2, 6, and 12, respectively. Here's how a tally could use that information to build up a score:

subroutine build_score(this, p, i_nuclide)
  class(Tally),    intent(inout) :: this
  class(Particle), intent(in) :: p
  integer,         intent(in) :: i_nuclide

  integer :: i_sf, i_bin, i_chunk  ! Loop iterators
  integer :: n_bins, chunk_size, stride  ! score_fun characteristics
  integer :: first
  class(ScoringFunction), pointer :: current_sf

  this % score = ONE  ! Note that this % score is a vector.

  do i_sf = 1, this % n_scoring_funs
    ! Bookkeeping.
    current_sf => this % scoring_funs(i_sf)
    n_bins = current_sf % n_bins
    stride = this % scorefun_stride(i_sf)
    chunk_size = this % scorefun_chunk(i_sf)

    ! Evaluate the current ScoringFunction. 
    call current_sf % compute_score(p, i_nuclide)

    ! Apply the current ScoringFunction's response to the score for this event.
    do i_bin = 1, n_bins
      do i_chunk = 1, chunk_size
        first = (i_bin - 1) * chunk_size + i_chunk
        this % score(first::stride) = this % score(first::stride) &
             * current_sf % score(i_bin)
      end do
    end do
  end do
end subroutine build_score
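For completeness, here is a sketch of how those chunk sizes and strides could be assigned at initialization (my assumption of the setup code; the member names match the pseudocode above):

! Assign chunk_size and stride from the bin counts, in order. For bin
! counts (2, 3, 2) this gives chunks (1, 2, 6) and strides (2, 6, 12),
! matching the 12-bin example above.
chunk = 1
do i_sf = 1, this % n_scoring_funs
  this % scorefun_chunk(i_sf) = chunk
  chunk = chunk * this % scoring_funs(i_sf) % n_bins
  this % scorefun_stride(i_sf) = chunk
end do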

When a collision happens, we'll call score on the relevant tallies. That subroutine will look something like this:

subroutine score(this, p)
  class(Tally),    intent(inout) :: this
  class(Particle), intent(in) :: p

  real(8) :: flux
  integer :: filter_index, i, i_nuclide

  flux = this % flux_estimator % get_flux()
  if (flux == ZERO) return

  filter_index = ! More code goes here

  do i = 1, this % n_nuclide_bins
    i_nuclide = this % nuclide_bins(i)
    call this % build_score(p, i_nuclide)
  end do

  this % results(:, filter_index) = this % results(:, filter_index) &
       + flux * this % score
end subroutine score

Notice that Tally % build_score and ScoringFunction % compute_score are subroutines, not functions. I would have liked to make them functions that return arrays, but I think that would require us to regularly allocate arrays. So instead, they are subroutines that write into member arrays which only have to be allocated once at initialization.

It would be cool if there could be a NuclideSF instead of enforcing the location of the nuclide loop in Tally % score, but we would have to figure out how to couple it to ReactionSF. I'm sure we could do that, but I'll leave that for later.

@paulromano, I can't think of any restrictions we would have to apply right now.

paulromano commented 8 years ago

@smharper Thanks a lot for that sketch. Definitely helps to see a rough idea of where everything would lie, and I think I understand it for the most part. As you said, there wouldn't be any restrictions on combinations of scoring functions, so in theory a user could tally very exotic quantities if they so desired. The abstraction is beautiful, certainly better than I would have come up with on my own. I think we'll just have to be careful about performance, particularly with memory copies. I see you're already thinking about that aspect (subroutine vs function returning array), which is great. Let us know if you have a tentative idea of when you're planning on taking a crack at an actual implementation and if you'd like/need help from others.

wbinventor commented 8 years ago

@smharper from what I've seen, I would agree that this sketch looks really good. I need to look at it in more detail, esp. since the tally system is very important to my work. One question I do have is this - if we're going to all this work to do an OO refactorization, why keep the switch-case for the score types when computing tally scores? Why not subclass with finer granularity down to the fundamental score types themselves? I'd be curious if this would have any impact on performance for problems with lots of tallies and lots of scores (perhaps not, and branch prediction is smarter than I realize). It just seems to me that the subclasses should be as slim as possible - just ripe for the taking to be used as examples for any developer to come along and create a new score subclass.

smharper commented 8 years ago

@paulromano, I don't know when I'll actually implement this. I hope to whittle away at it and have at least a partial implementation by the start of next summer. Right now, #438 and diff tallies (gotta do a thesis sometime, ya know) are a little higher priority. As for getting help with this, I would appreciate help but I'm not sure yet how to break this change down into small tasks so that different people could work together. I'll think about it some and let you guys know if I come up with a plan for divide and conquer.

@wbinventor, I think that replacing that switch-case might be a case of getting too trigger-happy with object-oriented design. OO has its downsides:

It usually requires many more lines of code (especially in Fortran), which hurts readability. For example, you only have to make one definition of real(8) :: atom_density to use it in many case statements, but if you use OO procedures, you end up having to repeat that definition over and over again.

It takes a block of code and scatters it out across several different parts of several different files (especially in Fortran), which hurts readability. With select-case, all the code is in one big monolithic block. If you replace that with reaction % get_score and a developer wants to know what that does on the inside, then they'll probably have to go to a different file like tally_header.F90 or reaction_header.F90; in that file, they'll find the definition of class :: Reaction at the top, which gives them the deferred get_score function. The interface for that get_score function is probably near the Reaction class, but separate in its own abstract interface. Then the developer has to find the type, extends(Reaction) :: FissionReaction, which will point to a procedure that is, again, in a different part of the F90 file.

It really does hurt performance. Check out this simple comparison I threw together. When I call run_tests.sh, I see that the OO version takes about 3.5 s and the procedural version takes 3.2 s. That's small, but it gets worse if you go to real conditions where the objects will be in a different file. Move the funct_header module into its own file and turn on optimization in run_tests.sh (change -g to -O3, and don't forget to add funct_header.F90 to the argument list) and I see the more dramatic difference of 1.7 s vs 1.0 s. (I assume this isn't really a fault of OO itself, but of the lack of link-time optimization when you move objects to different files. For a more extreme example of this difference in optimization, remove the write statement and run the test: the single-file version will be whittled down to nothing in the dead-code removal and it will run instantly.) Also, look at the size of the binaries. The OO version is a few kB bigger. In that tarball, I included the objdump for each binary. You can look at that to understand why we get a little worse performance with OO. Here's a section of the OO objdump with some junk removed:

    ptr => idf
  40094a: mov    QWORD PTR [rbp-0x18],0x400ba0
  400952: lea    rax,[rbp-0x4]
  400956: mov    QWORD PTR [rbp-0x20],rax
    y = ptr % compute(x)
  40095a: mov    rax,QWORD PTR [rbp-0x18]
  40095e: mov    rax,QWORD PTR [rax+0x20]
  400962: lea    rcx,[rbp-0x28]
  400966: lea    rdx,[rbp-0x20]
  40096a: mov    rsi,rcx
  40096d: mov    rdi,rdx
  400970: call   rax

There we can see a hint of how polymorphism really works. Look at line 94a. We are actually calling that method at location 970 with call rax. But what is rax? From locations 95e, 95a, and 94a, we can see that rax is really 0x400ba0 + 0x20. If you look further down in that file, you can see that the other calls to compute boil down to call 0x400c20 + 0x20 and call 0x400be0 + 0x20. Here's what I think is going on: 0x400ba0, 0x400c20, and 0x400be0 contain pointers to our types, IdentityFunct, SquareFunct, and PolyFunct. Since they are all Class(Funct), we know they contain a compute method 0x20 after the beginning of their data. What's more, I doubt the actual implementation of compute can be found at 0x20; 0x20 is probably itself a pointer to the actual implementation of compute (otherwise, we wouldn't know where to put the definitions for the other methods of Class(Funct)).

So you see, when we call ptr % compute(x), we have to travel down a long chain of pointers. It's ptr-to-variable-in-stack => ptr-to-compute-function => ptr-to-compute-implementation. And this isn't cache efficient either because the main loop, the object, and the procedure implementation are in three different places in the code (very similar to how the Fortran source code looks, really). Contrast this with the procedural objdump:

  select case(f_type)
  40082d: mov    rax,QWORD PTR [rbp-0x20]
  400831: mov    eax,DWORD PTR [rax]
  400833: cmp    eax,0x2
  400836: je     400853 <compute_+0x36>
  400838: cmp    eax,0x3
  40083b: je     400866 <compute_+0x49>
  40083d: cmp    eax,0x1
  400840: jne    4008c7 <compute_+0xaa>

...

    y = compute(x, 1)
  400900: lea    rax,[rbp-0x10]
  400904: mov    esi,0x400ac0
  400909: mov    rdi,rax
  40090c: call   40081d <compute_>

Here we know the exact location of compute at compile time, hence call 40081d. Then, when we look at the select case statement, we see a distinct lack of pointer nonsense. Before, we had to go find a pointer at 0x400ba0 + 0x20 and then follow that to the implementation. Now, we know at compile time that we simply have to jump to 40081d which will quickly tell us to jump to 400840.

Hmm, I had more to say on that topic than I thought I did. Anyway, OO gives us a lot of advantages when we need a deep tree of inheritance, or when we have a bunch of different functions that are really complex. But the reaction scores are not really complex. They are little snippets of code, and because of this I think we will get better readability and better performance by leaving them in select case statements.

smharper commented 8 years ago

Addendum: It's also possible that the performance differences that I mentioned will actually be unnoticeable in a real OpenMC simulation. And compilers can also remove the differences under the right circumstances (my previous example runs at the same speed with -O3 if you keep the OO module with its program, for example). So we should probably try abstracting away the select case and look at the performance empirically, but I think that experiment should be a low priority because of my previous arguments.

wbinventor commented 8 years ago

@smharper if you had spent all the time you used to so thoroughly debunk my bad OO idea on the refactor instead, we would have a refactored tally module already!

You raise some great points. I definitely agree that breaking up the switch-case would likely lead to a larger codebase, and one which might be more convoluted to follow. These reasons alone might be enough to keep the switch-case around.

I'm less convinced by the performance arguments, re your addendum above. For large simulations where performance matters most, tallies will have more than one bin. The pointers to link to the subroutine implementation may be cache misses when scoring to the first bin, but will be readily accessible in L1 for each successive bin. This is just my intuition, I haven't taken the time to do any experiments as you have done, so this may be a poor analysis.

Finally, I certainly don't know if there would be any performance improvement by breaking away from the switch-case. I do think that by avoiding conditionals we might open the door to a little SIMD vectorization when scoring tallies in the future. Speaking of which, regardless of how the refactor is implemented, we should consider how to structure the algorithm to score to all tally bins in a more SIMD-like fashion.
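As a rough illustration (my sketch, with assumed array names), this is the kind of branch-free inner loop that compilers can vectorize once the per-bin responses are precomputed into a contiguous array:

!$omp simd
do i_bin = 1, n_bins
  results(i_bin, filter_index) = results(i_bin, filter_index) &
       + flux * response(i_bin)
end do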

smharper commented 8 years ago

@wbinventor, yeah, maybe I got a bit carried away there. I wrote that sentence "It really does hurt performance," and then I realized that I was just repeating things I had heard previously without knowing why they are true. So next thing I know I'm working my way through an objdump. Anyway, it was a fun exercise. 8/10, would waste time on it again.

Re: vectorization, Yes, definitely something to look into.

paulromano commented 8 years ago

My gut instinct is that OO will hurt performance at that fine-grained a level, but like @smharper said, it's just hearsay and I don't have any numbers to back it up. I'm sure all of us will be interested to see hard evidence one way or the other. It's probably also very compiler-dependent at this point.

Vectorization is a tougher nut to crack and one that would probably require wholesale restructuring of much of the code... so I wouldn't overly worry about it for now. No area in the code currently takes advantage of vectorization so we have a long way to go.

mellis13 commented 8 years ago

@smharper Quick question on the tally refactor. With the refactor, how hard do you think it would be to compute a covariance matrix for FET coefficients? Eventually FET coefficients are used to reconstruct a tallied quantity at a point, and therefore the only way to get non-approximate uncertainty on the quantity at a point is to have the covariance matrix. Probably less of an issue, but maybe it would also be useful for @wbinventor's tally arithmetic if covariances could easily be calculated (feel free to say if it is not really needed). Thanks for doing all of this work, @smharper !

smharper commented 8 years ago

@mellis13, sadly my statistics knowledge is failing me here. I don't know how difficult it is. Let's talk about this some in person in a couple weeks.

mellis13 commented 8 years ago

Sounds good! Thanks @smharper

paulromano commented 8 years ago

One thought I had as I was looking through the tally module the other day -- right now some filter logic uses attributes of the tally directly, e.g. if the tally uses an analog estimator, the filter does something different than if the estimator were tracklength or collision. If we really want to make filters separable from tallies, we'll need to remove that dependence somehow. Not sure if you've already thought about this @smharper.
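One possible way to remove that dependence (purely a sketch; the type and routine names are made up) is to pass the estimator into the filter's matching routine so the filter never reaches back into the tally:

subroutine get_matching_bins(this, p, estimator, match)
  class(TallyFilter), intent(in)    :: this
  class(Particle),    intent(in)    :: p
  integer,            intent(in)    :: estimator  ! e.g. ESTIMATOR_ANALOG
  type(FilterMatch),  intent(inout) :: match
  ! bin-matching logic branches on `estimator` locally, with no
  ! back-reference to the owning tally
end subroutine get_matching_bins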

paulromano commented 7 years ago

Just wanted to give an update for anyone following this issue. We now have filters implemented as their own objects thanks to the excellent work of @smharper and @amandalund. This summer, @LMKerby and her student Brittany Grayson are looking into implementing functional expansions as a new filter type. After some discussion, we decided that this was a simpler path forward than introducing a separate abstraction for expansions. The existing 'expansion-type' scores (like scatter-Pn) will be replaced by using, in this case, something like LegendreAngleFilter along with just a regular scatter score. The same can be done for spherical harmonics expansions. Beyond replacing those scores, we're also planning on adding filters for functional expansions in space (Legendre, Zernike) that can be used for coupling.

LGKerby commented 7 years ago

Thank you Paul.
