passaH2O / dorado

For shallow-water Lagrangian particle routing.
https://passah2o.github.io/dorado
MIT License
54 stars 9 forks source link

Code refactoring #15

Closed elbeejay closed 4 years ago

elbeejay commented 4 years ago

Code Refactoring

This PR addresses points raised in #12 and #13 by doing the following:


Old Draft Text

Most of this PR will be code refactoring to help address points raised in #12 and #13.

This is branched off of #14, and should only be merged after #14 has been merged. The handful of commits associated with #14 will then disappear from this PR.

Checklist

elbeejay commented 4 years ago

@wrightky - I think the bulk of this restructuring effort is complete (minus all documentation changes) but I'd appreciate your thoughts on some decisions I made that went beyond the scope of our discussion about this, this is all related to the new generate_particles() function. I still need to write some additional unit tests to make sure all of these things work, but would like to hear your thoughts on the following:

  1. Should we preserve backwards compatibility?

    • The way I've written it now, we still allow Np_tracer, seed_xloc and seed_yloc to be assigned to the modelParams class and these values will be used if dorado.Particles.generate_particles() is called without any arguments. I left this capability there because without it, the high-level steady_plots() in routines.py requires more inputs and is less "clean" (so not the best justification, I'll admit). The recommended workflow would be to assign model parameters to modelParams, and then call dorado.Particles.generate_particles(Np_tracer, seed_xloc, seed_yloc), so wanted to know how you felt about preserving the optional ability to assign particle information to modelParams.
  2. What is the best way to assign specific particle seed locations?

    • The new generate_particles() function allows this in a bit of a roundabout way, but I didn't want to start packing too many capabilities into this one function. This could be done using generate_particles() by doing something like the following (assuming a Particles class has been initialized as particles):
      xloc = [1, 20, 16]  # x locations for particles 
      yloc = [1, 12, 33]  # y locations for particles
      np_tracer = 10  # 10 particles at each location
      # assign the first location
      walkdata = particles.generate_particles(np_tracer, xloc[0], yloc[0])
      # loop and assign the rest
      for i in range(1, len(xloc)):
      walkdata = particles.generate_particles(np_tracer, xloc[i], yloc[i], previous_walk_data=walkdata)

Now this seems clunky, but this framework should allow for particles to be put in different regions with ease, or to inject them at different times by simply "generating" more.

For different regions:

# region 1 has 100 particles
x1 = list(range(0,10)) 
y1 = list(range(0,10))
np1 = 100
# generate particles for region 1
walkdata = particles.generate_particles(np1, x1, y1)
# region 2 has 200 particles and is different
x2 = list(range(20,50))
y2 = list(range(20,50))
np2 = 200
walkdata = particles.generate_particles(np2, x2, y2, previous_walk_data=walkdata)
# then these two clusters of particles could be simulated together with
walkdata = particles.run_iteration(walkdata)

For continual insertion of particles through time:

# define region to insert them
xloc = list(range(0,10))
yloc = list(range(0,10))
np = 100  # insert 100 particles every time
walkdata = particles.generate_particles(np, xloc, yloc)
walkdata = particles.run_iteration(walkdata)
# loop through time (say 10 more timesteps) and generate additional particles every timestep
for t in range(0,10):
   walkdata = particles.generate_particles(np, xloc, yloc, walkdata)
   walkdata = particles.run_iteration(walkdata)

So I am overall a fan of the framework, but I am wondering if there is a better way to assign particles to a set list of locations that doesn't make the generate_particles function overly complicated.

Happy to hear any other thoughts/concerns about the rest of the PR as well.

wrightky commented 4 years ago

I've glanced through the code changes a few times and haven't seen anything noteworthy to comment on, it all looks good so far. But I'll check more in-depth before the merge later on. About these specific points:

  1. Backwards compatibility

I didn't think about how the new generate function would complicate the workflow of routines, bit hard to know what the right call is here. I like that keeping the option for seed locations in modelParams keeps the workflow simple for default settings. However, it also feels like it prioritizes one possible seeding mechanism, arguably at the detriment of others. I think that the main utility of the routines functions isn't the particle-instantiation, it's the standardized loop-and-plot workflow. If we removed particle generation from inside the routine entirely, it could then be used with any kind of particle_generator function call. Maybe reworking this function now to add that extra step actually means increased ease-of-use later.

It also still feels a bit like a divided-duties problem if modelParams is storing variables related to particle generation. I think if we want the ability to call generate_particles() without any inputs, it would be cleaner if we just established some defaults for that function that aren't stored in modelParams (e.g. seed locations defaulting to any off-border cells in the array, Np_tracer defaulting to maybe 100). But that's just an idea, I don't feel too strongly about that.

In lieu of a bunch of inputs to steady_plots, it sounds like the easy option would be to refactor examples to separate out those steps:

particle = Particles(modelParams)
walk_data = particle.generate_particles(arbitrary settings)
routines.steady_plots(particle, walk_data, etc)

It's less clear to me how we could modify unsteady_plots to be more general. I suppose the same workflow could work to initialize it, and then things get updated inside the routine. It seems like it could make do with only modelParams and walk_data as inputs, but we would still have to re-create particle every loop... which is maybe fine? Unsure.

  1. Assigning specific locations

So, for this one, I think I'm in favor of the solution that makes the generate_particles() function more complicated if it results in simpler code on the side of the user. These solutions work, and I definitely want to retain the functionality of your latter two examples, as those are cool solutions for complex situations. But I do think we need an easy, intuitive way to feed in exact locations/times without the user writing a loop. Feels like our code is lacking if we make it hard to say put particles exactly here.

I don't think it adds much complexity to modify the generate function to maybe have e.g. a boolean settings flag,

pt.generate_particles(np, xloc, yloc, start_times, previous_walk_data, settings='exact' or 'seed')

wherein the locations are either interpreted as a range and fed into the seed machine, or as exact locations and directly fed into the dict, depending on method specified. Might even give us a place to add extra settings later, if they're minor enough to not need their own generate function. But in general, I think if we're gonna have functions like steady_plots and get_state, we may as well stick to our "hide loops under the hood" mentality whenever possible.

elbeejay commented 4 years ago

In lieu of a bunch of inputs to steady_plots, it sounds like the easy option would be to refactor examples to separate out those steps:

particle = Particles(modelParams)
walk_data = particle.generate_particles(arbitrary settings)
routines.steady_plots(particle, walk_data, etc)

I like this idea, it seems to make it clear that the generate_particles() function "generates" the particles and initializes the walk_data information. Should be possible to resolve the unsteady_plots() routine, not exactly sure how but I will think it over as I sort out the steady_plots function and any other examples related changes.

  1. Assigning specific locations

So, for this one, I think I'm in favor of the solution that makes the generate_particles() function more complicated if it results in simpler code on the side of the user. These solutions work, and I definitely want to retain the functionality of your latter two examples, as those are cool solutions for complex situations. But I do think we need an easy, intuitive way to feed in exact locations/times without the user writing a loop. Feels like our code is lacking if we make it hard to say put particles exactly here.

I don't think it adds much complexity to modify the generate function to maybe have e.g. a boolean settings flag,

pt.generate_particles(np, xloc, yloc, start_times, previous_walk_data, settings='exact' or 'seed')

I think there are 2 ways we could go about creating this kind of functionality. My preference would be to avoid tacking things on to the generate_particles() function and instead make it a bit more modular. In that vein I think there are 2 ways we could accomplish something like that:

  1. Host different types of "generator" functions in a particle_generators.py script and access these via an argument like "settings". We would accomplish this via the following:

    • Keeping generate_particles(np, xloc, yloc, start_times=None, walk_data=None, method='random') in the particle_track.py script

    • Breaking out the "random" generation to a random_generator() function within the particle_generators.py script that has the same input parameters as generate_particles. When generate_particles is called with the method='random' flag, the function random_generator from the other script will be used.

    • Adding an "exact" generator function, exact_generator(), to particle_generators.py, and similarly having that method called when generate_particles is called with the method='exact' flag.

      This would establish a clear method for adding additional "generator" functionality without turning the single function generate_particles() into a massive thing. Then new functions for generating functions could be added and 'keyed' into the generate_particles function by defining the string that will be used to call it.

  2. A slightly larger (but generally similar) change would be to establish the "generator" as it's own class (maybe in its own particle_generators.py script) with methods for each type of particle generation. These methods would basically be the functions described in 1 above. So this would add a step of having to initialize the generator class first, and then the initial particle walk data can be "generated". The biggest benefit to this workflow is that new generator functions would not have to be 'keyed' into the "generator" as they would be methods available to the class. So the API would work like this (assuming random_generator() has been defined):

    particle = Particles(modelParams)
    generator = particle_generator(np, xloc, yloc, start_times, walk_data)
    walk_data = generator.random_generator()

    This way we avoid requiring specific string keywords to define the method we want to use. And we also make the functions "known" to the initialized class via dir(generator) so someone could see all of the methods available to them without having to guess or open to documentation / source code.

Happy to hear your take and which method you think might be better for us to use here in dorado. I'm comfortable implementing either and don't personally have a preference.

wrightky commented 4 years ago

Of these two, I would certainly prefer the former. I don't think I see the appeal for having a generators class fundamentally separate from the Particles class. For one, the only thing I forsee us "generating" is particles -- and the only place I would logically expect to find functions related to "particle generation" is inside of the Particles class. The only two things we do with particles are generate them and iterate them, and it makes the class conceptually pretty empty to externalize one of those two core functions to a totally different class/script. Everything else that might happen to the particles when using dorado is either a function of the model (modelParams) or the rules of evolution (lagrangian_walker). If we were concerned with this script getting too long or complex, I would sooner advocate that we move the modelParams class to its own separate script, rather than half of the Particles functionality.

That being said, I don't actually think we need to move more of these functions out of particle_track. One key lesson we should take from Anuga is that simplifying any one script by moving most of the code into other scripts doesn't make the actual process of reading the code easier, unless they're doing fully unrelated things. If we want to avoid tacking things onto one generate_particles function, we can simply break it into multiple functions with more descriptive names, e.g. generate_particles_in_zone() and generate_particles_at_inds(), all (preferably) housed locally inside the Particles class. The advantage of this (and main downside of having a parent function that pulls from an underlying particle_generators.py script) is that we don't know what kinds of inputs would be required for every possible generator function. If we wanted to add, for example, generate_particles_at_coords(), we would suddenly need access to information not available in the default list of inputs (i.e. raster origin, grid size) -- this would mean adding arbitrary **kwargs as an input to the function housed in Particles, which becomes a mess to interpret.

Granted, we're both assuming that people will want to add particle generation functions to the source-code, rather than doing that process fully independent of dorado. It's possible that step will be application-specific more often than not. If so, then giving people a one-stop-shop generate_particles function that does either 'random' or 'exact' methods is fine, and if they want more they can map their method onto one of those two alternatives (e.g. with the at_coords example, could easily just use coord2ind and then feed that list into the 'exact' generator).

I don't really mind if we break up the functions or not, but I do think we should keep it in-class.

elbeejay commented 4 years ago

Okay that's fine with me

elbeejay commented 4 years ago

I think this is complete now. I am going to look through everything once more tomorrow to see if anything slipped through in the documentation or doc-strings. But should be review-able