quaquel / EMAworkbench

workbench for performing exploratory modeling and analysis
BSD 3-Clause "New" or "Revised" License

Non-uniform distributions for uncertainties #46

Closed jpn-- closed 5 years ago

jpn-- commented 5 years ago

It would be desirable to be able to specify non-uniform distributions on uncertainty parameters. The use case for lever parameters is less compelling, but given the common class definition I see no reason not to allow it there as well.

A complication with incorporating this into the current interface is that the Parameter class currently explicitly calls for lower and upper bounds in the constructor, but not all distributions have such bounds.

A proposed solution: give the bounds default values of 'None' and add a dist argument to the Parameter class, which accepts an object of type scipy.stats.rv_frozen. Upper and lower bounds can be inferred from the dist argument using dist.ppf(0) and dist.ppf(1). Then change the code in samplers.py to draw from the frozen distribution. It may be easier to generate a frozen distribution (by default a uniform) for every parameter on construction even if the dist argument is not given, although I'm not sure how much overhead this will create in running the workbench.
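The proposed interface might look roughly like this; a minimal sketch with a hypothetical `Parameter` class, not the actual workbench code:

```python
# Sketch of the proposed dist argument (hypothetical class, simplified):
# bounds default to None, and a scipy.stats frozen distribution may be
# passed instead; bounds are then inferred via ppf(0) and ppf(1).
import scipy.stats


class Parameter:
    def __init__(self, name, lower_bound=None, upper_bound=None, dist=None):
        self.name = name
        if dist is None:
            # default: a uniform frozen distribution over the given bounds,
            # matching current behaviour
            dist = scipy.stats.uniform(lower_bound, upper_bound - lower_bound)
        self.dist = dist
        # bounds inferred from the frozen distribution; may be +/- inf
        # for unbounded distributions
        self.lower_bound = dist.ppf(0)
        self.upper_bound = dist.ppf(1)


p = Parameter("unc1", 0.0, 2.0)
print(p.lower_bound, p.upper_bound)  # 0.0 2.0
```

The samplers would then draw from `self.dist` uniformly for every parameter, which is what makes a single code path in samplers.py possible.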

And: for some (many) distributions, the bounds may be infinite. Allowing infinite bounds opens some new challenges in other parts of the code (notably visualizations) that would need to be overcome. For example, a large Latin hypercube sample from an unbounded distribution with a long tail would contain a small number of very extreme values, and automatic range detection for figures that does not account for these outliers would make the outputs unusable. It is not clear whether other errors may be introduced elsewhere as well.

Possible methods to address infinite bounds:

  1. Let it go, user beware.
  2. Check for infinite bounds on Parameter construction, and either disallow via an exception or throw a warning.
  3. Generate a wrapper that creates a truncated distribution when desired. We can use the lower and upper bound arguments from the constructor to define the truncated range, and adjust the ppf method to draw from the truncated range only -- ideally all the methods of a scipy.stats distribution would be there too, but if we only want to make random draws from a truncated distribution that shouldn't be too hard.
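To illustrate option 3, a minimal truncation wrapper might remap the ppf onto the truncated quantile range; this is a hypothetical sketch, not workbench code, and it only supports random draws rather than the full scipy.stats interface:

```python
# Option 3 sketch (hypothetical): restrict draws to [lower, upper] by
# remapping quantiles onto the truncated range before inverting.
import scipy.stats


class TruncatedDist:
    def __init__(self, dist, lower, upper):
        self.dist = dist
        # quantiles of the truncation points under the original distribution
        self.q_lo = dist.cdf(lower)
        self.q_hi = dist.cdf(upper)

    def ppf(self, q):
        # map q in [0, 1] onto [q_lo, q_hi], then invert the original dist
        return self.dist.ppf(self.q_lo + q * (self.q_hi - self.q_lo))

    def rvs(self, size=1, random_state=None):
        # draw uniform quantiles and push them through the remapped ppf
        u = scipy.stats.uniform.rvs(size=size, random_state=random_state)
        return self.ppf(u)


trunc = TruncatedDist(scipy.stats.norm(0, 1), -2, 2)
samples = trunc.rvs(1000, random_state=42)
print(samples.min() >= -2 and samples.max() <= 2)  # True
```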

quaquel commented 5 years ago

I agree that this is a useful addition. Others have been asking for it as well.

A few thoughts:

  1. Rhodium has something like this already. It might be worth checking how they do it and see if we can use the same or similar API.
  2. Alternatively, we might have a separate 'constructor' for non uniform parameters. This would keep the API the same for the most common use case, while offering the required flexibility. For example, we might do something like
import scipy as sp
import scipy.stats  # ensure the stats submodule is loaded

dist = sp.stats.beta(2.31, 0.627)

uncertainties = [RealParameter.from_dist("unc1", dist),
                 RealParameter("unc2", 0, 5.5)]

Regarding infinite bounds, I doubt it will be much of a problem for most analysis stuff. All the plotting modules focus on outcomes. Only CART and PRIM might be used to visualise the sampled uncertainty space. I am thus inclined to go with your option 1: document it carefully and let the user deal with it. This is in line with Python's general philosophy that we are adults here and thus there is no need to bulletproof everything.

jpn-- commented 5 years ago

I checked out Rhodium's non-uniform capabilities. It looks like, beyond uniform, they enable normal and lognormal distributions by hard-coding specific derived classes for each. Keeping the workbench API close to Rhodium is a good plan when possible, but I think we are in agreement this isn't the correct path for the workbench.

I also think it's preferable to keep things simple and maintain only one way to make random variable draws from a distribution in the back end. This means that if we want to accept an arbitrary scipy.stats rv_frozen as the distribution, we should just create a uniform one on parameter construction if none is provided. I played with the alternate constructor idea, but I found it easier to have one constructor that accepts either a dist argument or lower and upper bounds (as before), but not both -- explicitly raising an exception if both are provided, as I've done in #48. The current implementation has lower and upper bounds as mandatory positional arguments, so making them optional but validating the arguments as I have done should not break any existing code.
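A minimal sketch of that validation logic (hypothetical and simplified, not the actual code in the PR):

```python
# Sketch of the either/or constructor: bounds or dist, never both.
import scipy.stats


class RealParameter:
    def __init__(self, name, lower_bound=None, upper_bound=None, dist=None):
        if dist is not None and (lower_bound is not None
                                 or upper_bound is not None):
            raise ValueError("specify either bounds or dist, not both")
        if dist is None:
            if lower_bound is None or upper_bound is None:
                raise ValueError("specify lower and upper bounds, or dist")
            # no dist given: fall back to a uniform over the bounds
            dist = scipy.stats.uniform(lower_bound, upper_bound - lower_bound)
        self.name = name
        self.dist = dist


# the existing positional style still works
RealParameter("unc2", 0, 5.5)

# providing both bounds and a dist raises
try:
    RealParameter("unc1", 0, 1, dist=scipy.stats.beta(2.31, 0.627))
except ValueError as e:
    print(e)  # specify either bounds or dist, not both
```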

My preliminary testing indicates no significant speed impact from attaching an rv_frozen to every parameter instead of the current practice -- it turns out the current code technically creates a new uniform rv_frozen for every parameter anyway, so this should generally be just as efficient (the complexity of the non-uniform distribution itself notwithstanding, as that complexity is baked into using it in the first place).

quaquel commented 5 years ago

Thanks for checking Rhodium. Clearly, that is not a route we want to take.

I have been doing some reading on the most Pythonic way to have alternative constructors. Basically, we have two alternative constructors here:

  1. name, lower_bound, upper_bound --> uniform distribution
  2. name, initialised scipy dist --> whatever non-uniform distribution you want

The consensus seems to be to use class methods, or so-called factory functions. So you would need two distinct factory functions: the default one (which needs a better name), and the from_dist approach. The first factory function would create the uniform dist, followed by a call to the actual init, while the second could do some checking (e.g. is the dist integer-only for an IntegerParameter / CategoricalParameter), followed by a call to the underlying init.
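Sketched out, the two factory functions might look like this (hypothetical names apart from from_dist, and the checks are only illustrative):

```python
# Sketch of the classmethod-factory pattern: two named entry points,
# both funnelling into one underlying init.
import scipy.stats


class RealParameter:
    def __init__(self, name, dist):
        self.name = name
        self.dist = dist

    @classmethod
    def from_bounds(cls, name, lower_bound, upper_bound):
        # first factory: build the uniform dist, then call the real init
        dist = scipy.stats.uniform(lower_bound, upper_bound - lower_bound)
        return cls(name, dist)

    @classmethod
    def from_dist(cls, name, dist):
        # second factory: check the argument, then call the real init
        if not hasattr(dist, "ppf"):
            raise TypeError("dist must be a frozen scipy.stats distribution")
        return cls(name, dist)


u = RealParameter.from_bounds("unc2", 0, 5.5)
b = RealParameter.from_dist("unc1", scipy.stats.beta(2.31, 0.627))
```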

I do see your point about simplicity: simply check whether either lower_bound and upper_bound, or dist, is provided. For now that might be sufficient, although it is possibly a bit more difficult to document. Moreover, it breaks the Python idea that there should always be one obvious way to do something. Having separate factory methods for creating parameters from a dist, or with a uniform dist, provides this clarity.

In short, I haven't made up my mind yet regarding the preferred API for this. The other changes are largely fine in your PR. Although, what is your idea behind the UniformLHS sampler?

jpn-- commented 5 years ago

The thing about factory methods is that the "default" constructor is usually the one whose arguments are closest to the attributes being set and require the least processing. E.g., a pandas DataFrame can be constructed most directly from data/index/columns, and has factory methods for building one from a dictionary or other inputs. If I were building this class from scratch, it might be simpler to have a direct constructor that takes an rv_frozen dist argument, and an alternate form that takes (lower, upper) bounds, either as a 2-tuple in the dist argument or as a classmethod factory. But to prevent the change from breaking current code, the default constructor still needs to accept lower and upper bounds as arguments.

If you still think having two factory classmethods will be valuable, I can write them. Alternatively, I'll just make changes to the class documentation to explain the arguments, which I don't think is too hard -- see https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.interp.html for an example.

The thinking on the UniformLHS sampler is to provide a simple way to ignore the selected distributional shapes, if that is desired for some kinds of analysis, particularly if the distributions are skewed. E.g., for a robust optimization task you may want distributions that take account of the expected likelihoods, but when searching for worst-case scenarios you may want to sample more uniformly across the uncertainty space, as this gives better experimental coverage in the tails where a worst-case scenario is more likely to appear.
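To illustrate the idea with a rough sketch (hypothetical code, not the workbench internals): the same one-dimensional Latin hypercube design can either be pushed through the distribution's ppf, so draws follow its shape, or spread linearly over the bounds, which is what a UniformLHS-style sampler would do.

```python
# One LHS design, two ways of mapping it onto a parameter (hypothetical).
import numpy as np
import scipy.stats


def lhs_unit(n, rng):
    # one-dimensional Latin hypercube: one draw per stratum, shuffled
    strata = (np.arange(n) + rng.random(n)) / n
    rng.shuffle(strata)
    return strata


rng = np.random.default_rng(42)
dist = scipy.stats.beta(2.31, 0.627)  # a skewed distribution on [0, 1]
u = lhs_unit(100, rng)

shaped = dist.ppf(u)       # draws follow the beta shape
uniform = 0 + u * (1 - 0)  # UniformLHS idea: spread uniformly over bounds
```

The `uniform` samples cover the tails of the uncertainty space evenly, whereas `shaped` concentrates draws where the beta density is high.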

quaquel commented 5 years ago

Another relevant consideration is API design. In my view, the most common use case of the workbench is exploratory modelling using uniform distributions. Given that this is the most common use case, it makes sense to support this with the default init.

Documenting attributes of a class is done separately (see also the numpy doc standard).

I have to think about this some more. Leave the PR for now. I have a conference next week where I hope to have time to look at it again more closely. I might be able to find a compromise between your preferred approach and my current thinking about using factory methods. It might, for example, be possible to use a metaclass to resolve the issue.