Closed jesellier closed 1 year ago
Distribution :
The v2 Introduce an OOD implementation of the distribution class. The 'DistributionBase' class serves as the main base abstract for all distribution concrete class.
1. Vectorization and Parameters access : The vectorization functionality are maintain i.e. a single distribution object can be vectorized within a single object instanciation. A list of parameters is passed and kept as member. The subset of the vector can be accessed through a 'getitem' implementation as before. Some new parameter accessor have been added :
-> get_params(index) : override the scikit-learn 'get_params' implementation : 1.to get an indexed version of it if needed (since the distribution are here vectorized). 2. for the deep functionality to also works with list of estimators (and not just simple estimators) which is needed for mixture
-> get_param(key) : return a list of the keyed parameter ('string') accross the distributions. If the distribution is vectorized it will only operates for the distribution subset indexed by 'cached_index' (used for some mode of evaluation see below)
2. Additional component added to characterize the distribution type: -> distType enum : Discrete, Continuous, Mixed) -> Support class : object that return a boolean for a given scalar - by default instantiate a NullSupport that always return TRUE (see in distributions->component->support) They are passed by composite at the base class level.
3. Evaluation functions Interface (PDF/CDF) : The distribution evaluation methods implementation rely on a implicit interface design, to separate some processing/checking common to all distributions from the more specific implementation details.
The basic interface methods (pdf/pmf/cdf) are the ones visible and used by the the user. Each of these first perform a series of checks i.e. : 1. assess the validity of the distribution type for the function considered (ex. pdf should only accept continuous or mixed distribution), 2. ensure that the argument is within the support range, or 3. manage the mode of evaluation ([BATCH, ELEMENT_WISE] - describe later).
Then they call some abstracted implied methods (pdf_impl/pmf_impl/cdf_impl) that must contains the implementation details. They are declared abstract in the DPQRMixin and must be overriden in every sub-concrete class. If no concrete override is implemented, the DPQRMixin implementation raise an error by default.
4. Mode mechanism : The existing mode of evaluation mechanism (i.e. [BATCH] or [ELEMENT_WISE] have bee adapted). Every concrete evaluation methods are assumed to be implemented as BATCH mode. The parameters arguments must be called using the private method 'get_param' declared in the distribution base class, that returns the vectorized parameters sliced by a private 'cached_index' member. By default 'cached_index' is set (and reset) to 'slice(None)' which corresponds to a BATCH evaluation where all distributions are evaluated.
The ELEMENT_WISE mode (if activated) decorates the implied methods with 'elementWiseDecorator' . It only evaluates each distribution to a corresponding sample subset (according to the element-wise rule) by modyfying the cach-index before calling the parameters accessor. The results are then aggregated back into a result list
The 'Mode' of a distribution object can be accessed or changed by the user using the following methods:
-> getMode() output the current active Mode -> setMode() reset the current mode to (). Accept a 'Mode' enum argument [Mode.ELEMENT_WISE, Mode.BARCH]
5. Concrete class implementation : a list of concrete class have been implemented.
PS : Some implementation can be visible in the test (tests->distributution) or in the notebooks (notebboks->skpro_distribution_description and notebooks->skpro_distribution_interface)
Losses
The refactoring introduce a OOD implementation for the losses as well.
1. Probabilistic losses as functor object : The logLoss, a ClippedLogLoss and the integratedLogLoss have been added.
2. ProbabilisticScorer : Similar to scikit-learn scorer, It acts as a wrap-up class that directly evaluates an error given a loss functor and a probabilistic estimator.
PS : A default score method (directly instantiating and returning the scorer) can be associated to some ProbabilisticEstimator object by inheriting the mixin DefaultScoreMixin (in skpro->estimators->score - set to logLoss).
3. ModelManager for Scoring : A model Manager has been added to compute the scores or cross_val_score of a vector of model in a probabilistic setting (see skpro->cross_validation):
PS: some test can be found in tests->test_validation
Estimators
Probabilistic Estimators Design : All probabilisic estimators inherits from a common base asbstract class "ProbabilisticEstimators" (that itself inherits from scikit learn BaseEstimator). The base class specify the methods 'predict' (that return a user define point estimate associated with the distribution of the probabilistic prediction) and 'predict_proba' (that return the predictive distribution directly).
The concrete '2-step' probabilistic estimator It has been segregated into three different inter-connected sub-estimators :
The main ParametricEstimator is initialized with a classical 'mean_estimator' and a classical 'ResidualEstimator' object as member (eventualy wrapped-up by a ClippedEstimator). Since the residual estimator must use the fitted mean_estimator for its own fitting, the two must be linked : The mean_estimator is thus also shared/passed to the residual estimator within the 'ParametricEstimator' init method scope. It is done through a 'call' overload method (acting as a set method) declared in the abstract base of all residual estimators. Then the Parametric 'fit' method works by fiting the mean estimator and then the residual one (that is thus making use of the mean one already fitted)
The losses and the distribution type are finally defined in a modular way :
Others
The other parts of the code (Bagging, Baselines, Empirical densities) have been adapted to the change :
For Bagging : A probabilistic estimator 'ProbabilisticBaggingRegressor' have been added in (skpro->ensemble) with its specific 'BaggedDistribution' predict return type.
The refactored Empirical Density wrap-up estimator can be found in (skpro->baseline->density_baseline) base line.
makes a lot of sense. I have no major comments or criticism at this point in time. (except typos: you mean oop, not ood, right?)
As a proposal/issue content, this is also very nicely worked out.
I would hence suggest to create a wiki, and put all the above in a new page, skpro v2 design documents or similar, so it doesn´t get forgotten once this issue is closed.
Hi @jesellier,
Nice proposal, thanks for writing this up. This seems like a natural evolution of skpro.
I like the idea to make the loss functions callable objects. Once the abstract interface is in place it might be worth to revisit issue #2?
I have three minor comments on the distribution diagram:
1) In the DPQRMixin you proposed a implied method, which seems to be the method that implements the actual method, is that right? If these methods are implementation only and should not be called by users, I'd suggest to make it private, e.g. _pdf()
etc. That would avoid the name clash with the public pdf() so there would be no need for the _imp
suffix.
2) Correct me if I'm wrong but it looks like the DPQRMixin contains no concrete implementation but is just an abstract interface description. If so, why is it a Mixin and not just a direct declaration in the DistributionBaseClass? Or are you planning to provide different concrete implemented Mixins that inherit from the interface and that users can include based on their needs?
3) We once discussed in #3 whether point
makes sense or it should just be mean
to avoid confusion. Might be worth renaming that during the refactoring.
Let me know if I can help with anything. I'm happy to contribute or review.
Hello @frthjf
Thanks for you comment. To answer your points :
It's a valid one, it should indeed be private. I can easily add it. For the suffix 'impl' it is actually a common one that is used when separating interface to implementation (at least in C++ or where I worked). I just naturally picked it up.
It can equally be put directly into the distribution base. I just created the Mixin to alleviate the base class content. Since as such it is just a group of similar pure abstract methods with no real information conveyed (instead of all in one file).
Both should be OK. But I can see some advantage in using the 'point' naming convention. It gives the user (at no cost) the flexibility to choose what the point prediction be for each distribution (that might be other than the mean for some). An alternative can be to get a concrete point in the BasicStatsMixin that return the abstract 'mean' (i.e. default implementation). Then 'point' can be overriden in every distribution subclass if the point prediction is not the 'mean'. As such we get the best of both world (flexibility and not having to specifiy the point for every distribution class).
Actually, if you have time, I can send you a code review request before comitting. That would be really helpful. We can also organize a discussion for future roadmap.
Thanks!
np.mean
(see #3). However, it's still a valid choice, so I'm fine with using point(). Following this logic, should we also use deviation
as abstract and std
as concrete implementation?Yes, I'd be happy to review and discuss if that's helpful.
Re 3 - One could give distributions a location
and scale
method, which are distribution-specific, besides mathematically defined statistics such a mean
, median
, variance
, etc.
I would not call anything point
since it implies a prediction, which is not a property of the distribution object itself.
It is interesting to think about what the default should be, in reduction (or probabilistic to deterministic). I thought that should be the mean
(since predictive conditional mean minimizes MSE), but one could also choose it to be location
for semantics, or median
for consistency and for making sure it always exists.
I think having location
/scale
as generic and semantically vague default and optional derived mean
/variance
that may or may not exist is a good idea.
But not all distributions can be fully parametrized by two location
/scale
parameters. Not sure it should be pushed to the distribution base class level.
And actually I have been using a mixin class 'LocationScaleMixin' (see distributions->location_scale.py) for the location-scale family distribution (i.e. normal and laplace so far that are both instantiated with loc and scale parameters). Right now it is mainly use for the '2 step estimator' : 1. To ensure that the pre-specified returned distribution type is indeed a location-scale one (i.e. inheriting from this mixin). 2. To convert the std
estimate (from the residuals) into a scale
parameter (one to one for normal but different for laplace). It can be developped a bit more (ex. with loc and scale get method declared at this mixin level).
For the point
method, an alternative might be to simply remove it here. The point prediction type can be directly encoded in the ProbabilisticEstimator base class (i.e. predict method). We can imagine passing a string that specifiy the return type ('mean, median, loc') - then condition apply depending of the distribution (since not all might have a loc method for exemple). That's just a proposition.
I think removing would weaken the purpose of the distribution interface. After all, you want to type my_distribution.
and have your IDE autocomplete my_distribution.mean()
for you. How about we just define all of them as
def location():
raise NotImplementedError()
def mean():
raise NotImplementedError()
...
and let the user decide which one to implement. When interacting with the distribution we can just operate on it in pythonic EAFP style:
try:
return dist.mean()
except NotImplementedError:
return dist.location()
It's probably best to also introduce a NotExistingError() for the user to indicate that the mean actually does not exist.
But not all distributions can be fully parametrized by two location/scale parameters. Not sure it should be pushed to the distribution base class level.
Of course, expecially mixtures and such. If we go with location/scale methods, I would still strongly argue to have the actual parameterization in the hyper-parameters, and location/scale not necessarily reflecting a complete parameterization.
Though maybe location/scale methods are not a good idea after all? Since not all distributions will have one that is well-defined (e.g., what even is that for a mixture).
After all, you want to type my_distribution. and have your IDE autocomplete my_distribution.mean() for you.
How would that auto-complete even work, @frthjf? Based on what would one compute the mean?
It's probably best to also introduce a NotExistingError() for the user to indicate that the mean actually does not exist.
Yes, good idea to distinguish this, or more precisely between:
How would that auto-complete even work, @frthjf? Based on what would one compute the mean?
If the method is defined by the base class, it would be autocompleted via __dict__
lookup (at least in jupyter notebooks). If you can't compute the mean because the distribution does not have one, the call just triggers the exception and you get an error, simple as that.
Yes, good idea to distinguish this, or more precisely between:
* does not exist (theoretically) * is implemented, and analytically or numerically exact * is implemented, but may not be exact, e.g., by auto-complete or numerical imputation
Good point! That's nice because the errors become automatically descriptive and can be handled properly. For example, it would be very easy to fall back on approximation if analytical solution does not exists, like
from skpro import exceptions as E
try:
dist.mean()
except E.NotExisting:
raise AttributeError('Though luck my friend')
except E.NotExact:
print('warn: falling back on approximation')
approximateMean(dist)
@frthjf actually I had in mind to remove the point
method only not the other ones. This is what I commented above :
class BasicStatsMixin():
@abc.abstractmethod
def mean(self):
raise NotImplementedError()
@abc.abstractmethod
def std(self):
raise NotImplementedError()
[...]
class DistributionBase(BaseEstimator, BasicStatsMixin, DPQRMixin):
[...]
class LocationScaleMixin():
@abc.abstractmethod
def scale(self):
raise NotImplementedError()
@abc.abstractmethod
def loc(self):
raise NotImplementedError()
@classmethod
def varianceToScale(cls, variance):
raise NotImplementedError()
class Normal(DistributionBase, LocationScaleMixin) :
def __init__(self, loc = 0.0, scale = 1.0):
self.loc = loc
self.scale = scale
[...]
def mean(self):
return self.loc
def std(self):
return self.scale
def variance(self):
return np.square(self.scale)
def loc(self):
return self.loc
def scale(self):
return self.scale
@classmethod
def varianceToScale(cls, variance):
return np.sqrt(variance)
Then having a class defined we can restrict the 2 step estimator to them:
class ParametricEstimator(ProbabilisticEstimator):
def __init__(self, mean_estimator, dispersion_estimator,
distribution = Normal(), ...) :
if not isinstance(distribution, LocationScaleMixin):
raise ValueError("Unknown strategy type : expected a location_scale distribution")
self.distribution = distribution
Ok, makes sense! I'm still not fully convinced of the Mixin approach. I prefer a ducktyping approach over explicit declaration, but I think that's just a matter of taste, so happy to go with that.
I´m actually more in favour of explicit typing here, since functionality will be type or property dependent - I do think mixins add a lot here without becoming too unpythonic. This is very similar in flavour to the sklearn RegressorMixin etc, I think.
Actually following our last discussion @fkiraly , I am wondering if it not better to delegate all the distributions to Tensorflow-Probabilities. They already have a full framework in place that seems well done (in functionality and range of existing distributions) and we have then an immediate compatibilities for Neural-Network based estimators (without any need for wrap-ups). It should not be too long to adapt the rest to it.
yes, it looks mature enough - I remember we had this discussin off-line. Especially it supports vectorization which we need.
To recall, I think the problems are:
That´s not a complete "no", but there is a trade-off. Where this kind of discussion is best had in a design phase (rather than after having implemented things...)
I've been working with TFP for a while and I agree it has a good feature set (after all, PyMC switched to it as a backend as well). One big benefit would be that we would get hardware acceleration for free and it would also save a lot of reinventing the wheel. So if the question is whether we should delegate the low-level functionality to Tensorflow, I think that's a good idea. When it comes to the high-level API, however, Tensorflow can get indeed messy very easily. The upside is that there is lots of room for skpro to provide value beyond just being a wrapper if we manage to abstract away some of TF's messiness. I'd be interested to work on this kind of problems although I suspect it's not going to be easy.
Following our discussion by mail, I have coding a skeleton with tensor flow probabilities that seems to work.
Ex : For the base distribution class :
-> inherit from sklearn baseEstimator and tensor flow Distribution
-> pass a tensor flow distribution by composition (member : _dist
)
class supportType(Enum):
UNDEFINED = 0
DISCRETE = 1
CONTINUOUS = 2
MIXED = 3
class outputType(Enum):
ND_ARRAY = 0
TF_TENSOR = 1
class DistributionBase(BaseEstimator, tfd.Distribution):
def __init__(self, dist, stype = supportType.UNDEFINED):
if not isinstance(dist, tfd.Distribution):
raise ValueError('Invalid composite distribution %s for distribution %s. '
'must be of type tensorflow_probability.distribution`.' %
(dist, self))
self._dist = dist
self._stype = stype
self._otype = outputType.ND_ARRAY
def dist(self) :
return self._dist
Then we can have some get_params
and set_params
that acts with/to the _dist
def get_params(self, index = None, deep=True, output_tensor = False):
out = dict()
#duplicate sk.learn distlementation
for key, value in self._dist.parameters.items() :
value = self._convert_to_array(value)
if index is not None \
and not isinstance(value, str) \
and hasattr(value, "__len__") \
and hasattr(value, '__getitem__') :
out[key] = value[index]
else :
out[key] = value
return out
def set_params(self, **params):
if not params:
# Sdistle optimization to gain speed (inspect is slow)
return self
valid_params = self.get_params(deep=True)
for key, value in params.items():
if key not in valid_params:
raise ValueError('Invalid parameter %s for estimator %s. '
'Check the list of available parameters '
'with `estimator.get_params().keys()`.' %
(key, self))
dic = dict(self._dist.parameters, **params)
self._dist = type(self._dist)(**dic)
return self
And I override the tf.Distribution methods :
-> just few below as exemple (there is much more available)
-> I added an automatic output conversion to array (vs tf.tensor type) that is controled by the
self._otype
member
def _convert_to_array(self, arg):
if isinstance(arg, tf.Tensor):
if self._otype is outputType.ND_ARRAY :
arg = arg.numpy()
return arg
# accessors
@property
def name(self):
return self._dist.name
@property
def sample(self, sample_shape=(), seed=None, name='sample', **kwargs):
return self._dist.sample(sample_shape=sample_shape, seed= seed, name= name, **kwargs)
def prob(self, value, name='prob', **kwargs):
out = self._dist.prob(value = value, name= name, **kwargs)
return self._convert_to_array(out)
def cdf(self, value, name='cdf', **kwargs):
out = self._dist.cdf(value = value, name= name, **kwargs)
return self._convert_to_array(out)
def entropy(self, name='entropy', **kwargs):
out = self._dist.entropy(name= name, **kwargs)
return self._convert_to_array(out)
def mean(self, name='mean', **kwargs):
out = self._dist.mean(name= name, **kwargs)
return self._convert_to_array(out)
We can that add Mixins. See below with the normal distributions (add two mixins adding the L2-norm and the Loc-Scal methods):
class DPQRMixin():
@abc.abstractmethod
def squared_norm(self):
raise NotImplementedError()
class LocationScaleMixin():
@property
def loc(self) :
raise NotImplementedError()
@property
def scale(self) :
raise NotImplementedError()
@classmethod
def varianceToScale(cls, variance):
raise NotImplementedError()
class Normal(DistributionBase, DPQRMixin, LocationScaleMixin):
def __init__(self,
loc = 0,
scale = 1,
validate_args= False,
allow_nan_stats= True,
name='Normal'):
dist = tfd.Normal(
loc,
scale,
validate_args,
allow_nan_stats,
name
)
super(Normal, self).__init__(dist, stype = supportType.CONTINUOUS)
@property
def loc(self):
return self._dist.loc
@property
def scale(self):
return self._dist.scale
def squared_norm(self):
scale = tf.convert_to_tensor(self._dist.scale)
sqrt_pi = tf.constant(np.sqrt(np.pi))
out = 1.0 /(tf.math.pow(2.0, scale) * sqrt_pi)
return out
Overall, the code is much lighter for pretty much same functionalities. And we get all tesor flow additional functionalities for free (including AAD).
superseded by https://github.com/sktime/skpro/issues/10
See below some comments/description of the coming refactoring contents :
Some descriptive notebooks (in docs->notebooks) and a full set of unit test (in tests) are also available.