TNTran92 opened this issue 2 years ago
suggestion: let's be more specific about what "apply" means. E.g., would one expect a wrapper function like `benchmark(estimators, metrics, data, config)` that returns results or a dataframe?
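A minimal sketch of what such a wrapper could look like, assuming sktime-style forecasters (with `clone`/`fit`/`predict`) and callable metrics. The function name, arguments, and `config["fh"]` key are hypothetical, not an existing sktime or kotsu API:

```python
import pandas as pd


def benchmark(estimators, metrics, data, config):
    """Hypothetical wrapper: fit every estimator on every dataset, score every
    metric, and return one result row per (estimator, dataset, metric)."""
    rows = []
    for est_name, estimator in estimators.items():
        for data_name, (y_train, y_test) in data.items():
            est = estimator.clone()  # fresh, unfitted copy for each run
            est.fit(y_train)
            y_pred = est.predict(fh=config["fh"])  # fh = forecasting horizon
            for metric_name, metric in metrics.items():
                rows.append(
                    {
                        "estimator": est_name,
                        "dataset": data_name,
                        "metric": metric_name,
                        "value": metric(y_test, y_pred),
                    }
                )
    return pd.DataFrame(rows)
```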
> suggestion: let's be more specific about what "apply" means. E.g., would one expect a wrapper function like `benchmark(estimators, metrics, data, config)` that returns results or a dataframe?
I've modified the title to be more specific. Will keep adding comments throughout the day
> suggestion: let's be more specific about what "apply" means. E.g., would one expect a wrapper function like `benchmark(estimators, metrics, data, config)` that returns results or a dataframe?
From vanilla kotsu, users should expect:
1. model registration: `model.registration(estimator1, estimator2, etc)`
2. validation registration: `Validation.registration(dataset, cv_settings, metrics)`
3. a .csv in the same folder as the notebook showing the results
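For concreteness, roughly what that flow looks like. The registry and `run` names below are my recollection of the kotsu README and may differ; `make_airline_cv_validation` is a hypothetical validation factory, and I'm assuming kotsu calls the returned validation with a freshly constructed model and collects the returned dict of results:

```python
import kotsu
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError


def make_airline_cv_validation():
    """Hypothetical factory returning a validation callable: model -> dict of results."""
    y = load_airline()
    y_train, y_test = temporal_train_test_split(y, test_size=24)

    def validation(model):
        model.fit(y_train)
        y_pred = model.predict(fh=list(range(1, 25)))
        return {"MAPE": MeanAbsolutePercentageError()(y_test, y_pred)}

    return validation


# 1/ register models (entry point = factory/class, kwargs = its parameters)
model_registry = kotsu.registration.ModelRegistry()
model_registry.register(
    id="NaiveForecaster-drift-v1",
    entry_point=NaiveForecaster,
    kwargs={"strategy": "drift"},
)

# 2/ register validations (dataset + splitting + metrics bundled in the factory)
validation_registry = kotsu.registration.ValidationRegistry()
validation_registry.register(
    id="airline-holdout-24-v1",
    entry_point=make_airline_cv_validation,
)

# 3/ run everything; results land in a CSV next to the calling notebook/script
kotsu.run(model_registry, validation_registry)
```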
Note to self: tested all estimators for point forecasting. 3 estimators don't work. Those are:
* [ ] AutoARIMA
* [ ] Prophet
* [ ] VAR
I can help look into what's causing the issues for those estimators if help is wanted :+1:
I've looked into `kotsu` and the draft PR https://github.com/alan-turing-institute/sktime/pull/2977. I think overall it is great in simplifying functionality!
One design point that bothers me, though, is the `kotsu` design which separates so-called "entry points" (factories or generators) from their kwargs, which in the case of `sktime` would mean separating classes from their hyper-parameter settings (objects cannot be passed, since the factories are called with the kwargs, and objects in general do not have a `__call__` defined).

That design is inconsistent with `sktime` (and `sklearn`), where objects are used as "blueprints", especially composites. Classes, in isolation, cannot specify runtime composites. Do I understand this correctly?
So, we may need to think about how we can best deal with objects being used as "blueprints", along the lines of the `sklearn` design.

Of course you could hack it, by splitting the object into its class, used as the entry point, and extracting the kwargs via `get_params(deep=False)`, but that seems like an interface patch rather than a principled approach.
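Spelling that workaround out, as a minimal sketch:

```python
from sktime.forecasting.naive import NaiveForecaster

# A configured (unfitted) sktime object, e.g. built interactively:
forecaster = NaiveForecaster(strategy="drift", window_length=12)

# The "interface patch": split it into the pieces kotsu currently expects,
# i.e. the class as the entry point and its top-level params as kwargs.
entry_point = type(forecaster)
kwargs = forecaster.get_params(deep=False)

# entry_point(**kwargs) then rebuilds an equivalent, unfitted estimator.
rebuilt = entry_point(**kwargs)
```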
Another way might be to invoke `kotsu` - or rewrite `kotsu`, if not currently possible - so that objects can be passed directly as entry points, i.e., the result of a current `entry_point(**kwargs)`, rather than separately passing `entry_point` and `kwargs`.
Hey @fkiraly (and @TNTran92 feel free to jump in also) - we've been considering the comment above, as well as the discussions we had at the dev days regarding this point.
We see there being a strong reason not to pass or use a single object/instance within benchmarking, because estimator objects can and do often carry a lot of state. One of kotsu's goals is to help users avoid common pitfalls in benchmarking, one of those common pitfalls being information leakage between validation runs. Since such pitfalls cause a critical failure of any fair benchmarking attempt and are hard to identify, kotsu tries to do what it can to avoid such occurrences; hence the interface demands a constructor/entry point which returns a fresh instance of an estimator.
Our suggested solution, which we think gets the best of both scenarios, is to pass the estimator's `clone` function as the entry point. This will(/should) ensure that every time the estimator is used in validation, it is a fresh instance, while still allowing users of the benchmarking module to register an estimator instance in a convenient way. We would still advise users to write factories for the models once they know they want that model to be a benchmarking example/candidate. Usage of the `kwargs` arg is optional - we often wrap our best competing models into their own factory functions which include their parameters hardcoded within the function; that way, when we want to use one in production, we can't get any of the winning model's config wrong.
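A minimal sketch of what that registration could look like, reusing the kotsu-style `model_registry` from the earlier sketch (registry names remain assumptions):

```python
from sktime.forecasting.naive import NaiveForecaster

# A configured instance, e.g. created in a notebook session:
forecaster = NaiveForecaster(strategy="drift")

# Register the bound `clone` method as the entry point: each validation run then
# calls it and gets back a fresh, unfitted copy with the same hyper-parameters.
model_registry.register(
    id="NaiveForecaster-drift-v1",
    entry_point=forecaster.clone,
)
```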
I think this might indicate a possibly deeper difference in the way we are thinking about benchmarking - do correct me if I'm mistaken in any/all of this - namely that we're thinking about benchmarking as a repeatable set of experiments and their records, while on the other side there's thinking about benchmarking as an interactive, explorative process (i.e. in an interactive python/jupyter session). This could be exemplified by a) the request to have benchmarking results returned directly as an in-memory dataframe vs persisting the results to disk, and b) wanting to pass estimator instances possibly created within an interactive session vs wanting to pass estimator factories/constructors/entry_points that are written and organised into modules. Does any of this sound likely/reasonable?
There is another consideration against registering estimator instances (vs a factory or constructor for that estimator), which is likely the reason gym.env wrote the interface in this way originally: as the estimators one wants to benchmark grow in size and in number, one has an increasing number of objects hanging around in memory. This is likely fine for more classic estimators, but given for instance the recent push towards implementing deep learning models in sktime, I would be concerned about registering many differing instances of such models given their memory requirements. We're using kotsu for benchmarking models in the bioinformatics space and we couldn't/wouldn't want to load multiple versions of such models at once on our compute cluster.
@DBCerigo, ok, let me try to answer your long comment. Ahead of everything, I want to say that things are quite subtle, and I think several things are true at the same time.
The subtlety comes from separating concepts and how they are linked to the common sklearn-like designs. We should separate three concepts and their python counterparts, as they commonly appear in those designs:
(a) an abstract specification/blueprint of an algorithm/estimator, with named parameters that can be set (but without values set)
(b) an abstract specification/blueprint of an algorithm/estimator as in (a), plus specific values chosen for the named parameters
(c) a concrete algorithm/estimator which has ingested data and has an internal model state, after specification as in (b)
If I may try to rephrase your concern: you are saying that (c) is not good to have as input to benchmarking, since it can carry state and leakage. I fully agree with that. You further conclude that, therefore, we should not have objects as inputs. I disagree with that. The reason is how (a) to (c) map onto python objects, as follows:
(a) = sklearn-type class
(b) = sklearn-type object, post-init, with parameters set, but not in "fitted" state
(c) = sklearn-type object, in "fitted" state
So, the possibly confusing thing is that sklearn objects may represent a "pure specification" (b), but also a highly stateful object (c). What separates "specification" from "stateful" is not the class/object boundary, but the fitted/unfitted boundary.
Specifications are fine, and desired as input to benchmarking, as conceptually they are the medium of reproducibility and replicability. So, (a) and (b) are fine and desired; (c) is not.
Coming back to the argumentation: I think you missed that, conceptually, the (b)-type objects are "specification-like", and that you were thinking all objects (as opposed to classes) were "non-specification-like".
So, as a design consequence, my preferred design has these properties:
The type change diagram looks like this:

* `__init__`: (a) -> (b), new object
* `fit`: (b) -> (c), same object
* `reset`: (c) -> (b), same object
* `clone`: (c) -> (b), new object
* `type`: (c), (b) -> (a), new object (unique class)
Re memory use: (b) type objects typically occupy minimal memory, since all they contain is the information of "which class" and "which parameter values". (c) type objects can have arbitrary size, and have the problem that you describe (but I don't want those, see above).
PS: since you are referencing `gym.env`: the "sklearn-like" counterpart for reinforcement learning, with the design as above, is stable-baselines3, which provides algorithms with a unified interface.
`sktime` examples for (a), (b), (c):

```python
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster  # NaiveForecaster is type (a)

y = load_airline()
forecaster = NaiveForecaster(strategy="drift")  # forecaster object is type (b)
forecaster.fit(y)  # after fitting, it changes into type (c)

blueprint = forecaster.clone()  # calling clone produces a different object of type (b)
forecaster.reset()  # alternatively, call reset, which reverts forecaster itself to type (b)
```
Possible next steps (if #2977 merged):

* implement storing estimator predictions
* implement computing metrics and cross-estimator analyses based on stored predictions
* implement other types of benchmarks:
  * Classification
  * others?
  * M4
  * Bakeoff
* Benchmarking docs need updating for this new functionality?
@TNTran92 any of these you want to go for? Classification benchmark implementation maybe?
Sure, I will take the classification one
@fkiraly thanks for putting across your thoughts so precisely and thoroughly - they're very helpful.
We discussed and thought through some solutions. The most compelling argument for us is the original thrust of the implementation, which was to "interface using only sktime/sklearn objects and interfaces", and we think we've found a good solution which is in line with your preferences for the interface and achieves that original goal even better.
I've implemented our proposed change: see the commit [ENH] Make benchmarking interface take instances in #2977.
The summary is:

* changed the registration argument from `estimator_entrypoint: Type[BaseEstimator]` to `estimator: BaseEstimator`
* removed `estimator_kwargs` entirely
* internally, `estimator.clone` is passed to kotsu as the entry point it will use

We believe this achieves the goals discussed above.
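To make the change concrete, a hypothetical usage sketch of the updated interface (the class and method names here are assumptions for illustration and may not match the exact code in #2977):

```python
from sktime.benchmarking.forecasting import ForecastingBenchmark  # name assumed
from sktime.forecasting.naive import NaiveForecaster

benchmark = ForecastingBenchmark()

# Previously one would have registered a class plus kwargs; now a configured
# instance is passed directly, and `estimator.clone` is handed to kotsu
# internally so each validation run gets a fresh, unfitted copy.
benchmark.add_estimator(NaiveForecaster(strategy="drift"))
```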
We look forward to hearing what you think. Thanks again for pushing on it - we think it is much improved for it. So of course do push again if/as you think it has further to go.
[edit: change commit link to PR link with commit message, as more stable reference]
@DBCerigo, I looked at the changes - I'd be happy with this, nice!
Disallowing (a) is one of many sensible choices imo, because there might not be a clear way to get defaults for parameters that do not have them. `create_test_instance` is one, but that's not necessarily the "best default", only a test default, and might be confusing for the user.
Minor thing: I think there are multiple PRs (three total?) related to `kotsu` now, would you mind going through them and cleaning up? Either closing, or chunking up; in any case, just quickly looking at them to see whether they are still all needed and active.
I was looking at it and really got confused when trying to review, especially since your linked change above does not seem included yet.
> @DBCerigo, I looked at the changes - I'd be happy with this, nice!
Great!
> Disallowing (a) is one of many sensible choices imo, because there might not be a clear way to get defaults for parameters that do not have them. `create_test_instance` is one, but that's not necessarily the "best default", only a test default, and might be confusing for the user.
Yep, great.
> Minor thing: I think there are multiple PRs (three total?) related to `kotsu` now, would you mind going through them and cleaning up? Either closing, or chunking up; in any case, just quickly looking at them to see whether they are still all needed and active.
I've closed https://github.com/datavaluepeople/sktime/pull/1 now. Another PR is https://github.com/alan-turing-institute/sktime/pull/2805, which I can't close, but presumably @TNTran92 or you can if deemed appropriate. The one I deem active, from my point of view, is https://github.com/alan-turing-institute/sktime/pull/2977.
> I was looking at it and really got confused when trying to review, especially since your linked change above does not seem included yet.
Soz for that! Was all my fault - I linked the wrong commit (classic: add the link while drafting the message, then think "oh, I want to change some code", rebase, and then forget to update the link), sorry for adding to the confusion! I updated the message to hopefully avoid the same for others.
I have closed #2805
@DBCerigo I suggest adding M5 to the datasets. Since M4 was assigned to you (#2989), can I take M5?
@TNTran92 sure thing. I haven't looked properly at what adding M4/M5 would entail, nor how I think it could/should interact with benchmarking, but defs dive in if you have the time. I'll start looking into it after #2977 is merged and I've reassessed what I think is best to work on next.
**Is your feature request related to a problem? Please describe.**
Reference issue: #2777. Kotsu is a high-level, lightweight benchmarking package that has no dependency on specific packages. At a high level, it allows users to register a dataset with a validation task (CV splitter method) and then apply estimators to that registration. It would be beneficial to investigate its application to sktime's benchmarking.

**Describe the solution you'd like**
Input: estimator, dataset, cross-validation splitter. Output: a DataFrame showing the result per estimator per dataset.*
*: In this notebook, the result is returned as a .csv saved to the same folder as the notebook.

**Describe alternatives you've considered**
N/A

**Additional context**
N/A