[ENH] Using Kotsu as a benchmarking framework - Implementation in sktime

TNTran92 commented 2 years ago

Is your feature request related to a problem? Please describe. Reference Issue #2777 Kotsu is a high level and light-weight package that performs benchmarking that has no dependency on specific packages. At a high level, it allows users to register a dataset with a validation task (CV splitter method) and then apply estimators to that registration. It will be beneficial to investigate its application to sktime's benchmarking.

Describe the solution you'd like Input: estimator, dataset, cross validation splitter Output: a Dataframe showing result per estimator per dataset *: In this notebook, the result is returned as a .csv saved to the same folder as the notebook

Describe alternatives you've considered N/A

Additional context N/A

fkiraly commented 2 years ago

suggestion: let´s be more specific about what "apply" means. E.g., would one expect a wrapper function like benchmark(estimators, metrics, data, config) that returns results or dataframe?

TNTran92 commented 2 years ago

suggestion: let´s be more specific about what "apply" means. E.g., would one expect a wrapper function like benchmark(estimators, metrics, data, config) that returns results or dataframe?

I've modified the title to be more specific. Will keep adding comments throughout the day

TNTran92 commented 2 years ago

suggestion: let´s be more specific about what "apply" means. E.g., would one expect a wrapper function like benchmark(estimators, metrics, data, config) that returns results or dataframe?

From vanilla kotsu, users should expect 1/ model.registration(estimator1, estimator2, etc) 2/ Validation.registration(dataset, cv_settings, metrics) 3/ a .csv in the same folder as notebook showing results

TNTran92 commented 2 years ago

Note to self Tested all estimators for point forecasting. 3 estimators don't work. Those are

[ ] AutoARIMA
[ ] Prophet
[ ] VAR

DBCerigo commented 2 years ago

Note to self Tested all estimators for point forecasting. 3 estimators don't work. Those are
* [ ]   AutoARIMA

* [ ]  Prophet

* [ ]  VAR

I can help look into what's causing the issues for those estimates if help wanted :+1:

fkiraly commented 2 years ago

I've looked into kotsu and the draft PR https://github.com/alan-turing-institute/sktime/pull/2977 .

I think overall it is great, in simplifying functionality!

One design point that bothers me, though, is the kotsu design which separates so-called "entry points" (factories or generators) from their kwargs, which in the case of sktime would mean separating classes from their hyper-parameter settings (objects cannot be passed, since the factories are called with the kwargs, and objects in general do not have a __call__ defined).

That is design inconsistent with sktime (and sklearn) where objects are used as "blueprints", especially composites. Classes, in isolation, cannot specify runtime composites.

Do I understand this correctly?

So, we may need to think how we can best deal with objects being used as "blueprints", along the lines of the sklearn design.

Of course you could hack it, by splitting the object into its class , used as entry point, and extracting kwargs by get_params(deep=False), but that seems like an interface patch rather than a principled approach.

Another way might be to invoke kotsu - or, rewrite kotsu, if not currently possible - that objects can be passed directly as entry points, i.e., the result of a current entry_point(**kwargs) rather than separately passing entry_point and kwargs.

DBCerigo commented 2 years ago

Hey @fkiraly (and @TNTran92 feel free to jump in also) - we've been considering the comment above, as well as the discussions we had at the dev days regarding this point.

We see there being a strong reason not to pass or use a single object/instance within benchmarking, because estimator objects can and do often carry a lot of state - one of kotsu's goals with to help users avoid common pitfalls in benchmarking, one of those common pitfalls being information leakage between validation runs. As such pitfalls cause a critical failure to any fair benchmarking attempt and are hard to identify, kotsu tries to do what it can to avoid such occurrences, hence the interface demands a constructor/entry point which returns a fresh instance of an estimator.

Our suggested solution, which we think gets the best of both scenarios is to pass estimator's clone functions as the entry point. This will(/should) ensure that every time the estimator is used in validation, it is a fresh instance, while still allowing users of the benchmarking module to register an estimator instance in a convenient way. We would still advise users to write factories for the models once they know they want that model to be a benchmarking example/candidate. Usage of the kwargs arg is optional - we often wrap are best competing model into their own factory function which includes their parameters hardcoded within the function, that way when we want to use it in production we can't get any of the winning model's config incorrect.

I think this might indicate a possibly deeper difference in the way we are thinking about benchmarking - do correct me if I'm mistaken in any/all of this - namely that we're thinking about benchmarking as a repeatable set of experiments and their records, and on the other side there's thinking about benchmarking as an interactive explorative process (i.e. in an interactive python/jupyter session). This could be exemplified by a) the request to have benchmarking results returned directly as an in memory dataframe vs persisting the results to disk, and b) wanting to pass estimator instances possibly created within an interactive session vs wanting to pass estimator factories/constructors/entry_points that are written and organised into modules. Does any of this sound likely/reasonable?

There is another consideration against registering estimator instances (vs a factory or constructor for that estimator), which is likely the reason gym.env wrote the interface in this way originally, which is that with increasing in size estimators and increasing in number estimators one wants to benchmark one has an increasing amount of objects hanging around in memory. This is likely fine for more classic estimators, but for instance the recent push towards implementing deep learning models in sktime - I would be concerned about registering many differing instances of such models given the memory requirements of such large models. We're using kotsu for benchmarking models in the bioinformatics space and we couldn't/wouldn't want to load multiple versions of such models at once on our compute cluster.

DBCerigo commented 2 years ago

Possible next steps (if #2977 merged):

implement storing estimator predictions
implement computing metrics and cross-estimator analyses based on stored predictions
implement other types of benchmarks:
- Classification
- others?
M4
Bakeoff
Benchmarking docs need updating for this new functionality?

@TNTran92 any of these you want to go for? Classification benchmark implementation maybe?

fkiraly commented 2 years ago

@DBCerigo, ok, let me try to answer to your long comment. Ahead of everything, I want to say that things are quite subtle, and I think at the same time that:

benchmarking should be repeatable, reproducible (i.e., there is no "deeper difference in how we think about benchmarking"!)
we should be able to pass initialized objects, not classes
there is no contradiction between the two (I feel you think there is a contradiction here)

The subtlety comes from separating concepts and how they are linked to the common sklearn-like designs. We should separate three concepts and their python counterparts, as they commonly appear in those designs:

(a) an abstract specification/blueprint, of an algorithm/estimator, and named parameters that can be set (without values set) (b) an abstract specification/blueprint, of an algorithm/estimator as in (a), plus specific values chosen for named parameters (c) a concrete algorithm/estimator which has ingested data and has an internal model state, after specification as in (b)

If I may try to rephrase your concern: you are saying that (c) is not good to have as input to benchmarking, since it can carry state and leakage. I fully agree to that. You further conclude that, therefore, we shouhld not have objects as inputs. I disagree with that. Reason being how (a) to (c) map onto python objects, as follows:

(a) = sklearn type class (b) = sklearn type object, post-init, with parameters set, but not in "fitted" state (c) = sklearn type object, in "fitted" state

So, the possibly confusing thing is that sklearn objects may represent a "pure specification" (b), but also a highly stateful object (c). What separates "specification" from "stateful" is not the class/object boundary, but the fitted/unfitted boundary.

Specifications are fine, and desired as input to benchmarking, as conceptually they are the medium of reproducibility and repelicability. So, (a) and (b) are fine and desired. (c) is not.

Coming back to the argumentation: I think you missed that, conceptually, the (b) type objects are "specification-like", and were thinking that all objects (as opposed to classes) were "non-specification-like".

fkiraly commented 2 years ago

So, as a design consequence, my preferred design has these properties:

it should be fine to pass (a) type objects or (b) type objects to benchmarking experiments.
if (a) is passed, it should be initialized with defaults to obtain an object of type (b)
if an object of type (c) is passed, it should be cloned, to obtain an object of type (b)

The type change diagram looks like this:

__init__: (a) -> (b), new object fit: (b) -> (c), same object reset: (c) -> (b), same object clone: (c) -> (b), new object type: (c), (b) -> (a), new object (unique class)

fkiraly commented 2 years ago

Re memory use: (b) type objects typically occupy minimal memory, since all they contain is the information of "which class" and "which parameter values". (c) type objects can have arbitrary size, and have the problem that you describe (but I don't want those, see above).

fkiraly commented 2 years ago

PS: since you are referencing gym.env: the "sklearn-like" counterpart for reinforcement learning, with the design as above, is stable-baselines3 which provides algorithms at a unified interface.

fkiraly commented 2 years ago

sktime examples for (a), (b), (c):

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster  # NaiveForecaster is type (a)

y = load_airline()
forecaster = NaiveForecaster(strategy="drift")  # forecaster object is type (b)
forecaster.fit(y)  # after fitting, it changes into type (c)

blueprint = forecaster.clone()  # after calling clone on an object, it produces a different object of type (b)
forecaster.reset()  # alternatively, call reset, which reverts forecaster itself to type (b)

TNTran92 commented 2 years ago

Possible next steps (if #2977 merged):
* implement storing estimator predictions

* implement computing metrics and cross-estimator analyses based on stored predictions

* implement other types of benchmarks:

  * Classification
  * others?

* M4

* Bakeoff

* Benchmarking docs need updating for this new functionality?
@TNTran92 any of these you want to go for? Classification benchmark implementation maybe?

Sure, I will take the classification one

DBCerigo commented 2 years ago

@fkiraly thanks for the precise and thorough putting across of your thoughts. They're very helpful.

We discussed and had a think through some solutions. The most compelling argument for us is that original thrust and attempt for the implementation which was "interface to use only sktime/sklearn objects and interfaces", and we think we've found a good solution which is inline with your preferences for the interface and achieves that original goal further.

I've implemented our proposed change: see commit [ENH] Make benchmarking interface take instances in #2977 The summary is:

changed estimator_entrypoint: Type[BaseEstimator] to estimator: BaseEstimator
removed estimator_kwargs entirely
we get and pass estimator.clone to kotsu as the entry point it will use

We believe this achieves the following:

achieves "interface to use only sktime/sklearn objects" significantly more successfully
simplifies and makes more intuitive the interface for users
as per your labelling above; enables users to register b) and c), doesn't allow a) to be registered
- note that disallowing a) was a conscious choice - we didn't think the extra functionality was worth the extra complexity/choice for the user or the extra code complexity, when an a) can become a b) with just ()
ensures and protects users from estimators used in benchmarking from leaking information or state between runs and invaliding the benchmarking, while still allowing an easy intuitive interface for them to use

We look forward to hearing what you think. Thanks again for pushing on it - we think it is much improved for it. So of course do push again if/as you think it has further to go.

[edit: change commit link to PR link with commit message, as more stable reference]

fkiraly commented 2 years ago

@DBCerigo, I looked at the changes - I'd be happy with this, nice!

Disallowing (a) is one of many sensible choices imo, because there might not be a clear way to get defaults for parameters that do not have them. create_test_instance is one, but that's not necessarily the "best default", only a test default, and might be confusing for the user.

Minor thing, I think there are multiple PR (three total?) related to kotsu now, would you mind going through them and cleaning up? Either closing, or chunking up, in any case just quickly looking at them whether they are still all needed and active?

I was looking at it and really got confused when trying to review, especially since your linked change above does not seem included yet.

DBCerigo commented 2 years ago

@DBCerigo, I looked at the chance - I'd be happy with this, nice!

Great!

Disallowing (a) is one of many sensible choices imo, because there might not be a clear way to get defaults for parameters that do not have them. create_test_instance is one, but that's not necessarily the "best default", only a test default, and might be confusing for the user.

Yep, great.

Minor thing, I think there are multiple PR (three total?) related to kotsu now, would you mind going through them and cleaning up? Either closing, or chunking up, in any case just quickly looking at them whether they are still all needed and active?

I've closed https://github.com/datavaluepeople/sktime/pull/1 now. Another PR is https://github.com/alan-turing-institute/sktime/pull/2805, which I can't close, but presuming @TNTran92 or you can if deemed to. The one deemed active from my point of view is https://github.com/alan-turing-institute/sktime/pull/2977

I was looking at it and really got confused when trying to review, especially since your linked change above does not seem included yet.

Soz for that! Was all my fault - I linked the wrong commit (classic add the link while drafting the message, then think oh I want to change some code, rebase, and then forget to update the link), sorry for adding to the confusion! I updated the message to hopefully avoid same for others.

TNTran92 commented 2 years ago

I have closed #2805

TNTran92 commented 2 years ago

Possible next steps (if #2977 merged):
* implement storing estimator predictions

* implement computing metrics and cross-estimator analyses based on stored predictions

* implement other types of benchmarks:

  * Classification
  * others?

* M4

* Bakeoff

* Benchmarking docs need updating for this new functionality?
@TNTran92 any of these you want to go for? Classification benchmark implementation maybe?

@DBCerigo I suggest adding M5 to the dataset. Since M4 was assigned to you(#2989), can I take M5?

DBCerigo commented 2 years ago

@TNTran92 sure thing. I haven't look properly as what adding M4/M5 would entail nor how I think it could/should interact with benchmarking but defs dive in if you have the time. I'll start looking into it after #2977 is merged and I've reassessed what I think is best to work on next.

sktime / sktime

[ENH] Using Kotsu as a benchmarking framework - Implementation in sktime #2804