fkiraly opened this issue 8 months ago
Thank you! Likewise, both `sktime` and `skpro` are amazing.
That's definitely something we need to explore. I'll start working on examples that use `sktime` estimators to see how they fit with `skfolio`. We may need to modify our approach to duck typing in order to improve our integration of other scikit-learn based packages.

I'll also need to better understand the subtleties between the benefits of using `scikit-base` vs `BaseEstimator`. If you have some docs on this topic, that would be very helpful; otherwise I'll deep dive into the `sktime` implementation.
> If you have some docs on this topic it would be very helpful otherwise I'll deep dive into `sktime` implementation.
Of course! We should probably get around to linking these from the repo at some point.
- Video: skbase intro at PyData Seattle 2023
- Tutorial repository: https://github.com/sktime/sktime-tutorial-pydata-seattle-2023
The key differences are:

- `scikit-base` is `sklearn` compatible.
- `scikit-base` depends on nothing! So you don't get the indirect dependencies of `sklearn` if you don't need them, e.g., if you really want only `BaseEstimator` (or `BaseEstimator` plus `fit` 😄). In `sktime`, it really helps minimize deps on the framework level! (Compare `sktime` core deps to the long list of soft deps used in testing.)
- Unlike `scikit-learn`, there is a stringent `set_tags`, `get_tags`, `set_config`, `get_config` interface, with inheritance by key.
- `get_fitted_params`, which works like `get_params` (with nesting), only for fitted params ending in underscore. This works for nested estimators too, even if they contain `sklearn` estimators!
- You can build your own `check_estimator` very easily; check out `TestAllForecasters` or `TestAllTransformers` in `sktime`, and `sktime`'s `check_estimator`.

Philosophically, it is designed to enable interoperability and, most importantly, composability, of a decentrally managed system of `scikit-learn`-like (and compatible) packages.
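To make the `get_fitted_params` idea concrete, here is a toy sketch. This is *not* `scikit-base`'s actual implementation, and `Inner`/`Outer` are made-up classes; the sketch only illustrates collecting underscore-suffixed fitted attributes with sklearn-style `__` nesting of keys:

```python
# Toy sketch of get_fitted_params-style collection; NOT scikit-base's code.
# Fitted params are public attributes ending in "_", collected recursively
# from nested estimators using sklearn's double-underscore key convention.

def collect_fitted_params(estimator, prefix=""):
    params = {}
    for name, value in vars(estimator).items():
        if name.startswith("_") or not name.endswith("_"):
            continue  # only public attributes ending in underscore count
        key = prefix + name
        params[key] = value
        # recurse into nested objects that carry fitted params themselves
        if hasattr(value, "__dict__") and any(
            k.endswith("_") and not k.startswith("_") for k in vars(value)
        ):
            params.update(
                collect_fitted_params(value, prefix=key.rstrip("_") + "__")
            )
    return params


class Inner:  # hypothetical fitted component
    def fit(self):
        self.mean_ = 1.5
        return self


class Outer:  # hypothetical composite estimator
    def fit(self):
        self.inner_ = Inner().fit()
        self.n_features_ = 3
        return self
```

With these toy classes, `collect_fitted_params(Outer().fit())` returns both the top-level `n_features_` and the nested `inner__mean_` key.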
> We may need to modify our approach to duck typing in order to improve our integration of other scikit-learn based packages.

Could you kindly explain that? What is your "approach to duck typing"?
Thank you for the resources!
Regarding our current design approach, I'll provide some examples and list the pros and cons I've identified.
Let's consider the following example, where we build a Minimum Variance model using a Denoising estimator for the covariance matrix:
```python
from skfolio.datasets import load_sp500_dataset
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior
from skfolio.moments import DenoiseCovariance
from skfolio.distance import PearsonDistance

prices = load_sp500_dataset()
X = prices_to_returns(prices)

model = MeanRisk(
    prior_estimator=EmpiricalPrior(covariance_estimator=DenoiseCovariance())
)
model.fit(X)
```
If we make a mistake and use the correlation estimator `PearsonDistance` instead of a covariance estimator, the following happens:

- The IDE and the static type checker both warn that a `BaseCovariance` was expected, but a `PearsonDistance` was provided instead.
- At runtime: `TypeError: Expected type <class 'skfolio.moments.covariance._base.BaseCovariance'>, got <class 'skfolio.distance._distance.PearsonDistance'>`

The first two happen because we implemented type hinting on the base covariance object: `covariance_estimator: BaseCovariance`. The last happens because we implemented a type check in `check_estimator` (using `isinstance` to also accept inherited objects).
This means that a user can still use a custom covariance estimator, but it will need to inherit from `BaseCovariance`. The main benefits are increased code explicitness and a reduced risk of late error discovery. The drawback is obviously less flexibility and the impossibility of using third-party estimators directly (the user will need to create a custom estimator inheriting from both the base class and the third-party estimator).
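To illustrate the trade-off, here is a hedged sketch contrasting the strict `isinstance` check with a duck-typed alternative that would accept third-party estimators. The class and function names are illustrative, not skfolio's actual code:

```python
# Hypothetical sketch of the two validation styles; not skfolio's code.

class BaseCovariance:
    """Stand-in for a library base class."""
    def fit(self, X, y=None):
        return self


def check_estimator_strict(estimator):
    # Nominal typing: only BaseCovariance subclasses pass, so third-party
    # estimators must inherit from the base class to be usable.
    if not isinstance(estimator, BaseCovariance):
        raise TypeError(
            f"Expected type {BaseCovariance}, got {type(estimator)}"
        )


def check_estimator_duck(estimator):
    # Structural (duck) typing: anything exposing a callable fit passes,
    # so third-party estimators work without inheriting from the base class.
    if not callable(getattr(estimator, "fit", None)):
        raise TypeError(f"{type(estimator)} does not implement fit")


class ThirdPartyCovariance:
    """A third-party estimator that does not inherit from BaseCovariance."""
    def fit(self, X, y=None):
        self.covariance_ = [[1.0]]
        return self
```

Here `check_estimator_strict(ThirdPartyCovariance())` raises a `TypeError`, while `check_estimator_duck` accepts it; the duck-typed check trades the early, explicit error for flexibility.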
Another example is the `clustering_estimator` parameter of `NestedClustersOptimization` here. The type check is on scikit-learn's `BaseEstimator`. This means that it's compatible with all `skfolio` and `sklearn` clustering estimators (because they all inherit from sklearn's `BaseEstimator`). However, it is not directly compatible with `sktime`, which inherits from `skbase`'s `BaseEstimator`.
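One possible middle ground (a sketch, not something skfolio has adopted) is a runtime-checkable `Protocol`: the check is structural, so it accepts estimators inheriting from either sklearn's or skbase's `BaseEstimator`, or from no base class at all. The names below are made up for illustration:

```python
# Sketch: structural typing via typing.Protocol, so the check passes for any
# object exposing the expected methods, whatever base class it inherits from.
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ClustererLike(Protocol):
    def fit(self, X: Any, y: Any = None) -> Any: ...
    def get_params(self, deep: bool = True) -> dict: ...


class NoSharedBaseClusterer:
    """Hypothetical clusterer with no shared base class."""

    def fit(self, X, y=None):
        self.labels_ = [0] * len(X)
        return self

    def get_params(self, deep=True):
        return {}
```

Note that `isinstance` against a runtime-checkable `Protocol` only verifies that the methods exist, not their signatures, so it is a looser guarantee than inheritance-based checks.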
Let's keep this discussion going to explore potential improvements.
> I'll provide some examples and list the pros and cons I've identified.
Apologies, I was expecting the pros/cons in a later post, but I now understand they are meant to be in the text above?
Some "major" thoughts about the design:

- In `sktime` we are also using inheritance from base classes, like `sklearn`. The reason for that is the boilerplate layer, which is provided by the base class.
- Type checking in `__init__` - agreed. Type checking, if done, is also done in the `__init__`. That is a departure from older `sklearn` designs, but newer `sklearn` is also moving in this direction.
- You may want to allow `None` or similar as a placeholder; this is not possible with the "check at construction" approach.
- On new types in `sktime` - this sounds like a partial misunderstanding. `sktime` does not introduce new python types for `sklearn` abstract types like clusterers etc.; it introduces new types for new abstract types, e.g., time series clusterers.
- If you use `sklearn` - tabular - estimators for time series tasks, I would argue that you are "abusing" the tabular interfaces for time series data. `sklearn` explicitly considers those tasks out of scope and is not expressive enough - the most important example for you, perhaps, being forecasting.

Upvote. Maybe I'm missing something, but what in the design prevents one from writing a CovarianceEstimator that (say) uses skpro for variance or std forecasts (directly, or via factors or resids or whatever)? I don't see any blocker.
Though my question to Franz in this context is whether skpro intends to support .partial_fit() for some methods.
> Though my question to Franz in this context is whether skpro intends to support .partial_fit() for some methods.
If you are considering time series streams where new data becomes available regularly, then you would use `update` of `sktime` proba forecasters for that - is that what you mean, @microprediction?

We had some discussions on whether `partial_fit` was the right name for the method, and we thought that it was conceptually different from the stream update of a model.
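A toy illustration of that conceptual difference (these classes are made up for exposition; they are not sktime or sklearn code): `partial_fit` incrementally refines a single estimate from minibatches of the same task, while a forecaster's `update` extends the known history past the cutoff, changing predictions because the data grew rather than because the model re-trained:

```python
# Toy illustration of partial_fit vs update semantics; not library code.

class RunningMean:
    """sklearn-style partial_fit: each minibatch refines one global estimate."""

    def __init__(self):
        self.n_ = 0
        self.mean_ = 0.0

    def partial_fit(self, batch):
        for x in batch:
            self.n_ += 1
            self.mean_ += (x - self.mean_) / self.n_  # incremental mean
        return self


class NaiveForecaster:
    """Forecaster-style update: new observations move the cutoff forward."""

    def fit(self, history):
        self.history_ = list(history)
        return self

    def update(self, new_observations):
        # extend the known history; no re-estimation happens here
        self.history_.extend(new_observations)
        return self

    def predict(self):
        return self.history_[-1]  # naive "last value" forecast
```

In the first case the estimate converges as batches arrive; in the second, `predict` simply tracks the latest observed value as the cutoff advances.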
@fkiraly thank you for the detailed design thoughts. As you mentioned, we may need to relax type checking when using polymorphic estimators. I'll work on concrete use cases to find the optimal trade-off for the library.

@microprediction, indeed, nothing stops us from creating a custom `CovarianceEstimator` that uses `sktime` or `skpro`. Below is a random example that uses a `sktime` GARCH forecast:
```python
import numpy as np
from sktime.forecasting.arch import StatsForecastGARCH

from skfolio.datasets import load_sp500_dataset
from skfolio.moments import BaseCovariance
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior


class MyCustomCovariance(BaseCovariance):
    def __init__(self, p: int = 1, q: int = 1, nearest: bool = True):
        super().__init__(nearest=nearest)
        self.p = p
        self.q = q

    def fit(self, X, y=None):
        # sktime forecasters expect a PeriodIndex; avoid mutating the input
        X = X.copy()
        X.index = X.index.to_period(freq="d")
        forecaster = StatsForecastGARCH(p=self.p, q=self.q)
        forecaster.fit(X)
        # forecast the next 30 periods (relative horizon steps must be >= 1)
        pred = forecaster.predict(fh=np.arange(1, 31))
        covariance = np.cov(pred.T)
        self._set_covariance(covariance)
        return self


prices = load_sp500_dataset()
X = prices_to_returns(prices)

model = MeanRisk(
    prior_estimator=EmpiricalPrior(covariance_estimator=MyCustomCovariance())
)
model.fit(X)
print(model.weights_)
```
Very nice package, stringently designed!

I was wondering whether you were thinking about:

- `sktime` integration? `sktime` has native support for hierarchical data, pairwise distance transformers for time series, etc.
- `skpro`.
- `scikit-base`, which is similar to your "import `BaseEstimator`" approach but imo more architecturally stringent. Especially if you want unified interface tests that go beyond `sklearn`'s `check_estimator`.

Opening an issue since I'm not quite sure what the best way is to get in touch.