skfolio/skfolio

Python library for portfolio optimization built on top of scikit-learn
https://skfolio.org
BSD 3-Clause "New" or "Revised" License

[ENH] sktime integration? #22

Open fkiraly opened 8 months ago

fkiraly commented 8 months ago

Very nice package, stringently designed!

I was wondering whether you were thinking about sktime integration?

Opening an issue since I'm not quite sure what the best way is to get in touch.

HugoDelatte commented 8 months ago

Thank you! Likewise, both sktime and skpro are amazing.

That's definitely something we need to explore. I'll start working on examples that use sktime estimators to see how it fits with skfolio. We may need to modify our approach to duck typing in order to improve our integration of other scikit-learn based packages.

I'll also need to better understand the subtle trade-offs between using scikit-base and scikit-learn's BaseEstimator. If you have some docs on this topic, that would be very helpful; otherwise I'll deep dive into the sktime implementation.

fkiraly commented 8 months ago

> If you have some docs on this topic, that would be very helpful; otherwise I'll deep dive into the sktime implementation.

Of course! We should probably get around to linking these from the repo sometime.

Video: skbase intro at PyData Seattle 2023

Repository for the tutorial: https://github.com/sktime/sktime-tutorial-pydata-seattle-2023

The key differences are:

Philosophically, it is designed to enable interoperability and, most importantly, composability, of a decentrally managed system of scikit-learn-like (and compatible) packages.

fkiraly commented 8 months ago

> We may need to modify our approach to duck typing in order to improve our integration of other scikit-learn based packages.

Could you kindly explain that? What is your "approach to duck typing"?

HugoDelatte commented 8 months ago

Thank you for the resources!

Regarding our current design approach, I'll provide some examples and list the pros and cons I've identified.

Let's consider the following example, where we build a Minimum Variance model using a Denoising estimator for the covariance matrix:

from skfolio.datasets import load_sp500_dataset
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior
from skfolio.moments import DenoiseCovariance
from skfolio.distance import PearsonDistance

prices = load_sp500_dataset()
X = prices_to_returns(prices)

model = MeanRisk(
    prior_estimator=EmpiricalPrior(covariance_estimator=DenoiseCovariance())
)
model.fit(X)

If we make a mistake and use the correlation estimator PearsonDistance instead of a covariance estimator, the following happens:

- The IDE immediately flags the type error.
- A static type checker such as mypy raises an error.
- An error is raised at runtime when fit is called.

The first two happen because we implemented type hinting on the base covariance object: covariance_estimator: BaseCovariance. The last happens because we implemented a type check in check_estimator (using isinstance, so inherited classes are also accepted).
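Concretely, the mistaken usage would look like this (a short sketch reusing the imports and X from the example above; where exactly the runtime error surfaces follows the description above rather than a tested trace):

# PearsonDistance is a distance estimator, not a covariance estimator:
model = MeanRisk(
    prior_estimator=EmpiricalPrior(covariance_estimator=PearsonDistance())
)
model.fit(X)  # rejected by the isinstance check in check_estimator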

This means that a user can still use a custom covariance estimator, but it will need to inherit from BaseCovariance. The main benefits are increased code explicitness and a reduced risk of late error discovery. The drawbacks are obviously less flexibility and the inability to use third-party estimators directly (the user needs to create a custom estimator that inherits from the base class and wraps or subclasses the third-party estimator), as sketched below.
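For instance, a third-party estimator can be made to pass that check with a thin wrapper (a minimal sketch: LedoitWolfCovariance is a hypothetical name, and the use of _set_covariance follows the custom-estimator example later in this thread):

from sklearn.covariance import LedoitWolf
from skfolio.moments import BaseCovariance

class LedoitWolfCovariance(BaseCovariance):
    """Hypothetical wrapper making sklearn's LedoitWolf usable as a
    covariance_estimator by inheriting from BaseCovariance."""

    def fit(self, X, y=None):
        # Fit the third-party estimator, then hand its covariance
        # matrix to skfolio through the base-class setter.
        lw = LedoitWolf().fit(X)
        self._set_covariance(lw.covariance_)
        return self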

Another example is the clustering_estimator parameter of NestedClustersOptimization here. The type check is against scikit-learn's BaseEstimator, which makes it compatible with all skfolio and sklearn clustering estimators (they all inherit from sklearn's BaseEstimator). However, it is not directly compatible with sktime, whose estimators inherit from skbase's BaseEstimator.
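For example, a sklearn clusterer can be plugged in directly (a small sketch; KMeans and its settings are arbitrary choices, and X is the returns DataFrame from the example above):

from sklearn.cluster import KMeans
from skfolio.optimization import NestedClustersOptimization

# KMeans inherits from sklearn's BaseEstimator, so it passes the check:
model = NestedClustersOptimization(
    clustering_estimator=KMeans(n_clusters=4)
)
model.fit(X)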

Let's keep this discussion going to explore potential improvements.

fkiraly commented 8 months ago

> I'll provide some examples and list the pros and cons I've identified.

Apologies, I was expecting the pros/cons to come in a later post, but I now understand they are meant to be in the text above?

"major" thoughts about the design:

microprediction commented 8 months ago

Upvote. Maybe I'm missing something, but what in the design prevents one from writing a CovarianceEstimator that (say) uses skpro for variance or std forecasts (directly, or via factors or residuals, or whatever)? I don't see any blocker.

Though my question to Franz in this context is whether skpro intends to support .partial_fit() for some methods.

fkiraly commented 7 months ago

> Though my question to Franz in this context is whether skpro intends to support .partial_fit() for some methods.

If you are considering time series streams where new data becomes available regularly, then you would use the update method of sktime's probabilistic forecasters for that - is that what you mean, @microprediction?

We had some discussions on whether partial_fit was the right name for the method, and we concluded that it is conceptually different from a streaming update of a model.
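For concreteness, the stream-update pattern looks like this (a minimal sketch with an arbitrary forecaster and dataset):

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()

# Fit on the history available so far, forecasting 3 steps ahead.
forecaster = NaiveForecaster(strategy="last")
forecaster.fit(y[:-12], fh=[1, 2, 3])
print(forecaster.predict())

# When new observations arrive, update the fitted model in place
# rather than refitting from scratch.
forecaster.update(y[-12:])
print(forecaster.predict())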

HugoDelatte commented 7 months ago

@fkiraly thank you for the detailed design thoughts. As you mentioned, we may need to relax type checking when using polymorphic estimators. I'll work on concrete use cases to find the optimal trade-off for the library.

@microprediction, indeed, nothing stops us from creating a custom CovarianceEstimator that uses sktime or skpro. Below is a quick example that uses a sktime GARCH forecaster:

import numpy as np
from sktime.forecasting.arch import StatsForecastGARCH
from skfolio.datasets import load_sp500_dataset
from skfolio.moments import BaseCovariance
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior

class MyCustomCovariance(BaseCovariance):
    def __init__(self, p: int = 1, q: int = 1, nearest: bool = True):
        super().__init__(nearest=nearest)
        self.p = p
        self.q = q

    def fit(self, X, y=None):
        # sktime expects a PeriodIndex for this daily data; work on a
        # copy to avoid mutating the caller's DataFrame.
        X = X.copy()
        X.index = X.index.to_period(freq="D")
        # Fit a GARCH(p, q) model per asset and forecast the next 30
        # daily returns (the forecasting horizon must start at 1).
        forecaster = StatsForecastGARCH(p=self.p, q=self.q)
        forecaster.fit(X)
        pred = forecaster.predict(fh=np.arange(1, 31))
        # Estimate the asset covariance matrix from the forecasts.
        covariance = np.cov(pred.T)
        self._set_covariance(covariance)
        return self

prices = load_sp500_dataset()
X = prices_to_returns(prices)

model = MeanRisk(
    prior_estimator=EmpiricalPrior(covariance_estimator=MyCustomCovariance())
)
model.fit(X)
print(model.weights_)