fkiraly opened this issue 8 months ago
Thank you! Likewise, both `sktime` and `skpro` are amazing.
That's definitely something we need to explore. I'll start working on examples that use `sktime` estimators to see how they fit with `skfolio`. We may need to modify our approach to duck typing in order to improve our integration of other scikit-learn based packages.

I'll also need to better understand the subtleties between the benefits of using `scikit-base` vs `BaseEstimator`. If you have some docs on this topic, that would be very helpful; otherwise I'll deep dive into the `sktime` implementation.
> If you have some docs on this topic it would be very helpful otherwise I'll deep dive into `sktime` implementation.
Of course! We should probably get around to linking these from the repo at some point.
- Video: skbase intro at PyData Seattle 2023
- Tutorial repository: https://github.com/sktime/sktime-tutorial-pydata-seattle-2023
The key differences are:

- `scikit-base` is `sklearn` compatible.
- `scikit-base` depends on nothing! So you don't get the indirect dependencies of `sklearn` if you don't need them, e.g., if you really want only `BaseEstimator` (or `BaseEstimator` plus `fit` 😄). In `sktime`, it really helps minimize deps on the framework level! (Compare `sktime` core deps to the long list of soft deps used in testing.)
- Unlike `scikit-learn`, there is a stringent `set_tags`, `get_tags`, `set_config`, `get_config` interface, with inheritance by key.
- `get_fitted_params`, which works like `get_params` (with nesting), only for fitted params ending in underscore. This works for nested estimators too, even if they contain `sklearn` estimators!
- You can build your own `check_estimator` very easily; check out `TestAllForecasters` or `TestAllTransformers` in `sktime`, and `sktime`'s `check_estimator`.

Philosophically, it is designed to enable interoperability and, most importantly, composability, of a decentrally managed system of `scikit-learn`-like (and compatible) packages.
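To make the `get_fitted_params` idea concrete, here is a toy sketch. This is *not* `scikit-base`'s actual implementation, and `Inner`/`Outer` are made-up classes; the sketch only illustrates collecting underscore-suffixed fitted attributes with sklearn-style `__` nesting of keys:

```python
# Toy sketch of get_fitted_params-style collection; NOT scikit-base's code.
# Fitted params are public attributes ending in "_", collected recursively
# from nested estimators using sklearn's double-underscore key convention.

def collect_fitted_params(estimator, prefix=""):
    params = {}
    for name, value in vars(estimator).items():
        if name.startswith("_") or not name.endswith("_"):
            continue  # only public attributes ending in underscore count
        key = prefix + name
        params[key] = value
        # recurse into nested objects that carry fitted params themselves
        if hasattr(value, "__dict__") and any(
            k.endswith("_") and not k.startswith("_") for k in vars(value)
        ):
            params.update(
                collect_fitted_params(value, prefix=key.rstrip("_") + "__")
            )
    return params


class Inner:  # hypothetical fitted component
    def fit(self):
        self.mean_ = 1.5
        return self


class Outer:  # hypothetical composite estimator
    def fit(self):
        self.inner_ = Inner().fit()
        self.n_features_ = 3
        return self
```

With these toy classes, `collect_fitted_params(Outer().fit())` returns both the top-level `n_features_` and the nested `inner__mean_` key.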
> We may need to modify our approach to duck typing in order to improve our integration of other scikit-learn based packages.

Could you kindly explain that? What is your "approach to duck typing"?
Thank you for the resources!
Regarding our current design approach, I'll provide some examples and list the pros and cons I've identified.
Let's consider the following example, where we build a Minimum Variance model using a Denoising estimator for the covariance matrix:
```python
from skfolio.datasets import load_sp500_dataset
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior
from skfolio.moments import DenoiseCovariance
from skfolio.distance import PearsonDistance

prices = load_sp500_dataset()
X = prices_to_returns(prices)

model = MeanRisk(
    prior_estimator=EmpiricalPrior(covariance_estimator=DenoiseCovariance())
)
model.fit(X)
```
If we make a mistake and use the correlation estimator `PearsonDistance` instead of a covariance estimator, the following happens:

- The IDE and the static type checker both warn that a `BaseCovariance` was expected, but a `PearsonDistance` was provided instead.
- At runtime: `TypeError: Expected type <class 'skfolio.moments.covariance._base.BaseCovariance'>, got <class 'skfolio.distance._distance.PearsonDistance'>`

The first two happen because we implemented type hinting on the base covariance object: `covariance_estimator: BaseCovariance`. The last happens because we implemented a type check in `check_estimator` (using `isinstance` to also accept inherited objects).
This means that a user can still use a custom covariance estimator, but it will need to inherit from `BaseCovariance`. The main benefits are increased code explicitness and a reduced risk of late error discovery. The drawback is obviously less flexibility and the impossibility of using third-party estimators directly (the user will need to create a custom estimator inheriting from both the base class and the third-party estimator).
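To illustrate the trade-off, here is a hedged sketch contrasting the strict `isinstance` check with a duck-typed alternative that would accept third-party estimators. The class and function names are illustrative, not skfolio's actual code:

```python
# Hypothetical sketch of the two validation styles; not skfolio's code.

class BaseCovariance:
    """Stand-in for a library base class."""
    def fit(self, X, y=None):
        return self


def check_estimator_strict(estimator):
    # Nominal typing: only BaseCovariance subclasses pass, so third-party
    # estimators must inherit from the base class to be usable.
    if not isinstance(estimator, BaseCovariance):
        raise TypeError(
            f"Expected type {BaseCovariance}, got {type(estimator)}"
        )


def check_estimator_duck(estimator):
    # Structural (duck) typing: anything exposing a callable fit passes,
    # so third-party estimators work without inheriting from the base class.
    if not callable(getattr(estimator, "fit", None)):
        raise TypeError(f"{type(estimator)} does not implement fit")


class ThirdPartyCovariance:
    """A third-party estimator that does not inherit from BaseCovariance."""
    def fit(self, X, y=None):
        self.covariance_ = [[1.0]]
        return self
```

Here `check_estimator_strict(ThirdPartyCovariance())` raises a `TypeError`, while `check_estimator_duck` accepts it; the duck-typed check trades the early, explicit error for flexibility.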
Another example is the `clustering_estimator` parameter of `NestedClustersOptimization` here. The type check is on scikit-learn's `BaseEstimator`. This means that it's compatible with all `skfolio` and `sklearn` clustering estimators (because they all inherit from sklearn's `BaseEstimator`). However, it is not directly compatible with `sktime`, which inherits from `skbase`'s `BaseEstimator`.
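One possible middle ground (a sketch, not something skfolio has adopted) is a runtime-checkable `Protocol`: the check is structural, so it accepts estimators inheriting from either sklearn's or skbase's `BaseEstimator`, or from no base class at all. The names below are made up for illustration:

```python
# Sketch: structural typing via typing.Protocol, so the check passes for any
# object exposing the expected methods, whatever base class it inherits from.
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ClustererLike(Protocol):
    def fit(self, X: Any, y: Any = None) -> Any: ...
    def get_params(self, deep: bool = True) -> dict: ...


class NoSharedBaseClusterer:
    """Hypothetical clusterer with no shared base class."""

    def fit(self, X, y=None):
        self.labels_ = [0] * len(X)
        return self

    def get_params(self, deep=True):
        return {}
```

Note that `isinstance` against a runtime-checkable `Protocol` only verifies that the methods exist, not their signatures, so it is a looser guarantee than inheritance-based checks.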
Let's keep this discussion going to explore potential improvements.
> I'll provide some examples and list the pros and cons I've identified.
Apologies, I was expecting the pros/cons in a later post, but I now understand they are meant to be in the text above?
Some "major" thoughts about the design:

- In `sktime` we are also using inheritance from base classes, like `sklearn`. The reason for that is the boilerplate layer, which is provided by the base class.
- Type checking in `__init__` - agreed. Type checking, if done, is also done in the `__init__`. That is a departure from older `sklearn` designs, but newer `sklearn` is also moving in this direction.
- You may want to allow `None` or similar as a placeholder; this is not possible with the "check at construction" approach.
- On new types in `sktime` - this sounds like a partial misunderstanding. `sktime` does not introduce new python types for `sklearn` abstract types like clusterers etc.; it introduces new types for new abstract types, e.g., time series clusterers.
- If you use `sklearn` - tabular - estimators for time series tasks, I would argue that you are "abusing" the tabular interfaces for time series data. `sklearn` explicitly considers those tasks out of scope and is not expressive enough - the most important example for you, perhaps, being forecasting.

Upvote. Maybe I'm missing something, but what in the design prevents one from writing a CovarianceEstimator that (say) uses skpro for variance or std forecasts (directly, or via factors or resids or whatever)? I don't see any blocker.
Though my question to Franz in this context is whether skpro intends to support .partial_fit() for some methods.
> Though my question to Franz in this context is whether skpro intends to support .partial_fit() for some methods.
If you are considering time series streams where new data becomes available regularly, then you would use `update` of `sktime` proba forecasters for that - is that what you mean, @microprediction?

We had some discussions on whether `partial_fit` was the right name for the method, and we thought that it was conceptually different from the stream update of a model.
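A toy illustration of that conceptual difference (these classes are made up for exposition; they are not sktime or sklearn code): `partial_fit` incrementally refines a single estimate from minibatches of the same task, while a forecaster's `update` extends the known history past the cutoff, changing predictions because the data grew rather than because the model re-trained:

```python
# Toy illustration of partial_fit vs update semantics; not library code.

class RunningMean:
    """sklearn-style partial_fit: each minibatch refines one global estimate."""

    def __init__(self):
        self.n_ = 0
        self.mean_ = 0.0

    def partial_fit(self, batch):
        for x in batch:
            self.n_ += 1
            self.mean_ += (x - self.mean_) / self.n_  # incremental mean
        return self


class NaiveForecaster:
    """Forecaster-style update: new observations move the cutoff forward."""

    def fit(self, history):
        self.history_ = list(history)
        return self

    def update(self, new_observations):
        # extend the known history; no re-estimation happens here
        self.history_.extend(new_observations)
        return self

    def predict(self):
        return self.history_[-1]  # naive "last value" forecast
```

In the first case the estimate converges as batches arrive; in the second, `predict` simply tracks the latest observed value as the cutoff advances.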
@fkiraly thank you for the detailed design thoughts. As you mentioned, we may need to relax type checking when using polymorphic estimators. I'll work on concrete use cases to find the optimal trade-off for the library.

@microprediction, indeed, nothing stops us from creating a custom `CovarianceEstimator` that uses `sktime` or `skpro`. Below is a random example that uses a `sktime` GARCH forecast:
```python
import numpy as np
from sktime.forecasting.arch import StatsForecastGARCH

from skfolio.datasets import load_sp500_dataset
from skfolio.moments import BaseCovariance
from skfolio.optimization import MeanRisk
from skfolio.preprocessing import prices_to_returns
from skfolio.prior import EmpiricalPrior


class MyCustomCovariance(BaseCovariance):
    def __init__(self, p: int = 1, q: int = 1, nearest: bool = True):
        super().__init__(nearest=nearest)
        self.p = p
        self.q = q

    def fit(self, X, y=None):
        # sktime forecasters expect a PeriodIndex; avoid mutating the input
        X = X.copy()
        X.index = X.index.to_period(freq="d")
        forecaster = StatsForecastGARCH(p=self.p, q=self.q)
        forecaster.fit(X)
        # forecast the next 30 periods (relative horizon steps must be >= 1)
        pred = forecaster.predict(fh=np.arange(1, 31))
        covariance = np.cov(pred.T)
        self._set_covariance(covariance)
        return self


prices = load_sp500_dataset()
X = prices_to_returns(prices)

model = MeanRisk(
    prior_estimator=EmpiricalPrior(covariance_estimator=MyCustomCovariance())
)
model.fit(X)
print(model.weights_)
```
Very nice package, stringently designed!

I was wondering whether you were thinking about:

- `sktime` integration? `sktime` has native support for hierarchical data, pairwise distance transformers for time series, etc.
- `skpro`.
- `scikit-base`, which is similar to your "import `BaseEstimator`" approach but imo more architecturally stringent. Especially if you want unified interface tests that go beyond `sklearn`'s `check_estimator`.

Opening an issue since I'm not quite sure what the best way is to get in touch.