davidgilbertson opened this issue 1 year ago
Quick side note - there already is a `MockForecaster`, it lives in `sktime.utils.estimators` - it is a luxury version of your `MockForecaster` (with logger etc).
Regarding your main question:
Yes, it sounds a bit like the callback functionality that I implemented in pyWATTS to log/visualise/save intermediate results.
Should not be a major problem to enable this also in sktime. However, if we want to implement this, we should probably think about the corresponding API in more detail.
> we should probably think about the corresponding API in more detail.
That's the idea :-)
To briefly summarise the existing API in pyWATTS and to potentially start the discussion:
There is a base class called `BaseCallback`. This class has the method `__call__`, which receives the output data of the step to which it is attached. This method then performs the desired action, e.g. logging the data's meta information, writing the intermediate data to a CSV file, npy file, etc., or visualising the data using line plots.
The basic structure of the base class would be something like:
```python
import abc

class BaseCallback(abc.ABC):
    def __init__(self):
        # Do some initialisation stuff.
        ...

    @abc.abstractmethod
    def __call__(self, data):
        pass

    # Perhaps some additional utility functions, e.g. figuring out the
    # global path where the intermediate data should be stored.
```
This abstract base class might then be specialised by concrete classes, e.g. `CSVCallback`, `LoggerCallback`, `LineplotCallback`, `NPYCallback`, etc.
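To make the base-class idea concrete, here is a minimal sketch of one such subclass (hypothetical names and signatures; not the actual pyWATTS implementation):

```python
import abc
import logging

# BaseCallback as sketched above
class BaseCallback(abc.ABC):
    @abc.abstractmethod
    def __call__(self, data):
        """Receive the output data of the attached pipeline step."""

class LoggerCallback(BaseCallback):
    """Hypothetical concrete callback: logs meta information about the data."""

    def __init__(self, logger=None):
        self.logger = logger or logging.getLogger(__name__)

    def __call__(self, data):
        # Log type/shape meta information rather than the full data.
        self.logger.info(
            "step output: type=%s, shape=%s",
            type(data).__name__, getattr(data, "shape", None),
        )
```

A `CSVCallback` or `NPYCallback` would follow the same pattern, differing only in what `__call__` does with the data.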
> `get_params`, `set_params` calls?

Interesting! I think callbacks are also something that users of deep learning methods will want; in that sense it could also solve one of the major shortcomings (imo) of a skorch-like design (combination of pytorch and scikit-learn APIs).
Should we open a design issue for how this should look? I really like the callback/logger idea, and doing it step-wise in a pipeline makes a lot of sense!
Although we should also think about how we capture this in the reduction case, which is a composition that's not a pipeline (it is an sktime estimator that wraps an sklearn estimator in a way that's not expressible by the current graph pipeline design).
I am not sure if we should discuss this here or create an additional issue. I am fine with both.
Can you explain the compositor problem in more detail? I am not sure if I understand it correctly.
> I am not sure if we should discuss this here or create an additional issue. I am fine with both.
I would suggest a separate issue on the design of callbacks, with a reference to this one?
I see these as separate problems.
> Can you explain the compositor problem in detail. I am not sure if I understand this correctly.
The "reducer" is something like `ReducerForecaster(sklearn_regressor, more_params)` - this does not fit the pipeline paradigm, in the sense that it isn't a pipeline chain step (it cannot be expressed as `my_trafo * my_est` or similar).
Of course you could see it as a "pipe" step in the sense of magrittr, where `sklearn_regressor` is "piped" into `ReducerForecaster`, but the sequentiality doesn't hold, as `ReducerForecaster` isn't a function but a class.
Does this help explain, or am I rambling?
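To make the structural difference concrete, here is a toy sketch (hypothetical stand-in classes, not the actual sktime implementations): a pipeline chains steps so data flows through them sequentially, while a reducer takes the inner estimator as a constructor argument and uses it internally.

```python
class Pipeline:
    """Chain composition: data flows sequentially through the steps."""
    def __init__(self, *steps):
        self.steps = steps

    def fit_transform(self, x):
        for step in self.steps:
            x = step(x)
        return x


class ReducerForecaster:
    """Wrapping composition: the regressor is a constructor argument.

    Nothing is 'piped' from the regressor into this class at fit time;
    the wrapper uses the regressor internally, so it is not a chain step.
    """
    def __init__(self, regressor, window_length=3):
        self.regressor = regressor
        self.window_length = window_length
```

In the chain case each step is a callable applied to the previous step's output; in the wrapping case the composition happens at construction time, which is why `my_trafo * my_est` syntax cannot express it.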
Ok let's create a new issue. You are right.
I believe I got the point about the reduction. I just confused myself again, since I sometimes still think in terms of the pyWATTS solution, which treats the reduction as a separate pipeline step called `Sampler` or `Select`.
Would you like to create the new issue, @benHeid? I would guess you have the best idea of what this could look like.
Do you have any designs in mind for how "show me the columns/shape of the data passed" could look in code?
I was simply thinking of a function that returned the data, rather than performing any predefined summary/reporting of the data. Is that what you meant?
The goal is fast iteration, so a solution that saved to file wouldn't be ideal, and the data can be huge, so a solution that printed to the console wouldn't be great. If I just get back the args of the method I specify (e.g. `fit` of a particular estimator), then I can see at a glance what they look like.
Also, this 'dry run' code is the sort of code a user would add for a bit while fiddling with the transforms, then delete. So a single statement would be ideal. Either of these, I think, would be fine:
```python
captured_args = pipe.fit(..., dry_run=True)  # assumes final estimator
captured_args = pipe.capture(target="transform_y__forecast__estimator", method="fit")  # captures any estimator method call
```
These both assume that the method you want to intercept on the estimator is the same one you call on the pipeline, maybe that's a poor assumption.
Hm, @davidgilbertson, this should already be possible by:

```python
from sktime.utils.estimators import make_mock_estimator

est_with_logger = make_mock_estimator(est)
pipe = PipelineClass(stuff, stuff, est_with_logger, stuff)
pipe.fit(stuff)
pipe.get_fitted_params()["transform_y__forecast__estimator"].log
```
I see that it might be tedious to find the logger though as a fitted attribute somewhere.
Test & feedback appreciated!
**Is your feature request related to a problem? Please describe.**
When new to sktime, I'm finding there's a lot of 'guess and check' to find out what effect a transformer will have (what it will do to the data, what it will name the output columns, whether my model will get a NumPy array or a DataFrame). And intercepting this means running in debugger mode with a breakpoint in the estimator's `fit` method, which is cumbersome.

**Describe the solution you'd like**
I've written my own helper that works for my narrow use case. It's a function that accepts a pipeline and some data, and a string telling it what to 'intercept' (for me, the final estimator). It will capture each call to `.fit` (so I can also check that it's being called for each instance as expected), and I can extract the X/y from any of those calls. The implementation looks like this:
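A minimal sketch of what such a helper could look like (hypothetical names; not the author's actual code, which isn't reproduced above):

```python
import functools

def capture_calls(pipeline, attr_path, method="fit"):
    """Patch `method` on the estimator found at `attr_path` inside `pipeline`,
    recording the (args, kwargs) of every call. Returns the recording list."""
    # Walk e.g. "transform_y__forecast__estimator" down the attribute chain.
    target = pipeline
    for name in attr_path.split("__"):
        target = getattr(target, name)

    captured = []
    original = getattr(target, method)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        captured.append((args, kwargs))
        return original(*args, **kwargs)

    setattr(target, method, wrapper)
    return captured
```

After running the pipeline, the last entry of the returned list holds the exact args the intercepted method received, so you can inspect the X/y at a glance.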
**Describe alternatives you've considered**
A utility function is cool, but a `pipeline.fit(..., dry_run=True)` and `pipeline.predict(..., dry_run=True)` would be great, especially if there's a generic way to find the final estimator that works for all use cases, and a sensible return value that makes sense for all use cases.