sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net

[ENH] The ability to see what data will be passed to the final estimator. #4808

Open davidgilbertson opened 1 year ago

davidgilbertson commented 1 year ago

Is your feature request related to a problem? Please describe. When new to sktime, I'm finding there's a lot of 'guess and check' to work out what effect a transformer will have (what it will do to the data, what it will name the output columns, whether my model will get a NumPy array or a DataFrame). Intercepting this currently means running under a debugger with a breakpoint in the estimator's fit method, which is cumbersome.

Describe the solution you'd like I've written my own helper that works for my narrow use case. It's a function that accepts a pipeline and some data, and a string telling it what to 'intercept' (for me, the final estimator). It will capture each call to .fit (so I can also check that it's being called for each instance as expected), and I can extract the X/y from any of those calls.

runs = forecaster_fit_dry_run(
    pipe,
    X=X_train,
    y=y_train,
    target="transform_y__forecast__estimator",
)
forecaster_X, forecaster_y = runs[0]
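The target string follows scikit-learn's nested get_params/set_params convention (component names joined by __); a quick sketch for listing which paths exist on a given pipeline:

# list the nested parameter paths; one of them is the final estimator
print(list(pipe.get_params().keys()))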

The implementation looks like this:

from sktime.forecasting.base import BaseForecaster


def forecaster_fit_dry_run(pipeline, X, y, target):
    # work on a clone so the caller's pipeline is left untouched
    pipeline = pipeline.clone()

    captured_args = []

    # Create a mock that will capture the args passed to fit(); overriding the
    # public fit deliberately bypasses the base class boilerplate, since we
    # only want to capture what the estimator would have received
    class MockForecaster(BaseForecaster):
        def fit(self, *args, **kwargs):
            # also record keyword arguments, in case the pipeline passes them by name
            captured_args.append((*args, *kwargs.values()))
            return self

    # swap the target component for the mock, then fit as usual
    pipeline.set_params(**{target: MockForecaster()})
    pipeline.fit(X=X, y=y, fh=[1])
    return captured_args

Describe alternatives you've considered A utility function is cool, but a pipeline.fit(..., dry_run=True) and pipeline.predict(..., dry_run=True) would be great, especially if there's a generic way to find the final estimator, and a return value that makes sense across use cases.

fkiraly commented 1 year ago

Quick side note - there already is a MockForecaster, it lives in sktime.utils.estimators - it is a luxury version of your MockForecaster (with logger etc).

Regarding your main question:

benHeid commented 1 year ago

Yes, it sounds a bit like the callback functionality I implemented in pyWATTS to log/visualise/save intermediate results.

It should not be a major problem to enable this in sktime as well. However, if we want to implement it, we should probably think about the corresponding API in more detail.

fkiraly commented 1 year ago

we should probably think about the corresponding API in more detail.

That's the idea :-)

benHeid commented 1 year ago

To briefly summarise the existing API in pyWATTS, and to hopefully start the discussion:

There is a base class called BaseCallback. This class has the method __call__, which receives the output data of the step to which it is attached. This method then performs the desired action: e.g., logging the data's meta-information, writing the intermediate data to a CSV file, npy file, ..., or visualising the data using line plots.

The basic structure of the base class would be something like:

import abc


class BaseCallback(abc.ABC):

    def __init__(self):
        # Do some initialisation stuff.
        ...

    @abc.abstractmethod
    def __call__(self, data):
        pass

    # Perhaps some additional utility functions! Like figuring out the
    # global path where the intermediate data should be stored, etc.

This abstract base class would then be specialised by concrete classes, e.g. CSVCallback, LoggerCallback, LineplotCallback, NPYCallback, etc., as sketched below.
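For illustration, a minimal sketch of what one such concrete class could look like (the name and the shape-logging behaviour are hypothetical, not existing pyWATTS or sktime API):

import logging

logger = logging.getLogger(__name__)


class LoggerCallback(BaseCallback):
    # log meta-information about the intermediate data of the attached step;
    # assumes the step output is a pandas DataFrame
    def __call__(self, data):
        logger.info("shape=%s, columns=%s", data.shape, list(data.columns))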

TODOs

fkiraly commented 1 year ago

Interesting! I think callbacks are also sth that users of deep learning methods will want; in that sense it could also solve one of the major shortcomings (imo) of a skorch-like design (a combination of pytorch and scikit-learn APIs).

Should we open a design issue on how this should look? I really like the callback/logger idea, and doing it step-wise in a pipeline makes a lot of sense!

Although we should also think about how we capture this in the reduction case, which is a composition that's not a pipeline (it is an sktime estimator that wraps an sklearn estimator in a way that's not expressible by the current graph pipeline design).

benHeid commented 1 year ago

I am not sure if we should discuss this here or create an additional issue. I am fine with both.

Can you explain the compositor problem in more detail? I am not sure I understand it correctly.

fkiraly commented 1 year ago

I am not sure if we should discuss this here or create an additional issue. I am fine with both.

I would suggest a separate issue on the design of callbacks, with a reference to this one?

I see these as separate problems:

Can you explain the compositor problem in more detail? I am not sure I understand it correctly.

The "reducer" is sth like ReducerForecaster(sklearn_regressor, more_params) - this does not fit the pipeline paradigm, in the sense that it isn't a pipeline chain step (cannot be expressed as `my_trafo * my_est or similar).

Of course you could see it as a "pipe" step in the sense of magrittr, where sklearn_regressor is "piped" into ReducerForecaster, but the sequentiality doesn't hold, as ReducerForecaster isn't a function but a class.
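For concreteness, this kind of composition in current sktime looks like the following (a minimal sketch using make_reduction; ReducerForecaster above is a placeholder name, and the parameter values here are illustrative):

from sklearn.ensemble import RandomForestRegressor
from sktime.forecasting.compose import make_reduction

# wrap an sklearn regressor as a forecaster - a composition, not a chain step
regressor = RandomForestRegressor()
forecaster = make_reduction(regressor, window_length=12, strategy="recursive")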

Does this help explain, or am I rambling?

benHeid commented 1 year ago

Ok let's create a new issue. You are right.

I believe I got the point about the reduction. I just confused myself again, since I sometimes still think in terms of the pyWATTS solution, which treats the reduction as a separate pipeline step called Sampler or Select.

fkiraly commented 1 year ago

would you like to create the new issue, @benHeid? I would guess you have the best idea of how this could look.

davidgilbertson commented 1 year ago

do you have any designs in mind for how "show me the columns/shape of the data passed" could look, in code?

I was simply thinking of a function that returned the data, rather than performing any predefined summary/reporting of the data. Is that what you meant?

The goal is fast iteration, so a solution that saved to file wouldn't be ideal, and the data can be huge, so a solution that printed to the console wouldn't be great. If I just get back the args of the method I specify (e.g. fit of a particular estimator), then I can see at a glance what they look like.

Also, this 'dry run' code is the sort of code a user would add for a bit while fiddling with the transforms, then delete. So a single statement would be ideal. Either of these, I think, would be fine:

captured_args = pipe.fit(..., dry_run=True)  # assumes final estimator
captured_args = pipe.capture(target="transform_y__forecast__estimator", method="fit")  # captures any estimator method call

These both assume that the method you want to intercept on the estimator is the same one you call on the pipeline - maybe that's a poor assumption.

fkiraly commented 1 year ago

Hm, @davidgilbertson, this should already be possible by

from sktime.utils.estimators import make_mock_estimator

est_with_logger = make_mock_estimator(est)
pipe = PipelineClass(stuff, stuff, est_with_logger, stuff)

pipe.fit(stuff)
pipe.get_fitted_params()["transform_y__forecast__estimator"].log
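More concretely, a minimal sketch (assuming make_mock_estimator takes the estimator class, and that the mocked instances record their method calls in a log attribute):

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster
from sktime.utils.estimators import make_mock_estimator

y = load_airline()

# make_mock_estimator wraps the class; instances record their method calls
MockedNaive = make_mock_estimator(NaiveForecaster)
forecaster = MockedNaive()
forecaster.fit(y, fh=[1, 2, 3])

# the recorded calls, including the data each method received
print(forecaster.log)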

I see that it might be tedious, though, to find the logger as a fitted attribute somewhere.

Test & feedback appreciated!