superduper-io / superduper

Superduper: Integrate AI models and machine learning workflows with your database to implement custom AI applications, without moving your data. Including streaming inference, scalable model hosting, training and vector search.
https://superduper.io
Apache License 2.0

Design discussion - estimator/learner interface #213

Closed - fkiraly closed this issue 1 year ago

fkiraly commented 1 year ago

Why

From @fkiraly's code review - the estimator logic seems too decoupled and spread all over the place.

E.g., an sklearn estimator currently sits in a Model child class (predict logic), with two further classes linked to TrainingConfiguration; application/connection is mediated by a fourth class. This is not good: the classes are tightly coupled while the logic is distributed everywhere.

How

I think we need to arrive at a better, more cohesive design using the strategy pattern. The current design does apply the strategy pattern, but the pieces are scattered all over the place.

What

We will discuss here until conclusion.

fkiraly commented 1 year ago

@fkiraly original comments from review:

fkiraly commented 1 year ago

@blythed reply:

So vis-à-vis your idea of adding training to all relevant estimators, we would need considerable flexibility.

So for instance for torch we could have something like this:

def fit(self, X: Union[np.ndarray, str], y: Union[np.ndarray, str], dataset: torch.utils.data.Dataset = None):
    ...  # if X is an array, then ignore or force `dataset` to be `None`
    ...  # else if X is a str, then y should be too, and these are used as keys into the dictionaries which come out of `dataset`

A model class could also have a @classmethod telling us how to obtain, from the database and query, the data “iterator” or whatever is necessary for lazy dataloading.
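
For illustration only, a minimal sketch of such a hook; the names dataset_from_query, db.ids and db.fetch are invented for this sketch and are not existing SuperDuperDB API:

import torch


class LazyDataMixin:
    # hypothetical mixin sketching a classmethod hook for lazy dataloading

    @classmethod
    def dataset_from_query(cls, db, query) -> torch.utils.data.Dataset:
        class _LazyDataset(torch.utils.data.Dataset):
            def __init__(self):
                # fetch only the row ids eagerly (db.ids is an assumed helper)
                self._ids = db.ids(query)

            def __len__(self):
                return len(self._ids)

            def __getitem__(self, i):
                # fetch one row on demand (db.fetch is an assumed helper)
                return db.fetch(self._ids[i])

        return _LazyDataset()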

Sometimes we may have 1, 2, 3, … arguments to fit, depending on the algorithm. In this case, the convention would be that these are the *args: def fit(self, X, Y, Z, ...):

The learning parameters may be added to the init or to the fit method as an additional parameter, as well as potential other classes which interface with the infrastructure (e.g. multi-GPU training setup etc.)

For predict we would like the opportunity for postprocessing - this is essential for many algorithms, e.g. neural translation, NLP, etc. (not really relevant for, e.g., time-series forecasting). Then we could have an optional method predict_with_postprocess.

With regard to your comments about predict vs. predict_one: in sklearn I believe this is handled in one function because the algorithms “know” the shape of the “rows” of input.

In PyTorch we could handle this with additional arguments; for example, if dataloader arguments are provided, then we know that we are dealing with batched input.

What we can’t do is infer just because we have a list that this is multiple datapoints, since sometimes a list might be a single datapoint.

In the current SuperDuperDB code, we actually always “know” which situation we are in. So we could pass batched=True or something like that.
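
A rough sketch of that convention; batched and _predict_single are illustrative names, not an agreed API:

def predict(self, X, batched: bool = False):
    # the caller states explicitly whether X is a batch or a single datapoint;
    # we never infer this from the type of X, since a list may be one datapoint
    if batched:
        return [self.predict(x, batched=False) for x in X]
    return self._predict_single(X)  # assumed single-datapoint routine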

My original misgiving was that we might have cases, for instance in representation learning, where we have multiple “encoders” (>=2) which might be learned together, but are applied in completely different ways at inference time.

Here the concern is: it is wasteful to load everything when running inference with just one of the models. We may like to consider these encoders as separate entities.

This can easily be solved with some kind of lazy loading - we don't necessarily need it in the first release; I am willing to dispense with it for now.

To enable the current features with “submodels” as inference routines we would need to make a few changes - essentially, allow users to load and use not only the model "my_model" but also "my_model.submodel".
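
A hedged sketch of what resolving such a dotted identifier could look like; load_model and the registry argument are invented for illustration:

def load_model(registry, identifier: str):
    # "my_model.submodel" -> load "my_model", then pull out its "submodel" part
    name, _, subname = identifier.partition(".")
    model = registry.load(name)  # `registry.load` is an assumed lookup, not existing API
    return getattr(model, subname) if subname else model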

I don’t see any of this as problematic, and I think we could get cracking very soon on this.

I can write up a more substantial document if you like, but I wanted to share my thoughts while they were fresh in my mind.

The approach for other frameworks - transformers, tensorflow, etc. could go along similar lines.

Another thing which we'll probably need to add is a trainer object as an optional input to the fit method. The reason for this is that many models can be trained with multiple different losses, and sometimes a model is pre-trained with one loss function and then fine-tuned with another.

def fit(self, X: Union[np.ndarray, str], y: Union[np.ndarray, str], trainer: Trainer): ...

The trainer could potentially be added in the init method
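
A hedged sketch of how that could look from the user's side; the Trainer constructor arguments, loss names, and data variables below are placeholders for illustration:

# pre-train with one loss, then fine-tune the same model with another
pretrainer = Trainer(loss=contrastive_loss, epochs=10)   # illustrative arguments
finetuner = Trainer(loss=cross_entropy_loss, epochs=2)

model.fit(X_pretrain, y_pretrain, trainer=pretrainer)
model.fit(X_finetune, y_finetune, trainer=finetuner)

# the Trainer owns the optimizer state etc., so only the model weights need persisting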

What I would like to avoid is saving huge artifacts such as optimizer states together with the model.

fkiraly commented 1 year ago

Regarding the above, I don't fully grasp your logic, nor why the conclusions are necessary.

That is, to me it seems like you consider some unspecified "option 1" (or a family thereof) and the status quo as "option 2". You tick off your arguments against option 1 and then conclude that option 2 is what we must go with.

Option 1 is not spelled out anywhere, and as far as I can infer it is possibly a strawman, or a concrete (bad) design that you've seen (perhaps naive attempts to move deep learning estimators into a scikit-learn design? admittedly most of these are not great).

Given this, it is hard to follow a series of arguments of the kind "if we didn't do X as it is currently done, we would have to do some unspecified Y, and that is bad because..."

Some concrete comments about this:

fkiraly commented 1 year ago

From discussion with @blythed - I apparently misunderstood what @blythed meant. Paraphrasing, he said he agrees with the basic need for a more cohesive design, but has concerns about how to deal with common deep learning patterns, e.g., "trainer"-style objects in PyTorch Lightning.

Summary of meeting below, and some designs.


meeting 2023-06-06 - estimator interface design discussion

attendees: DB, FK

model training parameters, state handling for neural networks

FK design suggestion

FK uses principles:

special requirements arise from:

untrained_model = Nnspec(layer1=foo, connections=bar)

untrained_model_w_pretraining = untrained_model * PretrainInstructions(stuff, loss)

optional: can move untrained_model_w_pretraining into a blueprint registry

pretrained_model = untrained_model_w_pretraining.fit(data_ref)

pretrained_model_w_more_training = pretrained_model * TrainInstructions(more_stuff)

optional: can move pretrained_model or pretrained_model_w_more_training to a pretrained-model registry (or get it from there)

application step - can be batch-wise

pretrained_model_w_more_training.fit(data_ref, maybe_label_ref)

or .fit(task) would also work

Questions:

postprocessing

(meeting ends here, FK will produce suggested design)

current design is like

my_estimator = EstClass(sthsth)

my_estimator.predict(data_or_task, postprocess=stuff)

in a composable design, this would instead be

my_estimator = EstClass(sthsth)

postproc = PostProc(stuff)

# sklearn-style
my_pipe = Pipeline([
    ("est", my_estimator),
    ("postproc", postproc),
])

# same, also sklearn
my_pipe = make_pipeline(my_estimator, postproc)

# or, with syntactic sugar
my_pipe = my_estimator * postproc

The composable design is more extensible - it does not require registering new "postproc" steps with the predict method.
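
For concreteness, a minimal sketch of how the * sugar could be wired up, using the sklearn-style Pipeline from the examples above; the mixin and its name are purely illustrative:

from sklearn.pipeline import Pipeline


class ComposableMixin:
    # illustrative: `est * postproc` builds a two-step pipeline
    def __mul__(self, other):
        return Pipeline([("est", self), ("postproc", other)])

An estimator inheriting from this mixin would then support my_pipe = my_estimator * postproc, equivalent to the explicit Pipeline construction above.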

blythed commented 1 year ago

In PyTorch or TensorFlow, it's important to understand which parts of the Pipeline are executed on the CPU and which on the GPU. In PADL (https://github.com/lf1-io/padl) we used a "trivial" transform which signified where to split the pipeline:

t = (
    step1
    >> step2
    >> batch
    >> Linear(10, 20)
    >> unbatch
    >> postprocess
)

The point is that the part before batch is "mapped" over the dataset in multiprocessing, the part between batch and unbatch is executed on the GPU, and the part after is mapped over the outputs on the CPU.

In the sklearn world this would translate to extending the sklearn.pipeline.Pipeline concept, to add steps which should be done in loading, forward pass, or postprocessing.

m = TorchPipeline(
    preprocessing_steps=[
        ("pre1", my_pre1),
        ("pre2", my_pre2),
    ],
    forward_steps=[
        ('linear', Linear(10, 20)),
    ],
    postprocess_steps=[
        ('post1', my_post)
    ]
)
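
As a sketch of how such a TorchPipeline might route a predict call through the three stages; the method, the _apply helper, and the batching details are assumptions, not existing SuperDuperDB API:

import torch


def predict(self, X, batch_size: int = 32):
    # 1. preprocessing: mapped over individual datapoints (CPU)
    rows = [self._apply(self.preprocessing_steps, x) for x in X]

    outputs = []
    for i in range(0, len(rows), batch_size):
        batch = torch.stack(rows[i:i + batch_size])
        # 2. forward steps: executed batch-wise (device handling omitted)
        for _, layer in self.forward_steps:
            batch = layer(batch)
        # 3. postprocessing: mapped over the unbatched outputs (CPU)
        outputs += [self._apply(self.postprocess_steps, row) for row in batch]
    return outputs


def _apply(self, steps, x):
    # helper: apply a list of (name, callable) steps in order
    for _, step in steps:
        x = step(x)
    return x
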
fkiraly commented 1 year ago

@blythed, thanks for the added insight!

In case you are trying to make a point or argument, would you mind spelling this out? I.e., is this just adding information, or is this a specific comment about a design? I agree with the TorchPipeline up to minimal interface changes.

That is, we can use padl as-is, modulo interface adaptation and homogenization. That's much better than stuffing the postprocess step into predict.

blythed commented 1 year ago

This is a suggestion of how we could extend or build a similar Pipeline which is suitable for torch models. We would attach to this fit, predict and predict_prob methods as a minimum, and then maybe also predict_with_postprocess and generate methods, depending on what is passed and the use case.
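
A bare skeleton of that method surface, with bodies elided; the exact names are as suggested above and nothing here is existing API:

class TorchPipeline:
    # minimal method surface (bodies elided)
    def fit(self, X, y=None, **kwargs): ...
    def predict(self, X, **kwargs): ...
    def predict_prob(self, X, **kwargs): ...
    def predict_with_postprocess(self, X, **kwargs): ...  # optional
    def generate(self, *args, **kwargs): ...              # optional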

I don't think we should use PADL - we don't have a guarantee about maintenance. But I think that we can use some of the ideas. In fact, many of the ideas are already in SuperDuperDB, especially with regard to torch.

I looked at some info on sklearn, and one thing that can be done is to separate predict from post-processing, by adding transforms "at the end" which don't get called by predict, because they don't have a predict method. Do you have a(n) (alternative) suggestion for how to leverage postprocess?

fkiraly commented 1 year ago

I see - great! This would in principle allow us not to have a postprocess.

Do you have a(n) (alternative) suggestion for how to leverage postprocess?

I would just not have that argument at all and attach postprocessing by composition, i.e., a minimal pipeline. That way, the specification remains extensible.

blythed commented 1 year ago

I see - great! This would in principle allow us not to have a postprocess.

Do you have a(n) (alternative) suggestion for how to leverage postprocess?

I would just not have that argument at all and attach postprocessing by composition, i.e., a minimal pipeline. That way, the specification remains extensible.

I agree that the Pipeline is a useful guide, but am not sure that the suggestion will work for PyTorch.

It sounds like, in sklearn, adding postprocessing is simply adding extra steps. That won't work in PyTorch, since postprocessing is something to be executed on the CPU. Can you look at the class https://github.com/SuperDuperDB/superduperdb-stealth/blob/d6e075ffb2d00f72a69ef01e578e55ec4a46e3c6/superduperdb/models/torch/wrapper.py#L9 and confirm whether you see how this bears on our discussion?

blythed commented 1 year ago

I have an idea - a proposal for how to obviate the need to specify the pre/forward/postprocessing stages explicitly. In PyTorch, the batched steps could be inferred by checking the types of the transforms: if a transform is a PyTorch layer, then it belongs in the forward pass. Everything after the last PyTorch layer is an unbatched/postprocessing step, and everything before the first PyTorch layer is a pre-processing step.

Is this clear? I feel it's possible that we're not on the same page as regards the meaning of "postprocessing". In my lexicon it means the steps carried out after the forward pass.

m = TorchPipeline(
    steps=[
        ("pre1", my_pre1),                       # <-- inferred pre-processing
        ("pre2", my_pre2),                       # <-- inferred pre-processing
        ("linear", torch.nn.Linear(10, 20)),     # <-- inferred forward pass
        ("post1", my_post),                      # <-- inferred post-processing
    ]
)

# compose with syntactic sugar
n = m * my_other_postproc

# equivalent to
n = TorchPipeline(
    steps=[
        ("pre1", my_pre1),                       # <-- inferred pre-processing
        ("pre2", my_pre2),                       # <-- inferred pre-processing
        ("linear", torch.nn.Linear(10, 20)),     # <-- inferred forward pass
        ("post1", my_post),                      # <-- inferred post-processing
        ("post2", my_other_postproc),            # <-- inferred post-processing
    ]
)
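
A minimal sketch of the proposed type-based inference, assuming steps is a list of (name, transform) pairs containing at least one torch.nn.Module; the helper name is illustrative:

import torch


def _split_steps(steps):
    # indices of steps that are PyTorch layers -> these form the forward pass
    torch_idx = [i for i, (_, s) in enumerate(steps) if isinstance(s, torch.nn.Module)]
    first, last = torch_idx[0], torch_idx[-1]
    preprocessing = steps[:first]        # everything before the first layer
    forward = steps[first:last + 1]      # the batched, forward-pass section
    postprocessing = steps[last + 1:]    # everything after the last layer
    return preprocessing, forward, postprocessing
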
fkiraly commented 1 year ago

Can you look at the class SuperDuperModule(torch.nn.Module, Model) and confirm whether you see how this bears on our discussion?

I don't see how or why this would disprove that we can add pre/postprocessing by composition instead of by argument. There you add it as an argument of the constructor or a method; mathematically/formally, the same thing could equally be expressed through composition syntax.

Is this clear? I feel it's possible that we're not on the same page as regards the meaning of "postprocessing". In my lexicon it means the steps carried out after the forward pass.

Yes! This is the same as in sktime where we infer what a transformation does based on its relative position to the forecaster.