superduper-io / superduper

Superduper: Integrate AI models and machine learning workflows with your database to implement custom AI applications, without moving your data. Including streaming inference, scalable model hosting, training and vector search.
https://superduper.io
Apache License 2.0

Design discussion - estimator/learner interface #213

Closed - fkiraly closed this issue 1 year ago

fkiraly commented 1 year ago

Why

From @fkiraly's code review - the estimator logic seems too decoupled and spread all over the place.

E.g., an sklearn estimator currently sits in a Model child class (predict logic), with two further classes linked to TrainingConfiguration; application/connection is mediated by a fourth class. This is not good: the classes are tightly coupled while the logic is distributed everywhere.

How

I think we need to arrive at a better, more cohesive design using the strategy pattern. The current design does apply the strategy pattern, but the pieces are scattered all over the place.

What

We will discuss here until conclusion.

fkiraly commented 1 year ago

@fkiraly original comments from review:

fkiraly commented 1 year ago

@blythed reply:

So vis-à-vis your idea of adding training to all relevant estimators, we would need considerable flexibility.

So for instance for torch we could have something like this:

def fit(self, X: Union[np.ndarray, str], y: Union[np.ndarray, str], dataset: torch.utils.data.Dataset = None):
    ...  # if X is an array, then ignore or force `dataset` to be `None`
    ...  # else if X is a str, then y should be too, and these are used as keys into the dictionaries which come out of `dataset`

A model class could also have a @classmethod telling us how to obtain, from the database and query, the data “iterator” or whatever is necessary for lazy dataloading.
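
For illustration only, a minimal sketch of such a hook; the names dataset_from_query, db.ids and db.fetch are invented for this sketch and are not existing SuperDuperDB API:

import torch


class LazyDataMixin:
    # hypothetical mixin sketching a classmethod hook for lazy dataloading

    @classmethod
    def dataset_from_query(cls, db, query) -> torch.utils.data.Dataset:
        class _LazyDataset(torch.utils.data.Dataset):
            def __init__(self):
                # fetch only the row ids eagerly (db.ids is an assumed helper)
                self._ids = db.ids(query)

            def __len__(self):
                return len(self._ids)

            def __getitem__(self, i):
                # fetch one row on demand (db.fetch is an assumed helper)
                return db.fetch(self._ids[i])

        return _LazyDataset()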

Sometimes we may have 1, 2, 3, … arguments to fit, depending on the algorithm. In this case, the convention would be that these are the *args: def fit(self, X, Y, Z, ...):

The learning parameters may be added to the init or to the fit method as an additional parameter, as well as potential other classes which interface with the infrastructure (e.g. multi-GPU training setup etc.)

For predict we would like the opportunity for postprocessing - this is essential for many algorithms, e.g. neural translation, NLP, etc. (not really relevant for, e.g., time-series forecasting). Then we could have an optional method predict_with_postprocess.

With regard to your comments about predict vs. predict_one: in sklearn I believe this is handled in one function because the algorithms “know” the shape of the “rows” of input.

In PyTorch we could handle this with additional arguments; for example, if dataloader arguments are provided, then we know that we are dealing with batched input.

What we can’t do is infer just because we have a list that this is multiple datapoints, since sometimes a list might be a single datapoint.

In the current SuperDuperDB code, we actually always “know” which situation we are in. So we could pass batched=True or something like that.
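
A rough sketch of that convention; batched and _predict_single are illustrative names, not an agreed API:

def predict(self, X, batched: bool = False):
    # the caller states explicitly whether X is a batch or a single datapoint;
    # we never infer this from the type of X, since a list may be one datapoint
    if batched:
        return [self.predict(x, batched=False) for x in X]
    return self._predict_single(X)  # assumed single-datapoint routine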

My original misgiving was that we might have cases, for instance in representation learning, where we have multiple “encoders” (>=2) which might be learned together, but are applied in completely different ways at inference time.

Here the concern is: it is wasteful to load everything when running inference with just one of the models. We may like to consider these encoders as separate entities.

This can easily be solved with some kind of lazy loading - we don't necessarily need it in the first release; I am willing to dispense with it for now.

To enable the current features with “submodels” as inference routines we would need to make a few changes - essentially, allow users to load and use not only the model "my_model" but also "my_model.submodel".
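
A hedged sketch of what resolving such a dotted identifier could look like; load_model and the registry argument are invented for illustration:

def load_model(registry, identifier: str):
    # "my_model.submodel" -> load "my_model", then pull out its "submodel" part
    name, _, subname = identifier.partition(".")
    model = registry.load(name)  # `registry.load` is an assumed lookup, not existing API
    return getattr(model, subname) if subname else model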

I don’t see any of this as problematic, and I think we could get cracking very soon on this.

I can write up a more substantial document if you like, but I wanted to share my thoughts while they were fresh in my mind.

The approach for other frameworks - transformers, tensorflow, etc. could go along similar lines.

Another thing which we'll probably need to add is a trainer object as an optional input to the fit method. The reason for this is that many models can be trained with multiple different losses, and sometimes a model is pre-trained with one loss function and then fine-tuned with another.

def fit(self, X: Union[np.ndarray, str], y: Union[np.ndarray, str], trainer: Trainer): ...

The trainer could potentially be added in the init method
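
A hedged sketch of how that could look from the user's side; the Trainer constructor arguments, loss names, and data variables below are placeholders for illustration:

# pre-train with one loss, then fine-tune the same model with another
pretrainer = Trainer(loss=contrastive_loss, epochs=10)   # illustrative arguments
finetuner = Trainer(loss=cross_entropy_loss, epochs=2)

model.fit(X_pretrain, y_pretrain, trainer=pretrainer)
model.fit(X_finetune, y_finetune, trainer=finetuner)

# the Trainer owns the optimizer state etc., so only the model weights need persisting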

What I would like to avoid is saving huge artifacts such as optimizer states together with the model.

fkiraly commented 1 year ago

Regarding the above, I don't fully grasp your logic, nor why the conclusions are necessary.

That is, to me it seems like you consider some unspecified "option 1" (or a family thereof) and the status quo as "option 2". You tick off your arguments against option 1 and then conclude that option 2 is what we must go with.

Option 1 is not spelled out anywhere, and as far as I can infer it is possibly a strawman, or a concrete (bad) design that you've seen (perhaps naive attempts to move deep learning estimators into a scikit-learn design? admittedly most of these are not great).

Given this, it is hard to follow a series of arguments of the kind "if we didn't do X as it is currently done, we would have to do some unspecified Y, and that is bad because..."

Some concrete comments about this:

fkiraly commented 1 year ago

From discussion with @blythed - I apparently misunderstood what @blythed meant. Paraphrasing, he said he agrees with the basic need for a more cohesive design, but has concerns about how to deal with common deep learning patterns, e.g., "trainer"-style objects in PyTorch Lightning.

Summary of meeting below, and some designs.


meeting 2023-06-06 - estimator interface design discussion

attendees: DB, FK

model training parameters, state handling for neural networks

FK design suggestion

FK uses principles:

special requirements arise from:

untrained_model = Nnspec(layer1=foo, connections=bar)

untrained_model_w_pretraining = untrained_model * PretrainInstructions(stuff, loss)

optional: can move untrained_model_w_pretraining into a blueprint registry

pretrained_model = untrained_model_w_pretraining.fit(data_ref)

pretrained_model_w_more_training = pretrained_model * TrainInstructions(more_stuff)

optional: can move pretrained_model or pretrained_model_w_more_training to a pretrained-model registry (or get it from there)

application step - can be batch-wise

pretrained_model_w_more_training.fit(data_ref, maybe_label_ref)

or .fit(task) would also work

Questions:

postprocessing

(meeting ends here, FK will produce suggested design)

current design is like

my_estimator = EstClass(sthsth)

my_estimator.predict(data_or_task, postprocess=stuff)

in a composable design, this would instead be

my_estimator = EstClass(sthsth)

postproc = PostProc(stuff)

# sklearn-style
my_pipe = Pipeline([
    ("est", my_estimator),
    ("postproc", postproc),
])

# same, also sklearn
my_pipe = make_pipeline(my_estimator, postproc)

# or, with syntactic sugar
my_pipe = my_estimator * postproc

The composable design is more extensible - it does not require registering new "postproc" steps with the predict method.
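
For concreteness, a minimal sketch of how the * sugar could be wired up, using the sklearn-style Pipeline from the examples above; the mixin and its name are purely illustrative:

from sklearn.pipeline import Pipeline


class ComposableMixin:
    # illustrative: `est * postproc` builds a two-step pipeline
    def __mul__(self, other):
        return Pipeline([("est", self), ("postproc", other)])

An estimator inheriting from this mixin would then support my_pipe = my_estimator * postproc, equivalent to the explicit Pipeline construction above.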

blythed commented 1 year ago

In PyTorch or TensorFlow, it's important to understand which parts of the Pipeline are executed on the CPU and which on the GPU. In PADL (https://github.com/lf1-io/padl) we used a "trivial" transform which signified where to split the pipeline:

t = (
    step1
    >> step2
    >> batch
    >> Linear(10, 20)
    >> unbatch
    >> postprocess
)

The point is that the part before batch is "mapped" over the dataset in multiprocessing, the part between batch and unbatch is executed on the GPU, and the part after is mapped over the outputs on the CPU.

In the sklearn world this would translate to extending the sklearn.pipeline.Pipeline concept, to add steps which should be done in loading, forward pass, or postprocessing.

m = TorchPipeline(
    preprocessing_steps=[
        ("pre1", my_pre1),
        ("pre2", my_pre2),
    ],
    forward_steps=[
        ('linear', Linear(10, 20)),
    ],
    postprocess_steps=[
        ('post1', my_post)
    ]
)
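
As a sketch of how such a TorchPipeline might route a predict call through the three stages; the method, the _apply helper, and the batching details are assumptions, not existing SuperDuperDB API:

import torch


def predict(self, X, batch_size: int = 32):
    # 1. preprocessing: mapped over individual datapoints (CPU)
    rows = [self._apply(self.preprocessing_steps, x) for x in X]

    outputs = []
    for i in range(0, len(rows), batch_size):
        batch = torch.stack(rows[i:i + batch_size])
        # 2. forward steps: executed batch-wise (device handling omitted)
        for _, layer in self.forward_steps:
            batch = layer(batch)
        # 3. postprocessing: mapped over the unbatched outputs (CPU)
        outputs += [self._apply(self.postprocess_steps, row) for row in batch]
    return outputs


def _apply(self, steps, x):
    # helper: apply a list of (name, callable) steps in order
    for _, step in steps:
        x = step(x)
    return x
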
fkiraly commented 1 year ago

@blythed, thanks for the added insight!

In case you are trying to make a point or argument, would you mind spelling this out? I.e., is this just adding information, or is this a specific comment about a design? I agree with the TorchPipeline up to minimal interface changes.

That is, we can use padl as-is, modulo interface adaptation and homogenization. That's much better than stuffing the postprocess step into predict.

blythed commented 1 year ago

This is a suggestion of how we could extend or build a similar Pipeline which is suitable for torch models. We would attach to this fit, predict and predict_prob methods as a minimum, and then maybe also predict_with_postprocess and generate methods, depending on what is passed and the use case.
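
A bare skeleton of that method surface, with bodies elided; the exact names are as suggested above and nothing here is existing API:

class TorchPipeline:
    # minimal method surface (bodies elided)
    def fit(self, X, y=None, **kwargs): ...
    def predict(self, X, **kwargs): ...
    def predict_prob(self, X, **kwargs): ...
    def predict_with_postprocess(self, X, **kwargs): ...  # optional
    def generate(self, *args, **kwargs): ...              # optional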

I don't think we should use PADL - we don't have a guarantee about maintenance. But I think that we can use some of the ideas. In fact, many of the ideas are already in SuperDuperDB, especially with regard to torch.

I looked at some info on sklearn, and one thing that can be done is to separate predict from post-processing, by adding transforms "at the end" which don't get called by predict, because they don't have a predict method. Do you have a(n) (alternative) suggestion for how to leverage postprocess?

fkiraly commented 1 year ago

I see - great! This would in principle allow us not to have a postprocess.

Do you have a(n) (alternative) suggestion for how to leverage postprocess?

I would just not have that argument at all and attach postprocessing by composition, i.e., a minimal pipeline. That way, the specification remains extensible.

blythed commented 1 year ago

I see - great! This would in principle allow us not to have a postprocess.

Do you have a(n) (alternative) suggestion for how to leverage postprocess?

I would just not have that argument at all and attach postprocessing by composition, i.e., a minimal pipeline. That way, the specification remains extensible.

I agree that the Pipeline is a useful guide, but am not sure that the suggestion will work for PyTorch.

It sounds like, in sklearn, adding postprocessing is simply adding extra steps. That won't work in PyTorch, since postprocessing is something to be executed on the CPU. Can you look at the class https://github.com/SuperDuperDB/superduperdb-stealth/blob/d6e075ffb2d00f72a69ef01e578e55ec4a46e3c6/superduperdb/models/torch/wrapper.py#L9 and confirm whether you see how this bears on our discussion?

blythed commented 1 year ago

I have an idea - a proposal for how to obviate the need to specify the pre/forward/postprocessing stages explicitly. In PyTorch, the batched steps could be inferred by checking the types of the transforms: if a transform is a PyTorch layer, then it belongs in the forward pass. Everything after the last PyTorch layer is an unbatched/postprocessing step, and everything before the first PyTorch layer is a pre-processing step.

Is this clear? I feel it's possible that we're not on the same page as regards the meaning of "postprocessing". In my lexicon it means the steps carried out after the forward pass.

m = TorchPipeline(
    steps=[
        ("pre1", my_pre1),                       # <-- inferred pre-processing
        ("pre2", my_pre2),                       # <-- inferred pre-processing
        ("linear", torch.nn.Linear(10, 20)),     # <-- inferred forward pass
        ("post1", my_post),                      # <-- inferred post-processing
    ]
)

# compose with syntactic sugar
n = m * my_other_postproc

# equivalent to
n = TorchPipeline(
    steps=[
        ("pre1", my_pre1),                       # <-- inferred pre-processing
        ("pre2", my_pre2),                       # <-- inferred pre-processing
        ("linear", torch.nn.Linear(10, 20)),     # <-- inferred forward pass
        ("post1", my_post),                      # <-- inferred post-processing
        ("post2", my_other_postproc),            # <-- inferred post-processing
    ]
)
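
A minimal sketch of the proposed type-based inference, assuming steps is a list of (name, transform) pairs containing at least one torch.nn.Module; the helper name is illustrative:

import torch


def _split_steps(steps):
    # indices of steps that are PyTorch layers -> these form the forward pass
    torch_idx = [i for i, (_, s) in enumerate(steps) if isinstance(s, torch.nn.Module)]
    first, last = torch_idx[0], torch_idx[-1]
    preprocessing = steps[:first]        # everything before the first layer
    forward = steps[first:last + 1]      # the batched, forward-pass section
    postprocessing = steps[last + 1:]    # everything after the last layer
    return preprocessing, forward, postprocessing
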
fkiraly commented 1 year ago

Can you look at the class SuperDuperModule(torch.nn.Module, Model) and confirm whether you see how this bears on our discussion?

I don't see how or why this would disprove that we can add pre/postprocessing by composition instead of by argument. There you add it as an argument of the constructor or a method; mathematically/formally, the same thing could equally be expressed through composition syntax.

Is this clear? I feel it's possible that we're not on the same page as regards the meaning of "postprocessing". In my lexicon it means the steps carried out after the forward pass.

Yes! This is the same as in sktime where we infer what a transformation does based on its relative position to the forecaster.