h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

implement data transformation pipelines which get serialized to the POJOs #14841

exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We've had a bunch of calls for scikit-learn style data pipelines. It's time.

These pipelines have to be runnable in an MRTask across the training data when building a model, and also serializable to the scoring POJOs so that they run on a single row of raw input data. Pipelines will be first-class objects in the DKV, with their own REST API endpoint.

Currently we interpret Rapids ASTs as they come over the wire. The requirement that we be able to run the steps in an MRTask or in a POJO suggests that the right way to go is to "print" Rapids ASTs as a Java body which can be embedded into an MRTask or which can run standalone. The body might differ slightly between the two cases, but both variants should be generated by the same code and kept as similar as possible to minimize bugs due to differences.
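To make the idea concrete, here is a hand-written example of the kind of Java body such a "printer" might emit for a toy Rapids step (all names hypothetical; the real water.MRTask plumbing is only sketched in comments):

```java
// Illustrative only: the sort of Java body that might be "printed" from a
// Rapids AST such as (log (cols data 2)). The per-row core is shared between
// the standalone single-row path (POJO case) and an MRTask map() (training
// case); only the wrappers differ.
public class LogColumnStep {
    // Shared per-row body, generated once from the AST.
    public static double apply(double x) {
        return Math.log(x);
    }

    // POJO-style wrapper: transform one raw row at a time.
    public static double[] transformRow(double[] row) {
        double[] out = row.clone();
        out[2] = apply(row[2]);  // column index baked in at codegen time
        return out;
    }
}

// MRTask-style wrapper (sketched as a comment; signatures simplified):
// class LogColumnTask extends MRTask<LogColumnTask> {
//     @Override public void map(Chunk c) {
//         for (int i = 0; i < c._len; i++)
//             c.set(i, LogColumnStep.apply(c.atd(i)));
//     }
// }
```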

As a separate task we would then convert the Rapids endpoint to use this "compile an MRTask on the fly" technique, with caching and perhaps code templates for common cases for improved performance. This would ensure that the expressions evaluate identically in the pipeline and non-pipeline cases.
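A rough sketch of the caching idea, assuming compiled row transforms can be keyed by the normalized Rapids expression text (all names hypothetical, and the compile step is a stand-in):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical cache for "compiled" Rapids expressions: the expensive
// codegen/compile step runs once per distinct expression, and the compiled
// row transform is then reused by both pipeline and ad-hoc Rapids evaluation.
public class CompiledRapidsCache {
    private final Map<String, UnaryOperator<double[]>> cache = new ConcurrentHashMap<>();

    public UnaryOperator<double[]> get(String rapidsExpr) {
        return cache.computeIfAbsent(rapidsExpr, CompiledRapidsCache::compile);
    }

    // Stand-in for real codegen + in-memory compilation; returns a toy transform.
    private static UnaryOperator<double[]> compile(String rapidsExpr) {
        return row -> row.clone();
    }
}
```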

Given a set of Rapids expressions, we know from the input Frame and the output Frame what the column types are for the munging method.

The POJOs currently require an array of doubles. We should additionally generate a properly typed predict() method. In addition, it's likely that the developer will want to send us an unparsed line of data. Given a ParseSetup we should be able to generate a single-line parser for the POJO. So: predict(double[]), predict(String, int, double...), and predict(String). Take care about collisions if all the columns are doubles or there's only a single String column. :-)
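A minimal sketch of how those three entry points might sit side by side (hypothetical class and method bodies, not the actual h2o-genmodel API). Note the collision hazard mentioned above: if every column were numeric, the typed variant would erase to predict(double...), which is the same signature as predict(double[]):

```java
// Hypothetical sketch of the three proposed predict() entry points.
public class PipelinePojoSketch {
    private static final String SEPARATOR = ",";  // assumed to come from ParseSetup

    // Existing style: caller supplies an already-parsed row of doubles.
    public double predict(double[] row) {
        double score = 0.0;
        for (double v : row) score += v;  // stand-in for real model scoring
        return score;
    }

    // Properly typed variant generated from the training frame's column types.
    public double predict(String city, int hour, double... numerics) {
        double[] row = new double[numerics.length + 2];
        row[0] = city.hashCode();  // stand-in for a real categorical encoding
        row[1] = hour;
        System.arraycopy(numerics, 0, row, 2, numerics.length);
        return predict(row);
    }

    // Single-line parser generated from the frame's ParseSetup; a real one
    // would handle quoting, NAs, separators inside fields, etc.
    public double predict(String rawLine) {
        String[] tokens = rawLine.split(SEPARATOR);
        double[] numerics = new double[tokens.length - 2];
        for (int i = 2; i < tokens.length; i++)
            numerics[i - 2] = Double.parseDouble(tokens[i].trim());
        return predict(tokens[0].trim(), Integer.parseInt(tokens[1].trim()), numerics);
    }
}
```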

In the use case we heard about yesterday the customer wants to call out to a web service at scoring time and join the data that is returned, to create one of the rows. At training time that joined data will obviously already be in the training data. We, at least for now, will assume that all the fields present in the training frame will also be present at prediction time. If the customer needs to call out and join they will do this before calling the transformAndPredict() method of the POJO.

Model POJO generation will optionally take a pipeline ID as a parameter, and serialize the pipeline into the POJO. The POJO will then have two kinds of methods: predict() and transformAndPredict(), times the three kinds of parameter lists listed above. Four of the six will be simple helpers based on the others.
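Continuing the hypothetical sketch above, the transform variants would be thin wrappers that run the serialized pipeline and then delegate, which is why four of the six methods reduce to simple helpers:

```java
// Hypothetical: transformAndPredict() runs the pipeline, then delegates.
public class TransformingPojoSketch extends PipelinePojoSketch {
    // Stand-in for the Java body generated from the pipeline's Rapids steps.
    private double[] transform(double[] rawRow) {
        return rawRow.clone();  // identity placeholder
    }

    public double transformAndPredict(double[] rawRow) {
        return predict(transform(rawRow));
    }
}
```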

The current plan is to define a pipeline and add expressions to it explicitly. We may decide in a future release to automatically track the Rapids expressions that are run on each Frame so that ad-hoc usage can lead easily to a pipeline.

Pipelines should be exportable / importable.

Caveats / questions:

There are obviously some operations that don't make sense when one has a single row. Perhaps these all have reduce() steps? Not sure. But they should be disallowed in pipelines.

There are obviously opportunities for optimizations across the ASTs. That's a separate task.

Should we be able to add Java expressions to a pipeline from the Java API?

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: Next steps:

- Get the customer's use case; in the meantime:
- Hand-write example Java code bodies for Chicago Crime and the first (non-join) part of Citibike, and get them to work inside an MRTask and the POJO.
- Start looking at how to "print" them from a list of Rapids ASTs.
- Add typed predict() to the POJOs.
- Write a single-line row parser based on ParseSetup and maybe Parse output. Add this to the POJOs, along with a helper predict() method.

exalate-issue-sync[bot] commented 1 year ago

Cliff Click commented: "Plan on Change, instead of Changing your Plans"

i.e. Prototype!!!!

All these comments "it shall be this or that" are short-sighted.
We honestly don't know what "it shall be"; we're making stuff up as we go along.
Admit that up front! i.e., acknowledge change & rapid turnover. Plan on it. Plan on whatever we do being crap the 1st or 2nd go-round, and flipping as needed. e.g. this is the FOURTH major rewrite of the Exec/Rapids stuff (3rd rewrite of data storage; fluid-vecs was a major rewrite of ValueArray, itself a rewrite of a prototype that came before it; MRTask is also the 3rd major rewrite of the whole distributed-execution framework).

So don't box yourself into "it shall be this" up front. That's premature spec writing.

Do toss out ideas to be brainstormed over and experimented with; there's lots of good stuff in the opening comments. Do attempt some of the above ("pipeline as 1st-class, in the DKV & REST"). Do NOT make heroic attempts at everything listed above, because some of it will be insanely hard... and not worth it.
That "1-line parser" sounds trivial, but is actually gonna be really hard - 'cause no parser on the planet parses all things. And some of it is probably bogus crap, which you don't know is bogus until after the customer problem gets solved, and everybody sighs in relief, and you look back and realize you never really needed Thing X - but it was in the spec... so we burned some engineer making it...

Do NOT skip hard customer problems because they look hard - "We, at least for now, will assume that all the fields present in the training frame will also be present at prediction time... If the customer needs to call out and join they will do this before calling (H2O)..." - wrong, wrong, wrong. The customer needs to call out in the middle of the scoring pipeline. The customer was very clear on this... i.e. the customer needs to interleave customized bits with H2O-gen'd bits. Plan on it. Help it. We will NOT have all the fields from start-to-end of the pipeline. Instead, the customer will likely munge some data (which we can do), call out to weirdo stuff (external web interfaces, DB calls), probably passing along the munged stuff as keys, get results, and fold them into the munging pipeline. Lather, rinse, repeat a few times. Then pump that into a model. Then munge the results, feed them back into another model, more customized callouts, etc...

Let's build solutions for the one or two data pipeline problems we have in-house. Let's build them end-to-end, withOUT putting down a lot of restrictions on "future flexibility" or "this is how we save the world" or "all of data munging". Let's be flexible and fast and ... GET TO THE END.
Prototype the WHOLE of these solutions; ugly, built quick, nasty hack shortcuts, the whole nine yards. But we'll experience the entire range of the problem, and several (hacky) attempts at solving them. Then look back at what we built, and with some 20/20 hindsight, we'll have a vastly better idea of what a "spec" should be.

There will be a rewrite of the "H2O data step", probably several.
Let's plan on it from the start.

Cliff

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: I agree. That's why "hand hack a Java body that works both ways, for our two in-house use cases" is the first step. :-)

The rest of it is all current thinking, which will surely change as we prototype.

As for the one-line parser generator, this is something that Michal has been wanting for his Sparkling Water demos, and he intends to hand it to his Czech shadow. This is why it's higher priority than it otherwise would be.

exalate-issue-sync[bot] commented 1 year ago

Spencer Aiello commented: so we veto'd "cbind" as an op in the assembly line, but it's a totally legitimate part of the stream processing flow: i.e. fuse two streams (i.e. combine 2 rows into 1 row).

the cbind step is simply:

  1. run pipeline2: (row -> row)
  2. combine the result of pipeline2 with the current row: ((row, row) -> row)

I do think having a single source and a sink per assembly is a good enough first approximation.
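A tiny sketch of that fuse step, under the assumption that each pipeline is just a row -> row function (all names here are illustrative):

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

public class FuseSketch {
    // Per-row analogue of cbind: concatenate two rows into one.
    static double[] fuse(double[] left, double[] right) {
        double[] out = Arrays.copyOf(left, left.length + right.length);
        System.arraycopy(right, 0, out, left.length, right.length);
        return out;
    }

    public static void main(String[] args) {
        UnaryOperator<double[]> pipeline2 = row -> {  // step 1: (row -> row)
            double[] r = row.clone();
            r[0] = Math.sqrt(r[0]);
            return r;
        };
        double[] row1 = {1.0, 2.0};
        double[] row2 = pipeline2.apply(new double[]{9.0});
        System.out.println(Arrays.toString(fuse(row1, row2)));  // [1.0, 2.0, 3.0]
    }
}
```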

exalate-issue-sync[bot] commented 1 year ago

Mark Landry commented: My thoughts thus far:

There isn't a requirement to implement a pipeline to execute the customer request. Java glue will suffice. If we want an excuse to try it out, fair enough, but we do not need to think that the current use case necessitates a scoring pipeline.

A scoring pipeline, as assumed in this ticket, is not the same thing as having a transform POJO. Having a concept of a transform is quite useful independently, and then we can ask users to chain the two together themselves until a pipeline concept exists. It is not very useful to have a pipeline without a concept of a transform but plenty useful to have a transform without a pipeline. Ensembles/superlearner are a good exception, however (so what is the plan there, anyway?).

The requests I have heard for a concept of a pipeline are most often asking for either (1) a statistically sound cross-validation approach; or (2) a method of creating features that feed a model fit without intermediate steps (e.g. CSVs); or (3) both. Notice that this is not scoring, but modeling. So a scoring pipeline is not the same as a modeling pipeline, and we should be sure which of these we want to approach. Both are good. The implementation details of this ticket are concerned with a scoring pipeline.

I agree with proceeding to get something operational, rather than having all the details worked out in advance. So the following is just a discussion of how architecture choices may impact usability. A good implementation of a modeling pipeline or scoring pipeline will allow multiple transforms and multiple predicts, in any order. Even if you can wrap multiple transformation expressions in a single transform call, it would still be beneficial to have a transform step occurring after a model. Simple use case: I almost always wrap a min/max (1/0) around binary classification from GBM because it can go beyond those bounds and there's no benefit to not capping it. Today we ask our users to do that transformation outside the POJO. No problem. But a flexible pipeline that allows a predict then a transform (or T/P/T, or T/P/P/T, or whatever) would be a great benefit of a flexible implementation. To me, that seems like a Pipeline POJO which is made up of 0+ transform objects and 0+ prediction objects; but, whatever the implementation, the end result would be nice.
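As a sketch of that flexibility (hypothetical names and a toy "model" stage, not a proposed API), a pipeline could simply be an ordered list of row -> row stages, where the final stage is the min/max capping example from above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class StagePipelineSketch {
    private final List<UnaryOperator<double[]>> stages = new ArrayList<>();

    public StagePipelineSketch add(UnaryOperator<double[]> stage) {
        stages.add(stage);
        return this;
    }

    public double[] run(double[] row) {
        for (UnaryOperator<double[]> s : stages) row = s.apply(row);
        return row;
    }

    public static void main(String[] args) {
        StagePipelineSketch p = new StagePipelineSketch()
            .add(r -> new double[]{Math.log1p(r[0]), r[1]})               // transform
            .add(r -> new double[]{0.3 * r[0] + 0.7 * r[1]})              // stand-in predict
            .add(r -> new double[]{Math.min(1.0, Math.max(0.0, r[0]))}); // T after P: clamp to [0,1]
        System.out.println(p.run(new double[]{4.0, 1.2})[0]);
    }
}
```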

Scikit's pipeline is a useful template for a modeling pipeline: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

"Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below."

exalate-issue-sync[bot] commented 1 year ago

Spencer Aiello commented: If I understand correctly, the distinction is:

A. The customer request is really about creating an exportable flow of data transformations that happen per row (this is what I'm thinking of as an Assembly type object).
B. sklearn/mllib Pipeline is a different animal altogether.

This ticket currently blends the two together, and we should make it clear that we're greedily going after A, so don't hope for B in the short term?

exalate-issue-sync[bot] commented 1 year ago

Mark Landry commented: Yeah, I think so. This ticket may not even blend them together; perhaps it's just me. More often than not, when I get asked about pipelines, it's about a modeling pipeline, but I could be wrong; it's just important to ensure we all understand the difference and focus on the one we want.

The current customer request fits a scoring pipeline (A) fairly well. However, it's possible to be successful without any changes to H2O: Java glue that produces an end-of-the-day JAR file with H2O as one component can work. We don't want to burn a developer building an official pipeline and an official transform into the product when Java glue was all that was necessary today.

exalate-issue-sync[bot] commented 1 year ago

Prithvi Prabhu commented: Correct.

A first-class notion of (row -> row -> row -> ...) transformations, captured as a "sequence of row transformations", should be all that is required for a Rapids -> Java codegen to jam into a pre-score stage in the POJO.

Make this highly restricted in what's allowed in a "row -> row" transformation (a tuple -> tuple map op), and we'll have a good, functional first cut.

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: The one thing that I don't like about scikit-learn pipelines is that they always end in a model builder. A sequence of row -> row transforms ought to be reusable.

How about we call these something like Transform and TransformSequence to make it clear what we're talking about?
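A minimal sketch of what those two names might look like, assuming the restricted tuple -> tuple map op from the previous comment (illustrative, not a committed API); note that a TransformSequence is itself a Transform, so sequences compose and need not end in a model:

```java
import java.util.Arrays;
import java.util.List;

// A single reusable row -> row step.
interface Transform {
    double[] apply(double[] row);
}

// A composable sequence of steps; reusable on its own, no final estimator required.
class TransformSequence implements Transform {
    private final List<Transform> steps;

    TransformSequence(Transform... steps) {
        this.steps = Arrays.asList(steps);
    }

    @Override public double[] apply(double[] row) {
        for (Transform t : steps) row = t.apply(row);
        return row;
    }
}
```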

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-1882
Assignee: Spencer Butt
Reporter: Raymond Peck
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A