i2mint / know

Funnel live streams of data into storage and other processes

Deploying Solutions #10

Open thorwhalen opened 1 year ago

thorwhalen commented 1 year ago

Once you (the "source") build a data processing pipeline, with its data preppers, featurizers, models, etc., how do you get it running on some other "target" system? As usual, this depends on the context: What are the computing capabilities of the target? What are the security and privacy requirements? Etc.

This issue is meant to map out the options and discuss their pros and cons.

Aspects/Options

In a nutshell:

As always, the distinctions can be fuzzy, but these methods are useful to keep in mind, as they all have something to offer. The first two are well known, so they need no description (though they do need a pro/con comparative analysis). The others need some further explanation.

http service wrap

In this method, we use py2http to create web services from our python objects. The idea is then that, on the target side, the client launches the service and uses the REST API to interface with it. Since a REST API is platform independent, the client can interface with the deployed service from any system they want, in whatever language they want.

Good? Well it's even better!

py2http can also generate an OpenAPI specification (JSON) of the services, which can (arguably, should) be used not only to get clean, interactive documentation of the services, but, more importantly, to automatically create code to interface with the services in pretty much any major language.

Note that in order to run the service on the target computer without any external connection to the web, we still have the problem of getting the necessary resources onto that computer. For this we can use public and private package repositories (installed in the standard way) and/or containers (e.g. Docker).
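To make the mechanics concrete, here is a hand-rolled sketch of what such a wrap boils down to, using only the standard library. This is not py2http (which does this for you, and adds the OpenAPI part); the add function and the /add route are made up for illustration.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def add(x, y):
    """The python object we want to deploy."""
    return x + y

# name -> callable registry; py2http builds this kind of mapping from the
# objects you hand it (here we write it by hand)
FUNCS = {'add': add}

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        func = FUNCS.get(self.path.strip('/'))
        if func is None:
            self.send_error(404, 'no such function')
            return
        length = int(self.headers.get('Content-Length', 0))
        kwargs = json.loads(self.rfile.read(length) or b'{}')
        result = func(**kwargs)  # call the wrapped python object
        body = json.dumps({'result': result}).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)

if __name__ == '__main__':
    # POST {"x": 1, "y": 2} to http://localhost:8080/add -> {"result": 3}
    HTTPServer(('localhost', 8080), Handler).serve_forever()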

py-to-some-platform-independent-declarative-language

py2json is an example of a repository that gathered some ideas (and working utils) around this theme.

Essentially, here, we want to offer the tools to make "coder" and "decoder" functions for classes of objects, such that tests made with validate_codec pass for specific command and comparison arguments.
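To illustrate the contract (this is not py2json's actual API; the behaviorally_equivalent helper below is just a stand-in for the kind of check validate_codec formalizes):

import json

class Scaler:
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return x * self.factor

def coder(obj):  # object -> jsonizable dict
    return {'type': 'Scaler', 'factor': obj.factor}

def decoder(d):  # jsonizable dict -> object
    assert d['type'] == 'Scaler'
    return Scaler(d['factor'])

def behaviorally_equivalent(obj, coder, decoder, command):
    """Round-trip through json text and compare behavior on the given command."""
    remade = decoder(json.loads(json.dumps(coder(obj))))
    return command(obj) == command(remade)

assert behaviorally_equivalent(Scaler(3), coder, decoder, command=lambda f: f(7))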

The idea here is to generate a specification of the resources that can be stored and/or transmitted, then parsed and interpreted on the target system, possibly by another language. This is in line with the declarative programming paradigm. There are a few important aspects involved here.

py2json tackles some of these problems. For example, its angle on "serialization" is that the target will never have, nor ever need, an equal copy of the source's resources, but instead a "behaviorally equivalent" one, and that often, in the ML context, the target actually needs a much smaller subset of behaviors (it usually only needs to "run" models, not "train" them). See the discussion about serialization and behavior equivalence in the readme of py2json.

Once we've clarified what the target actually needs, we have the question of how to pinpoint the (recursive) dependencies of those needs. py2json birthed the footprints module (now moved to i2) to help with that.

serializing_sklearn_estimators shows how one can define, semi-automatically, JSON serializers for most of the 200+ sklearn estimators.
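As a crude, hand-written illustration of the same idea (serializing only what the target needs to run the model, not what was needed to train it; requires scikit-learn and numpy):

import json
import numpy as np
from sklearn.linear_model import LinearRegression

def fitted_linreg_to_jdict(model):
    # keep only the dependencies of the "predict" behavior
    return {'coef': model.coef_.tolist(), 'intercept': float(model.intercept_)}

def jdict_to_predictor(jdict):
    """Return a behaviorally equivalent, predict-only proxy."""
    coef = np.array(jdict['coef'])
    intercept = jdict['intercept']
    return lambda X: np.asarray(X) @ coef + intercept

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

predict = jdict_to_predictor(json.loads(json.dumps(fitted_linreg_to_jdict(model))))
assert np.allclose(predict(X), model.predict(X))  # behavioral equivalence on "predict"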

Regarding the question of how the needed resources are communicated, note that this is a concern in any context. For instance, popular programming languages all have their frameworks to deal with package dependencies (pip in python, npm in node, etc.). In py2http or front, when we dispatch functions that need to handle inputs (or outputs) that are not "simple", we need to find means to represent them. The issue here is of the same class.

Let's take a very easy example. Suppose you want to deploy foo to the target:

def foo(func, x):
    return func(x)

where the first argument, func, is a callable (e.g. a function).

Let's say that now the target has some equivalent proxy of foo -- we say "equivalent proxy" here because it could be that the function is in a different language, or even a different interface altogether (for example, some GUI).

For example, the C proxy might look like this:

void foo(void (*func)(void*), void* x) {
    func(x);
}

How, then, will the source communicate to the target which func to use? Continuing with our example, what would the target equivalent of the following source python be:

def bar(x):
    return x + 1

def baz(x):
    return x * 2

y = foo(bar, 7)
z = foo(baz, 10)

In the case of front (for GUIs), we can use crude.py along with a store containing bar and baz -- or, if the user needs to create (and save) their own custom functions, function factories, etc. -- and a GUI element is created that sources itself from this "store of functions", so that the user can select the needed function.

One can do something similar in the case of the C target: Use a mapping (static, or dynamic using C function factories) that would allow the C user to point to a proxy of 'bar' or of 'baz' on the C side.
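Here is a python sketch of this "store of functions" idea (the store contents and key names are just for illustration; in front this is roughly what crude.py mediates, and on a C target the store would be a C-side mapping):

def bar(x):
    return x + 1

def baz(x):
    return x * 2

# On the target side: a store mapping keys to local (proxy) implementations.
func_store = {'bar': bar, 'baz': baz}

def foo(func_key, x, *, store=func_store):
    """The target's proxy of foo: it receives a key, not a callable."""
    return store[func_key](x)

# What the source communicates is then just data: a key and an argument.
assert foo('bar', 7) == 8
assert foo('baz', 10) == 20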

Transpile

There are packages out there that claim to take care of this kind of thing: give them python and they will create an executable... or C code, or JS code, etc.

I don't believe (but am (very) happy to be proven wrong) that this is something we can rely on in the general case.

But if we allow ourselves to consider the spectrum of possibilities (where we'll find, on the opposite end of transpiling, solutions resembling the last section), we can get something useful.

To illustrate the point: if we constrain our python to be written using only the simple expressions that code_to_dag can parse (see this example here or here), and to use only functions that have proxies on the target, then transpilation should be simple and robust.
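A toy illustration of why the constrained case is easy (the c_chunker/c_featurizer/c_model proxy names are made up, and this is not how code_to_dag itself works -- it's just the flavor of the transformation):

import ast

SOURCE = """
wfs = chunker(audio)
feats = featurizer(wfs)
preds = model(feats)
"""

# python function name -> name of its (assumed) proxy on the target
TARGET_PROXIES = {'chunker': 'c_chunker', 'featurizer': 'c_featurizer', 'model': 'c_model'}

def transpile_to_c_like(source, proxies=TARGET_PROXIES):
    lines = []
    for stmt in ast.parse(source).body:
        # only `name = func(arg, ...)` statements are allowed
        assert isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call)
        target = stmt.targets[0].id
        func = proxies[stmt.value.func.id]
        args = ', '.join(a.id for a in stmt.value.args)
        lines.append(f'{target} = {func}({args});')
    return '\n'.join(lines)

print(transpile_to_c_like(SOURCE))
# wfs = c_chunker(audio);
# feats = c_featurizer(wfs);
# preds = c_model(feats);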

More relevant resources

pickle

pickle is the python built-in for serialization, and what tools (e.g. multiprocess) use by default to communicate python objects between processes.

Out of the box, it is "fragile". One can easily create objects that are un-serializable (e.g. lambda functions and closures), and, more scarily, fail to deserialize something that used to deserialize fine (because the class where the data should be injected "has changed"). If we can get a good handle on these two issues, it would help us a lot going forward, at least as far as python-to-python is concerned. As for non-python targets (e.g. JS or C), we could still use pickle, but we would need to find pickle deserializers in the target language (e.g. node-jpickle).

The documentation has plenty of information that can enable us to think about the issue correctly, whether we customize pickle or make our own serializers.

See the Comparison with json section, for example.
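A quick demonstration of that fragility: functions are pickled by reference (module plus qualified name), so lambdas and closures fail, and renaming or moving a class breaks old pickles.

import pickle

def plus_one(x):  # module-level function: pickled by reference, works
    return x + 1

assert pickle.loads(pickle.dumps(plus_one))(1) == 2

try:
    pickle.dumps(lambda x: x + 1)  # no importable name -> cannot pickle
except (pickle.PicklingError, AttributeError) as e:
    print('lambda not picklable:', e)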

dill

dill is a module that allows for the serialization and deserialization of Python objects, including functions and classes. It is similar to the built-in pickle module but provides additional features such as the ability to pickle lambdas, class methods and closures, and the ability to pickle functions that are defined interactively in the interpreter.

dill has no third-party dependencies
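For example (here dill simply handles what plain pickle refuses):

import dill

double = dill.loads(dill.dumps(lambda x: x * 2))  # a lambda: plain pickle would refuse
assert double(21) == 42

def make_adder(n):
    return lambda x: x + n  # a closure: also not picklable by plain pickle

add3 = dill.loads(dill.dumps(make_adder(3)))
assert add3(4) == 7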

cloudpickle

Historically, cloudpickle came into our lives because some data scientists couldn't serialize some of their objects, even with dill. It is not clear to me what kinds of things cloudpickle can handle that dill cannot (search results on this are contradictory). In the experience reported to me, cloudpickle was more permissive for serializing, but was heavier and had problems when deserializing in a different python version.

cloudpickle has no third-party dependencies

An example of deployment/provisioning as parametrization

In these examples, the structure is fixed (called "GLUE" in the images) to be a simple chunker->featurizer->model pipeline. The images show how the behavior of the system can change by switching out functions and/or parametrizations thereof. Obviously, the structure itself could also be changed by simply specifying a different one (e.g. via a DAG specification).

[images: diagrams of the fixed chunker -> featurizer -> model ("GLUE") pipeline with different functions/parametrizations plugged in]
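A minimal sketch of the idea (toy chunker/featurizer/model functions; nothing domain-specific):

from functools import partial

def fixed_size_chunker(stream, chk_size=4):
    return [stream[i:i + chk_size] for i in range(0, len(stream), chk_size)]

def mean_featurizer(chunk):
    return sum(chunk) / len(chunk)

def threshold_model(feature, threshold=5):
    return int(feature > threshold)

def glue(stream, chunker, featurizer, model):
    """The fixed pipeline structure; only its components vary."""
    return [model(featurizer(chunk)) for chunk in chunker(stream)]

stream = [1, 2, 3, 10, 11, 12, 2, 2]
# Two "deployments" of the same structure, differing only in parametrization:
print(glue(stream, fixed_size_chunker, mean_featurizer, threshold_model))  # [0, 1]
print(glue(stream, partial(fixed_size_chunker, chk_size=2), mean_featurizer,
           partial(threshold_model, threshold=1)))  # [1, 1, 1, 1]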