konstantint / SKompiler

A tool for compiling trained SKLearn models into other representations (such as SQL, Sympy or Excel formulas)
MIT License

SKompiler: Translate trained SKLearn models to executable code in other languages

The package provides a tool for transforming trained SKLearn models into other forms, such as SQL queries, Excel formulas, Portable Format for Analytics (PFA) files or Sympy expressions (which, in turn, can be translated to code in a variety of languages, such as C, Javascript, Rust, Julia, etc).

Installation

The simplest way to install the package is via pip:

$ pip install SKompiler[full]

Note that the [full] option also installs sympy, sqlalchemy and astor. These are needed if you plan to convert SKompiler's expressions to sympy expressions (which, in turn, can be compiled to many other languages), to SQLAlchemy expressions (which can be further translated to different SQL dialects) or to Python source code. If you do not need this functionality (say, you only need the raw SKompiler expressions, or only the SQL conversions without the sympy ones), you can skip the optional dependencies by simply writing

$ pip install SKompiler

(You are, of course, free to install any of the extra dependencies you need via separate calls to pip install.)

Usage

Introductory example

Let us start by walking through an introductory example. We begin by training a model on a small dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
X, y = load_iris(return_X_y=True)
m = RandomForestClassifier(n_estimators=3, max_depth=3).fit(X, y)

Suppose we need to express the logic of m.predict in SQLite. Here is how we can achieve that:

from skompiler import skompile
expr = skompile(m.predict)
sql = expr.to('sqlalchemy/sqlite')

Voila, the value of the sql variable is a query that computes the value of m.predict in pure SQL:

WITH _tmp1 AS
(SELECT .... FROM data),
_tmp2 AS
( ... )
SELECT ... FROM _tmp2 ...

Let us import the data into an in-memory SQLite database to test the generated query:

import sqlalchemy as sa
import pandas as pd
conn = sa.create_engine('sqlite://').connect()
df = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4']).reset_index()
df.to_sql('data', conn)

Our database now contains a table named data, whose rows are identified by the key column index. We need to provide this information to SKompiler to have it generate the correct query:

sql = expr.to('sqlalchemy/sqlite', key_column='index', from_obj='data')

We can now query the data:

results = pd.read_sql(sql, conn)

and verify that the results match:

assert (results.values.ravel() == m.predict(X).ravel()).all()

Note that the generated SQL expression uses names x1, x2, x3 and x4 to refer to the input variables. We could have chosen different input variable names by writing:

expr = skompile(m.predict, ['a', 'b', 'c', 'd'])

Single-shot computation

Note that the generated SQL code splits the computation into sequential steps using WITH expressions (common table expressions). In some cases you might want to have the whole computation "inlined" into a single expression. You can achieve this by specifying multistage=False:

sql = expr.to('sqlalchemy/sqlite', multistage=False)

Note that in this case the resulting expression would typically be several times longer than the multistage version:

len(expr.to('sqlalchemy/sqlite'))
> 2262
len(expr.to('sqlalchemy/sqlite', multistage=False))
> 12973

Why so? Because, for a typical classifier (including the one used in this example)

predict(x) = argmax(predict_proba(x))

There is, however, no single argmax function in SQL, hence it has to be faked using the following logic:

predict(x) = if predict_proba(x)[0] == max(predict_proba(x)) then 0
                else if predict_proba(x)[1] == max(predict_proba(x)) then 1
                else 2

If SKompiler is not allowed to use a separate step to store the intermediate predict_proba outputs, it is forced to inline the same computation verbatim multiple times. To summarize, you should probably avoid multistage=False in most cases.
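The effect of the intermediate step can be illustrated without SKompiler at all. The following is a minimal sketch using Python's built-in sqlite3 module. The table contents, the three score expressions and the helper argmax_case are made up for illustration and are not what SKompiler actually generates; the sketch only shows the same CASE-based argmax emulation written once with a WITH step and once inlined.

```python
import sqlite3

# Hypothetical toy data (id, x1, x2); not the iris dataset.
rows = [(1, 0.2, 0.9), (2, 0.8, 0.1), (3, 0.5, 0.5), (4, 0.1, 0.3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# Made-up per-class scores standing in for the predict_proba outputs.
scores = ["(x1)", "(x2)", "(1 - x1 - x2)"]


def argmax_case(p):
    """Emulate argmax over three SQL expressions with a CASE expression."""
    biggest = "max({}, {}, {})".format(*p)
    return "CASE WHEN {0} = {2} THEN 0 WHEN {1} = {2} THEN 1 ELSE 2 END".format(
        p[0], p[1], biggest
    )


# Multistage: the scores are computed once and stored in a WITH step.
multistage = (
    "WITH _probs AS (SELECT id, {} AS p0, {} AS p1, {} AS p2 FROM data) "
    "SELECT id, {} AS y FROM _probs ORDER BY id"
).format(*scores, argmax_case(["p0", "p1", "p2"]))

# Inlined: every reference to a score repeats its full expression.
inlined = "SELECT id, {} AS y FROM data ORDER BY id".format(argmax_case(scores))

multistage_result = conn.execute(multistage).fetchall()
inlined_result = conn.execute(inlined).fetchall()

# Reference argmax computed in plain Python (first maximum wins on ties).
expected = [
    (i, max(range(3), key=[a, b, 1 - a - b].__getitem__)) for i, a, b in rows
]
```

Both queries return the same labels, but the inlined one spells out each score expression several times. With the much larger predict_proba expression of a real forest, that repetition is what inflates the query.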

Other formats

By changing the first parameter of the .to() call you may produce output in formats other than SQLite, including the other SQL dialects, Excel formulas, PFA files and Sympy expressions mentioned above.

Other models

So far this has been a fun two-weekends-long project, hence translation is implemented for a limited number of models. The most basic ones (linear models, decision trees, forests, gradient boosting, PCA, KMeans, MLP, Pipeline and a couple of preprocessors) are covered, however, and this is already sufficient to compile nontrivial constructions. For example:

m = Pipeline([('scale', StandardScaler()),
              ('dim_reduce', PCA(6)),
              ('cluster', KMeans(10)),
              ('classify', MLPClassifier([5, 4], 'tanh'))])

Even though this particular example probably does not make much sense from a machine learning perspective, it would happily compile to both Excel and SQL forms nonetheless.

How it works

The skompile procedure translates a given method into an intermediate syntactic representation (called SKompiler AST or SKAST). This representation uses a limited number of operations so it is reasonably simple to translate it into other forms.

In principle, SKAST's utility is not limited to sklearn models. Anything you translate into SKAST automatically becomes compilable to whatever output backends are implemented in SKompiler. Generating raw SKAST is quite straightforward:

from skompiler.dsl import ident, const
expr = const([[1,2],[3,4]]) @ ident('x', 2) + 12
expr.to('sqlalchemy/sqlite', 'result')
> SELECT 1 * x1 + 2 * x2 + 12 AS result1, 3 * x1 + 4 * x2 + 12 AS result2 
> FROM data

You can use repr(expr) on any SKAST expression to dump its unformatted internal representation for examination or str(expr) to get a somewhat-formatted view of it.
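To make the idea of an intermediate representation more concrete, here is a minimal sketch of a toy expression tree with two "backends". The classes Const, Ident and BinOp and the functions to_sql and evaluate are hypothetical and far simpler than SKAST; they only illustrate why a small, fixed set of node types is easy to translate into several output forms.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical node types for illustration only; SKAST itself is richer.
@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Ident:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # either "+" or "*"
    left: "Node"
    right: "Node"


Node = Union[Const, Ident, BinOp]


def to_sql(node):
    """Backend 1: render the tree as a fully parenthesized SQL expression."""
    if isinstance(node, Const):
        return repr(node.value)
    if isinstance(node, Ident):
        return node.name
    return "({} {} {})".format(to_sql(node.left), node.op, to_sql(node.right))


def evaluate(node, env):
    """Backend 2: compute the value directly in Python."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Ident):
        return env[node.name]
    left, right = evaluate(node.left, env), evaluate(node.right, env)
    return left + right if node.op == "+" else left * right


# 1 * x1 + 2 * x2 + 12, i.e. the first output of the matrix example above.
tree = BinOp(
    "+",
    BinOp("+", BinOp("*", Const(1), Ident("x1")), BinOp("*", Const(2), Ident("x2"))),
    Const(12),
)
sql_text = to_sql(tree)
value = evaluate(tree, {"x1": 3, "x2": 4})
```

Each backend is one recursive function with one branch per node type, so supporting another output form means writing one more such function rather than touching the model-specific code.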

It is important to note that for larger models (say, a random forest or a gradient boosted model with 500+ trees) the resulting SKAST expression tree may become deeper than Python's default recursion limit of 1000. As a result, some translators may raise a RecursionError when processing such expressions. This can be solved by raising the system recursion limit to a sufficiently high value:

import sys
sys.setrecursionlimit(10000)
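If you would rather not change the limit for the whole process, it can be raised only around the translation call. The following is a minimal sketch: the recursion_limit context manager and the depth function are hypothetical helpers, not part of SKompiler, and the nested tuple merely stands in for a deep expression tree.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily set the interpreter recursion limit, then restore it."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def depth(node):
    """Recursively measure the nesting depth of a chain of 1-tuples."""
    return 0 if len(node) == 0 else 1 + depth(node[0])


# Build a 3000-level nested structure iteratively (no recursion needed here).
nested = ()
for _ in range(3000):
    nested = (nested,)

# With a limit of 1000 the recursive traversal fails.
with recursion_limit(1000):
    try:
        depth(nested)
        failed = False
    except RecursionError:
        failed = True

# With a higher limit the same traversal succeeds.
with recursion_limit(10000):
    measured = depth(nested)
```

Keep the limit only as high as you need: the Python documentation warns that a limit that is too high can crash the interpreter.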

Development

If you plan to develop or debug the package, consider installing it by running:

$ pip install -e .[dev]

from within the source distribution. This will install the package in "development mode" and include extra dependencies, useful for development.

You can then run the tests by typing

$ py.test

at the root of the source distribution.

Contributing

Feel free to contribute or report issues via the project's GitHub repository (konstantint/SKompiler).

Copyright & License

Copyright: 2018, Konstantin Tretyakov. License: MIT