konstantint / SKompiler

A tool for compiling trained SKLearn models into other representations (such as SQL, Sympy or Excel formulas)
MIT License

SQL code for a single object should not need a table #9

Closed: IrinaFikh closed this issue 2 years ago

IrinaFikh commented 2 years ago

Can you please change the generated SQL for the single-object case? There we don't need the join by id or copying the input into a "data" table; only parameters in a parametrized query (e.g., :x1, :x2, etc.) or placeholders for substitution (e.g., #x1#, #x2#, etc.). Such a query is simpler and runs very fast (especially the parametrized one).

konstantint commented 2 years ago

I'm not sure I understood the issue. Could you clarify by example?

IrinaFikh commented 2 years ago

Hi Konstantin, thank you very much for the fast reaction (and of course for the converter :)). Sometimes you need to apply a model to a single object in near real time. E.g., a mobile app user performs actions that can change some score of theirs, and the app should adjust its interaction with this user as quickly as possible. Since SQLite exists on mobile devices, you can run SQL there. Right now the app can copy this user's data into the table and run your SQL as-is (additional time and memory). But this is unnecessary for a single object, so it's better to use SQL without a table. The generated SQL always contains parameter names, the table "data" and its id (and joins by id). For a single object you could convert the expression to parametrized SQL without any table-related syntax. E.g., where it is now:

SELECT id, some_calculation(x1, ..., xn) FROM data

it would become:

SELECT some_calculation(:x1, ..., :xn)
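
For example, the difference in SQLite terms, sketched with x1 + x2 standing in for the real generated expression (values are illustrative):

import sqlite3

conn = sqlite3.connect(':memory:')

# Today: copy the single object into a "data" table, then run the generated SQL
conn.execute('CREATE TABLE data (id INTEGER, x1 REAL, x2 REAL)')
conn.execute('INSERT INTO data VALUES (1, ?, ?)', (5.1, 3.5))
print(conn.execute('SELECT id, x1 + x2 FROM data').fetchone())

# Proposed: a table-free parametrized query, evaluated directly
print(conn.execute('SELECT :x1 + :x2', {'x1': 5.1, 'x2': 3.5}).fetchone())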

konstantint commented 2 years ago

The use-case seems a bit unusual to me (because why would you evaluate expressions in SQL if you are not querying a database in the first place), but I understand what you mean.

So here's what you can do "out of the box" currently:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skompiler import skompile
X, y = load_iris(return_X_y=True)
m = RandomForestClassifier(n_estimators=3, max_depth=3).fit(X, y)
skompile(m.predict).to('sqlalchemy/sqlite', multistage=False, from_obj=None)

Note the from_obj=None. This will produce a query without IDs or a FROM clause.

There are two problems here, however, which you might want to address somehow.

1. Escaped identifiers.

If you want your query to include identifiers of the form :x:, you might try something like this:

skompile(m.predict, [":a:", ":b:", ":c:", ":d:"]).to('sqlalchemy/sqlite', multistage=False, from_obj=None)

However, you'll notice that SQLAlchemy will then quote your identifiers (SELECT ":a:" ...), which is not what you want.

The simple hacky solution is to post-process the string, str-replacing the quoted identifiers with whatever you want them to be.
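
A rough sketch of that post-processing, assuming the four identifiers above (SQLite quotes identifiers with double quotes, as in the SELECT ":a:" example):

sql = skompile(m.predict, [":a:", ":b:", ":c:", ":d:"]).to(
    'sqlalchemy/sqlite', multistage=False, from_obj=None)
# Strip the quotes SQLAlchemy added around each identifier
for name in [":a:", ":b:", ":c:", ":d:"]:
    sql = sql.replace('"{0}"'.format(name), name)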

A nicer approach would be to disable escaping in the SQL generation logic. This means adding the is_literal=True parameter to the sa.column call in the library's Identifier handler.

Without having to update the current library code, you could introduce the modification in a subclass of SQLAlchemyWriter and simply copy those lines as follows:

from skompiler.fromskast.sqlalchemy import SQLAlchemyWriter, Result, to_sql
import sqlalchemy as sa

class CustomSQLAlchemyWriter(SQLAlchemyWriter):
    def Identifier(self, id):
        # is_literal=True stops SQLAlchemy from quoting the identifier
        return Result([sa.column(id.id, is_literal=True)], self.from_obj)

expr = skompile(m.predict, [":a:", ":b:", ":c:", ":d:"])
result = CustomSQLAlchemyWriter(from_obj=None, multistage=False)(expr)
result = sa.select(result.cols, from_obj=result.from_obj)
result = to_sql(result, 'sqlite')
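
As a side note (not part of the snippet above): SQLite's named-parameter syntax is :name with the colon only on the left, so if you pick identifiers like :a rather than :a:, the generated string can be executed directly as a parametrized query. A minimal sketch under that assumption, with illustrative feature values:

import sqlite3

expr = skompile(m.predict, [":a", ":b", ":c", ":d"])
result = CustomSQLAlchemyWriter(from_obj=None, multistage=False)(expr)
sql = to_sql(sa.select(result.cols, from_obj=result.from_obj), 'sqlite')

conn = sqlite3.connect(':memory:')
print(conn.execute(sql, {'a': 5.1, 'b': 3.5, 'c': 1.4, 'd': 0.2}).fetchone())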

2. multistage=False

The example above currently only works with multistage=False, i.e. it generates a lengthy, repetitive, inefficient computation, unlike the multistage=True mode, which extracts the repeated pieces into WITH subexpressions. This limitation is, in principle, unnecessary and can be removed by updating the library a bit. The solution might look something like this:

import numpy as np
from skompiler.fromskast._common import id_generator
from skompiler.fromskast.sqlalchemy import SQLAlchemyWriter, Result, to_sql
import sqlalchemy as sa

class CustomSQLAlchemyWriter(SQLAlchemyWriter):
    def __init__(self, from_obj='data', key_column='id',
                 positive_infinity=float(np.finfo('float64').max),
                 negative_infinity=float(np.finfo('float64').min),
                 multistage=True):
        self.positive_infinity = positive_infinity
        self.negative_infinity = negative_infinity
        if isinstance(from_obj, str):
            from_obj = sa.table(from_obj, sa.column(key_column))
            if key_column is not None:
                from_obj.key_ = from_obj.columns[key_column]
            elif multistage:
                raise ValueError("A key column must be provided for multistage mode when from_obj is not None")
        elif from_obj is not None:
            if key_column not in from_obj.columns:
                raise ValueError("The provided selectable does not contain the key column {0}".format(key_column))
            from_obj.key_ = from_obj.columns[key_column]
        self.from_obj = from_obj
        self.temp_ids = id_generator()
        self.references = [{}]
        self.multistage = multistage

    def Identifier(self, id):
        # As before: emit the identifier verbatim, without quoting
        return Result([sa.column(id.id, is_literal=True)], self.from_obj)

    def _make_cte(self, result, col_names=None, key_label='__id__'):
        if col_names is None:
            col_names = ['f{0}'.format(i+1) for i in range(len(result.cols))]
        labeled_cols = [c.label(n) for c, n in zip(result.cols, col_names)]
        if result.from_obj is None or result.from_obj.key_ is None:
            # No source selectable (or no key): build the CTE without a key column
            new_tbl = sa.select(labeled_cols, from_obj=result.from_obj).cte(next(self.temp_ids))
            new_tbl.key_ = None
        else:
            new_tbl = sa.select([result.from_obj.key_.label(key_label)] + labeled_cols,
                                from_obj=result.from_obj).cte(next(self.temp_ids))
            new_tbl.key_ = new_tbl.columns[key_label]
        new_cols = [new_tbl.columns[n] for n in col_names]
        return Result(new_cols, new_tbl)

expr = skompile(m.predict, [":a:", ":b:", ":c:", ":d:"])
result = CustomSQLAlchemyWriter(from_obj=None, multistage=True)(expr)
result = sa.select(result.cols, from_obj=result.from_obj)
result = to_sql(result, 'sqlite')

I'll consider updating the code to support this functionality a bit later (it still needs tests & docs, which are often the more time-consuming part), but for now see whether the "patching" approach works for you.

IrinaFikh commented 2 years ago

Konstantin, thanks! "why would you evaluate expressions in SQL if you are not querying a database" - there is a very simple reason :) The app will be built with Unity (C#), and C# has no embedded, very fast and lightweight component for running scripts or dynamic functions. So I use SQLite purely as a cross-platform eval function ))

See the timings, e.g.:

Model: random_forest_no_parametrized_sql, Result: 2, Calculation duration: 0.0617237 sec

Model: random_forest_parametrized_sql, Result: 2, Calculation duration: 0.0013169 sec

So we get real-time performance in a very simple way. Thank you very much for SKompiler!

konstantint commented 2 years ago

I see. What you could do to make evaluation even faster, however, is compile the model directly to C#. If you only care about random forests, the generated code is pretty simplistic (it's always a combination of if-then-else blocks and simple arithmetic), so you could perhaps adapt the following example to your needs:

from skompiler.fromskast._common import ASTProcessor, StandardOps, VectorsAsLists, StandardArithmetics, denumpyfy

def _unop(name):
    return lambda self, x: (lambda a: f'{name}({a})')

def _binop(name):
    return lambda self, x: (lambda a,b: f'{name}({a}, {b})')

def _infixop(name):
    return lambda self, x: (lambda a,b: f'({a} {name} {b})')

def _unimpl(self, x):
    return '<unimpl>'

class MyCompiler(ASTProcessor, StandardOps, VectorsAsLists, StandardArithmetics):
    def Identifier(self, name):
        return name.id
    VectorIdentifier = Identifier

    def IndexedIdentifier(self, sub):
        return f'{sub.id}[{sub.index}]'

    def NumberConstant(self, num):
        return str(num.value)

    def VectorConstant(self, vec):
        return f"({','.join([str(denumpyfy(el)) for el in vec.value])})"

    def MakeVector(self, mv):
        return f"{','.join([self(el) for el in mv.elems])}"

    def IfThenElse(self, node):
        return f"lib.If({self(node.test)}, {self(node.iftrue)}, {self(node.iffalse)})"

    TypedReference = Reference = Let = MatrixConstant = _unimpl

    # Functions
    Exp = _unop('lib.Exp')
    Sqrt = _unop('lib.Sqrt')
    Log = _unop('lib.Log')
    Step = _unop('lib.Step')
    VecSum = _unop('lib.Sum')
    ArgMax = _unop('lib.ArgMax')
    Sigmoid = _unop('lib.Sigmoid')
    Softmax = _unop('lib.SoftMax')
    VecMax = _unop('lib.VecMax')
    Max = _unop('lib.Max')
    Abs = _unop('lib.Abs')

    # Operators
    Mul = _infixop('*')
    Div = _infixop('/')
    Add = _infixop('+')
    Sub = _infixop('-')
    USub = _unop('-')
    DotProduct = _binop('lib.Dot')
    MatVecProduct = _binop('lib.MatVec')

    # Predicates
    LtEq = _infixop('<=')
    Eq = _binop('==')

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skompiler import skompile
X, y = load_iris(return_X_y=True)
m = RandomForestClassifier(n_estimators=3, max_depth=3).fit(X, y)
result = MyCompiler()(skompile(m.predict))
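
The result is a single C# expression in a string, which assumes a small hand-written lib helper class (If, Exp, ArgMax, etc.) on the C# side. Purely as an illustration (the class and method names below are made up), you could embed it into compilable C# source like this:

# skompile's default input is a vector named 'x', so the compiled
# expression refers to x[0]..x[3]
csharp_source = f"""public static class Model {{
    public static double Predict(double[] x) {{
        return {result};
    }}
}}"""
print(csharp_source)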

IrinaFikh commented 2 years ago

Hi again, thanks for the code snippet. I will need both classification and forecasting, both supervised and unsupervised ("with/without a teacher"), and I will compare different models. The models will then be retrained on a schedule, e.g. daily, so the same problem with dynamic functions arises...

By the way, we talked about the client side. What if we want to apply a model to data in a DB on the server side? We need to think about queries like these:

SELECT [id columns], score FROM [table_name];                   -- for all records
SELECT [id columns], score FROM [table_name] WHERE [condition]; -- returns a subset of records
SELECT score FROM [table_name] WHERE [condition];               -- returns <= 1 record, where [condition] is: id_column_1 = value_1, ..., id_column_last = value_last

([id columns] because the id can be composite.)

It's not difficult to tune SKompiler's output, but maybe some parts of this are already implemented in SKompiler?

konstantint commented 2 years ago

If you want to query the database "normally" then from_obj is the thing you want to use. You can provide any valid SQLAlchemy selectable expression there - it can be a table or perhaps a CTE where you can add your WHERE filtering as needed.

Multi-column keys are indeed not currently supported; perhaps you could work around this by combining your key columns into one and passing it in the from_obj as the single key column.
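
A rough sketch of that workaround (the table and column names are made up), building the synthetic key inside the from_obj:

import sqlalchemy as sa

# Concatenate the parts of a composite key into one synthetic key column
key = (sa.cast(sa.column('key1'), sa.String) + '_' +
       sa.cast(sa.column('key2'), sa.String)).label('id')
inputs = sa.select([key, sa.column('x1'), sa.column('x2'),
                    sa.column('x3'), sa.column('x4')],
                   from_obj=sa.table('somedata')).cte('inputs')
print(skompile(m.predict).to('sqlalchemy/sqlite', from_obj=inputs))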

konstantint commented 2 years ago

(Also note that the output of skompile(...).to('sqlalchemy') is itself an SQLAlchemy expression, which you can continue building on by joining other data, filtering, etc.)

IrinaFikh commented 2 years ago

Thanks, great feature

konstantint commented 2 years ago

In case it helps, here's an example of building the from_obj:

print(skompile(m.predict).to('sqlalchemy/sqlite',
                             from_obj=sa.select(list(map(sa.column, ['id', 'x1', 'x2', 'x3', 'x4'])),
                                                from_obj=sa.table('somedata'))
                                       .where(sa.text('id > 12')).cte('inputs')))

IrinaFikh commented 2 years ago

Thanks!