sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0
5.09k stars 699 forks source link

[Discussion] Save a SQL program ( or pipeline ) for future use #2108

Open typhoonzero opened 4 years ago

typhoonzero commented 4 years ago

Background

Consider below SQL program:

CREATE TABLE train_table AS SELECT * FROM origin_data WHERE date<'yestoday'
SELECT * FROM train_table TO TRAIN SomeModel ...
SELECT * FROM new_data_table WHERE date='today' TO PREDICT predict_result.class ...
DROP TABLE train_table;

There are cases that we need to invoke this SQL program with similar but different origin_data and predict_result, or invoke this SQL snippet multiple times in a SQL program.

Proposal

  1. Write any Python function:
    import sqlflow
    import feature_process_lib
    def sqlflow_run(self, tmp_table, arg1=1, arg2=2):
        feature_process_lib.process(table=tmp_table, transform="norm|hash")
        sqlflow.exec("CREATE TABLE train_table AS SELECT * FROM %s ..." % tmp_table)
        sqlflow.exec("SELECT ... TO TRAIN ...")
            ...

    To use the python function: SELECT * FROM my_raw_table TO RUN sqlflow_run WITH arg1=3, arg2=5

  2. Implement CREATE PROCEDURE or CREATE FUNCTION to define the SQL snnipet as a callable procedure or function, like: https://dev.mysql.com/doc/refman/8.0/en/create-procedure.html
  3. Be able to save and call a SQL snnipet as a template: ${CALL TEMPLATE MyPipeLine arg1 arg2 ...}, expand the template call to a SQL snnipt when executing the current SQL program. And, the template can be saved when editing the SQL program on Dataworks.
shendiaomo commented 4 years ago

I think it is great that SQLFlow supports stored procedures. But where would we store a "stored procedure"? Do we have to design an access control mechanism for it?

typhoonzero commented 4 years ago

Updated proposal one, and we decided to implement proposal one.

wangkuiyi commented 4 years ago

The Facts

  1. It is too challenging for SQLFlow to be able to encapsulate a SQL program into a SQL procedure. This would make the SQLFlow parser tremendously complex. We cannot afford the engineering cost.

  2. It is possible to encapsulate a complex process in a Python function and enable SQLFlow to call the Python function.

The Design

Consider the following Python function

def a_python_func(iter, param1, param2):
   for row in iter:
      print(row)

We hope that by packing it into a Docker image cxwangyi/procedures, we can call it from SQLFlow with the following new SQL syntax extension:

SELECT * FROM tbl 
   TO RUN cxwangyi/procedure:a_python_fuc 
   WITH param1=100, param2="hello";

The first parameter iter of a_python_func iterates rows returned by the query SELECT * FROM tbl. The rest parameters get their values from the WITH clause.