sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0

Rethink about SQLFlow #2319

Open wangkuiyi opened 4 years ago

wangkuiyi commented 4 years ago

A complete rethinking of a tool starts with the requirements of its users.

User Requirements

SQLFlow users want to

  1. execute a program in batch mode, and
  2. execute a program statement-by-statement in REPL or debug mode.

Compiler and Executor Are for Statements, Not Programs

The key capability is executing a single statement. To do that, we need

  1. a compiler: plan = compile(statement), and
  2. an executor: execute(plan).

To enable users to use the compiler and the executor, the simplest form of UI is a pair of command-line tools, so users can run

sqlflow_compiler options < a.sql > a.plan
sqlflow_executor options < a.plan

They could even chain the two commands with a pipe:

cat a.sql | sqlflow_compiler options | sqlflow_executor options

Indeed, the MySQL compiler also handles statements rather than programs.

Concurrent Execution of Statements

Yes, we can analyze the dependencies between the statements in a program and try to run independent statements in parallel. But this is not the job of the above compiler and executor; it is the job of a sqlflow_program_executor.

The trivial implementation could be a bash script.

# split a.sql into statements at ';' and run each through the pipeline
while IFS= read -r -d ';' statement; do
    echo "$statement" | sqlflow_compiler | sqlflow_executor
done < a.sql

SQLFlow Server Is Not a Must-Have

As a DBMS enhancement, SQLFlow needs to serve multiple users. This can be done by providing a server like the MySQL server, but a server is not strictly necessary. Users can run the above pipeline on their laptops, or in a container or a VM closer to the production environment. In all such cases, they can use SQLFlow without an SQLFlow server.

Connect to GUI

We support, or have considered supporting, the following GUI systems:

  1. Jupyter
  2. command-line REPL
  3. Alibaba DataWorks
  4. Visual Studio Code

All of them support running a program and running statements one-by-one.

To make them work with SQLFlow, we need to write some form of plugin for each of them. A plugin can do its work as long as it can run the above commands sqlflow_compiler and sqlflow_executor, as sketched below. No need for a server.
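For example, the core of such a plugin could be a small wrapper that shells out to the two proposed command-line tools. A minimal sketch in Python (compiler/executor options omitted; the tool names are the ones proposed above):

import subprocess

def run_sqlflow_statement(statement):
    # compile the statement into an execution plan
    plan = subprocess.run(
        ["sqlflow_compiler"], input=statement.encode(),
        capture_output=True, check=True).stdout
    # execute the plan and return whatever the executor prints
    result = subprocess.run(
        ["sqlflow_executor"], input=plan,
        capture_output=True, check=True).stdout
    return result.decode()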

typhoonzero commented 4 years ago

As described here: https://en.wikipedia.org/wiki/Client%E2%80%93server_model

Client–server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients.

Currently, SQLFlow does provide resources:

And SQLFlow links with other public/user-specific resources:

If the user tries to execute cat a.sql | sqlflow_compiler options | sqlflow_executor options locally, they must provide their own Kubernetes and OSS credentials in the sqlflow_executor options.

wangkuiyi commented 4 years ago

If the user tries to execute cat a.sql | sqlflow_compiler options | sqlflow_executor options locally, they must provide their own Kubernetes and OSS credentials in the sqlflow_executor options.

@typhoonzero Agreed. It seems that sqlflow_executor is the hub to connect to DBMSes, Kubernetes, storage services, and the Docker registry. Am I correct?

brightcoder01 commented 4 years ago

Besides the two command-line tools, we also need a Kubernetes cluster (or Minikube) with a workflow engine installed in it.

Let's take the following SQL statement for example:

SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN NORMALIZE(capital_gain), STANDARDIZE(age), EMBEDDING(hours_per_week, dimension=32)
LABEL label

It will be translated into a workflow containing three steps:

  1. Write the data from SELECT * FROM census_income into a temporary table.
  2. Do the auto analysis: generate and execute the analysis SQL (this may take a long time).
  3. Generate the transform code and submit the training job (long-running).

This workflow can't be executed inside the sqlflow_executor process; it's better to execute it with the workflow engine inside Kubernetes.
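A rough sketch of such a workflow, written as a Python data structure purely for illustration (the step names and fields are made up; a real implementation would emit Argo/Tekton YAML):

# Illustrative only: step names and fields are invented for this sketch.
workflow_steps = [
    # step 1: materialize the SELECT result into a temporary table
    {"name": "create-tmp-table",
     "sql": "CREATE TABLE tmp_tbl AS SELECT * FROM census_income"},
    # step 2: auto analysis -- generate and execute the analysis SQL (may be slow)
    {"name": "auto-analysis", "action": "analyze", "table": "tmp_tbl"},
    # step 3: generate the transform code and submit the training job (long-running)
    {"name": "submit-train", "action": "train",
     "estimator": "DNNClassifier", "source": "tmp_tbl"},
]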

typhoonzero commented 4 years ago

@typhoonzero Agreed. It seems that sqlflow_executor is the hub to connect to DBMSes, Kubernetes, storage services, and the Docker registry. Am I correct?

@wangkuiyi yes!

wangkuiyi commented 4 years ago

Besides the two command-line tools, we also need a Kubernetes cluster (or Minikube) with a workflow engine installed in it.

@brightcoder01 Yes. Or, more precisely, if sqlflow_compiler generates an Argo YAML file, then sqlflow_executor could simply call kubectl to run it.

Optionally, we can also use Python or another workflow execution engine.

We need solid logical steps to decide whether we should use Kubernetes/Argo or something else as the workflow execution engine.
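For example, a minimal sketch of that executor in Python, assuming the compiler has already written the workflow to a.yaml:

import subprocess
import sys

def execute_workflow(yaml_path):
    # kubectl submits the workflow resource; the workflow engine
    # (Argo or Tekton) installed in the cluster then runs the steps
    subprocess.run(["kubectl", "create", "-f", yaml_path], check=True)

if __name__ == "__main__":
    execute_workflow(sys.argv[1] if len(sys.argv) > 1 else "a.yaml")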

lhw362950217 commented 4 years ago

I agree that the core of SQLFlow may be composed of sqlflow_compiler, sqlflow_executor, and maybe other binary modules. These tools can be delivered to end users and executed in places we can't touch. However, if we want to collect global information such as usage data, automatic bug reports, and binary upgrade checks, we need another centralized server.

Another question: in workflow mode, do we call the tool that submits the Kubernetes jobs the sqlflow_executor, or do we call the binary that actually executes the SQL statement (in the Kubernetes cluster) the sqlflow_executor? Is there a missing concept, sqlflow_submitter, which manages submission and monitors execution, as below:

local mode: cat a.sql | sqlflow_compiler options | sqlflow_executor options
workflow mode: cat a.sql | sqlflow_compiler options | sqlflow_submitter options --not a pipe--> sqlflow_executor options

When I first looked into the workflow mode, I got confused, and even now I'm not clear about the above question.

wangkuiyi commented 4 years ago

I agree that the core of SQLFlow may be composed of sqlflow_compiler, sqlflow_executor, and maybe other binary modules. These tools can be delivered to end users and executed in places we can't touch. However, if we want to collect global information such as usage data, automatic bug reports, and binary upgrade checks, we need another centralized server.

@lhw362950217 Command-line tools can report user behaviors to an analytics service such as Google Analytics as well. And yes, we can build a server as a thin layer that calls the above two command-line tools, or the library beneath them. This server can log user behaviors for statistics.

When I first looked into the workflow mode, I got confused, and even now I'm not clear about the above question.

@lhw362950217 I am not sure whether we need both a Tekton mode and a local mode, or whether it is OK to have only the Tekton mode.


OK, @typhoonzero reminded me that we do need Kubernetes+Tekton as an immortal OS+executor. I agree with @typhoonzero.

shendiaomo commented 4 years ago

sqlflow_compiler and sqlflow_executor. No need for a server.

From a compiler-architecture point of view, I agree this idea will help us achieve a clearer architecture. Here's a rough conceptual correspondence:

Language   Compiler      Target Platform          Runtime
C++        g++           Linux/Windows            libstdc++.so/libgcc_s.so
Java       javac         JVM                      JRE
Python     interpreter   Python Virtual Machine   CPython/Jython/PyPy
SQLFlow?   sqlflowc      Kubernetes               Tekton/Argo/EDL and all other necessary dependencies

The idea of adapting SQLFlow to a standard compiler architecture has other benefits: such a system is much easier to incorporate into other systems. For example:

  1. alisa: we can deploy the SQLFlow compiler into alisa-like systems, just as alisa does with odpscmd, thus easily getting the necessary credentials.
  2. We don't have to maintain a server cluster anymore.

The only problem is that the dependencies of the SQLFlow compiler itself are a mess: it depends on kubectl, the JRE, Python, etc., and lots of configuration. We have to find a clean way to make the installation, upgrading, and deployment of the compiler easier before diving into development.

shendiaomo commented 4 years ago

P.S. In an initial version of the SQLFlow arXiv paper, I wrote, “We designed SQLFlow as a cross compiler with a Kubernetes-native runtime.” There's a question I couldn't figure out the answer to. The question was:

SQL is not a general-purpose language (it is actually a DSL), and it chose an architecture very different from the compiler architecture of general-purpose languages. SQLFlow is also a DSL; why should it have to follow a traditional compiler architecture?

I think we'd better get a clear answer to this question.

sneaxiy commented 4 years ago

sqlflow_compiler and sqlflow_executor. No need for a server.

I agree; the server is not strictly necessary. But we have mixed up the compiler and the executor in our implementation.

In the traditional compiler architecture, the compiler does the following steps:

Or, more precisely, if sqlflow_compiler generates an Argo YAML file, then sqlflow_executor could simply call kubectl to run it. Optionally, we can also use Python or other workflow execution engine. This only problem is that the dependency of the SQLFlow compiler itself is a mess: it depends on kubectl, JRE, python etc and lots of configures.

Generating Argo YAML, or using Python or another workflow execution engine, is similar to a compiler's lowering process. SQLFlow looks like a cross compiler when it generates Argo YAML. Cross compilation does not need to happen on the real target platform; that is, the SQLFlow compiler itself need not depend on kubectl, the JRE, or Python. And the SQLFlow executor should concentrate only on the running process itself, not on IR translation or lowering.

wangkuiyi commented 4 years ago

I agree; the server is not strictly necessary. But we have mixed up the compiler and the executor in our implementation.

@sneaxiy Good points. Let me provide more connecting information.

In my previous comments, I used sqlflow_executor as a general concept representing the capability of executing the compilation target program.

In subsequent discussions, @typhoonzero reminded us that we need an immortal executor, preferably Kubernetes+Tekton. It is like compiling a Go program into a binary and then running the binary in the hope that the Linux kernel doesn't crash -- the Linux kernel is the (hopefully) immortal executor.

When we use Kubernetes+Tekton as the executor, we don't really need a command-line program sqlflow_executor; instead, as you said, we need a submitter program, which is kubectl.

Generating Argo YAML, or using Python or another workflow execution engine, is similar to a compiler's lowering process.

Yes, SQLFlow is a cross compiler. It runs on Linux, which is a node OS, and the generated program runs on Kubernetes, which is a cluster OS.

wangkuiyi commented 4 years ago

Summarizing the above discussions:

We need an SQLFlow statement compiler, sqlflow_compiler, which can take the simplest form: a command-line tool that takes an SQLFlow statement and outputs a Tekton workflow as a YAML file.


Users want to use various REPL UIs, including the sqlflow REPL command-line tool, Jupyter Notebook, Alibaba DataWorks, and Visual Studio Code. To connect these UIs to SQLFlow, we need to write plugins for them. These plugins need to be able to execute statements or programs.


To split a program into statements, we need the Java gRPC parser server. It is not easy to deploy this server together with the plugins, so it would be easier to have an SQLFlow server container.


The SQLFlow server container contains:

The calling dependencies are as follows:

program executor -> Java gRPC parser server
sqlflow_compiler ---/
kubectl

In particular, the Java gRPC parser server might need to load proprietary Java jars, which cannot be distributed together with sqlflow_compiler.


Because the SQLFlow server container must provide a server, the program executor should be a gRPC server.
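A hedged sketch of the program executor's core loop (the gRPC wrapping is omitted, and split_statements is a placeholder for a call to the Java gRPC parser server):

import subprocess

def run_program(program, split_statements):
    # split_statements stands in for the Java gRPC parser server, which
    # splits a SQLFlow program into individual statements
    for statement in split_statements(program):
        # compile one statement into a Tekton workflow YAML
        yaml_doc = subprocess.run(
            ["sqlflow_compiler"], input=statement.encode(),
            capture_output=True, check=True).stdout
        # submit the workflow to Kubernetes; "-f -" reads the YAML from stdin
        subprocess.run(["kubectl", "create", "-f", "-"],
                       input=yaml_doc, check=True)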

wangkuiyi commented 4 years ago

The Python implementation provides a package

sqlflow

which includes the following functions

Each function call is equivalent to a SQL statement.


Consider a user program

import sqlflow
sqlflow.train("SELECT ...", ...)
sqlflow.predict("SELECT ...", ...)

We want sqlflow.train to work like sqlflow_compiler: generate a YAML file and call kubectl to execute it.
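A hedged sketch of what sqlflow.train could do internally (generate_train_yaml is a hypothetical stand-in for the compile step):

import subprocess
import tempfile

def generate_train_yaml(select_stmt, **params):
    # hypothetical: compiles one TO TRAIN statement into a Tekton workflow YAML string
    raise NotImplementedError

def train(select_stmt, **params):
    yaml_doc = generate_train_yaml(select_stmt, **params)
    # write the generated workflow to a temporary file and hand it to kubectl
    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
        f.write(yaml_doc)
        path = f.name
    subprocess.run(["kubectl", "create", "-f", path], check=True)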

Yancey1989 commented 4 years ago

sqlflow_compiler compiles SQL statements into a.plan:

import sqlflow.workflow as sqlflow
sqlflow.train("SELECT ...", ...)
sqlflow.predict("SELEC ...", ...)

sqlflow.train and sqlflow.predict each generate one or more step YAML snippets and fill in the workflow YAML file.

For each workflow step, it submits a SQL job or an AI job.

import sqlflow.step as sqlflow
sqlflow.query("CREATE TABLE tmp_tbl AS ...")

submits the SQL to the DBMS for execution,

or

import sqlflow.step as sqlflow
sqlflow.train("SELECT * FROM tmp_tbl ...")

generates the TensorFlow code and submits the training job to PAI.

typhoonzero commented 4 years ago

@Yancey1989 Maybe it's better to write:

from sqlflow import workflow
workflow.train("SELECT ...", ...)
workflow.predict("SELEC ...", ...)

and

from sqlflow import step
step.train("SELECT * FROM tmp_tbl ...")
brightcoder01 commented 4 years ago

Agree with @Yancey1989 and @typhoonzero. I think we may have two Python API sets.

One more question: can users use this Python API set directly, e.g., write Python programs with this API set in VS Code/DataWorks/Jupyter Notebook? If not, do we need to provide a third Python API set to be used directly by users?

  1. sqlflow.analysis: generates the analysis SQL and executes it to calculate the statistical results inside the step container;
  2. sqlflow.submit_train: generates the transform code using the analysis results and submits the training job to PAI/Kubernetes;
  3. sqlflow.derive_features: derives the features and the transformation logic according to the table schema (how many columns the table contains, and whether each column is numerical or categorical). For a categorical column, it does some auto analysis (such as calculating the distinct count) to decide whether to use VOCABULARIZE or HASH to convert the categorical column to ids; see the sketch after this list.
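A hedged sketch of item 3 above (the schema encoding and the vocabulary threshold are made up):

def derive_features(schema, distinct_counts, vocab_threshold=10000):
    # schema maps column name -> "numerical" or "categorical";
    # distinct_counts maps column name -> approximate distinct count
    plan = {}
    for column, kind in schema.items():
        if kind == "numerical":
            plan[column] = "STANDARDIZE"
        elif distinct_counts.get(column, 0) <= vocab_threshold:
            # small vocabulary: map each distinct value to an id directly
            plan[column] = "VOCABULARIZE"
        else:
            # large vocabulary: fall back to hashing
            plan[column] = "HASH"
    return plan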
typhoonzero commented 4 years ago

Note that the "compiler" compiles a SQL program into a workflow (Tekton YAML), and the "executor" executes the workflow.

The compiler consists of the parser, the IR resolver, and the workflow generator. Since the parser and the IR resolver are written in Go, it seems better to write the workflow generator in Go as well.

The executor consists of the workflow engine and a code generator that generates step Python code such as sqlflow.train.

Also, we'd like to make the SQLFlow server "stateless", so submitting the workflow YAML and fetching the status should be done in Go. So:

lhw362950217 commented 4 years ago

@shendiaomo @Yancey1989 @typhoonzero It seems that we need two levels of Python API. The high-level API talks to the SQLFlow server to generate an execution plan; it is just another client. The low-level API does the real execution of the SQL query or AI job. Should we do the compilation in the low level as before (in step)? I'd prefer not to; am I right?