wangkuiyi opened this issue 4 years ago
As described here: https://en.wikipedia.org/wiki/Client%E2%80%93server_model
> The client–server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients.

Currently, SQLFlow provides the following resources:
And SQLFlow links to other public or user-specific resources:
If the user tries to execute `cat a.sql | sqlflow_compiler options | sqlflow_executor options` locally, they must provide their own Kubernetes and OSS credentials in `sqlflow_executor options`.
> If the user tries to execute `cat a.sql | sqlflow_compiler options | sqlflow_executor options` locally, they must provide their own Kubernetes and OSS credentials in `sqlflow_executor options`.
@typhoonzero Agreed. It seems that `sqlflow_executor` is the hub to connect to DBMSes, Kubernetes, storage services, and the Docker registry. Am I correct?
Besides two command-line tools, we also need a Kubernetes cluster (or Minikube) and a workflow engine installed in it.
Let's take the following SQL statement as an example:

```sql
SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN NORMALIZE(capital_gain), STANDARDIZE(age), EMBEDDING(hours_per_week, dimension=32)
LABEL label
```

It will be translated into a workflow containing three steps:

1. Run `SELECT * FROM census_income` into a temporary table.
2. Train `DNNClassifier` on the temporary table.
3. Clean up the temporary table.

This workflow can't be executed in the process of `sqlflow_executor`; it's better to execute it in the workflow engine inside Kubernetes.
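For illustration only, the sketch below dumps such a three-step workflow as a Tekton-style Pipeline; the step image, the `sqlflow_step` runner command, and the temporary-table name are assumptions of this sketch, not SQLFlow's actual output format.

```python
# A hypothetical illustration, not SQLFlow's real code generator: build the
# three steps as plain dictionaries and dump them as a Tekton-style Pipeline.
import yaml  # pip install pyyaml

TMP_TABLE = "tmp_census_income"  # made-up temporary table name
steps = [
    ("create-tmp-table", f"CREATE TABLE {TMP_TABLE} AS SELECT * FROM census_income"),
    ("train", f"SELECT * FROM {TMP_TABLE} TO TRAIN DNNClassifier ..."),
    ("drop-tmp-table", f"DROP TABLE {TMP_TABLE}"),
]

workflow = {
    "apiVersion": "tekton.dev/v1beta1",
    "kind": "Pipeline",
    "metadata": {"name": "sqlflow-train-census-income"},
    "spec": {"tasks": [{
        "name": name,
        "taskSpec": {"steps": [{
            "name": "run",
            "image": "sqlflow/step:latest",  # hypothetical step image
            "command": ["sqlflow_step"],     # hypothetical step runner
            "args": [sql],
        }]},
    } for name, sql in steps]},
}

print(yaml.dump(workflow, sort_keys=False))
```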
> @typhoonzero Agreed. It seems that `sqlflow_executor` is the hub to connect to DBMSes, Kubernetes, storage services, and the Docker registry. Am I correct?
@wangkuiyi yes!
> Besides two command-line tools, we also need a Kubernetes cluster (or Minikube) and a workflow engine installed in it.
@brightcoder01 Yes. Or, more precisely, if `sqlflow_compiler` generates an Argo YAML file, then `sqlflow_executor` could simply call `kubectl` to run it. Optionally, we can also use Python or another workflow execution engine.

We need solid logical steps to derive whether we should use Kubernetes/Argo or something else as the workflow execution engine.
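Assuming the compiler has already written the workflow to a file, a minimal sketch of such an executor, not SQLFlow's real implementation, is little more than a wrapper around `kubectl`:

```python
# A minimal, hypothetical sqlflow_executor: it assumes sqlflow_compiler has
# already written an Argo/Tekton workflow to a YAML file and simply hands
# that file to kubectl.
import subprocess
import sys

def execute(workflow_yaml: str) -> None:
    # `kubectl create -f <file>` submits the workflow resource to the cluster.
    subprocess.run(["kubectl", "create", "-f", workflow_yaml], check=True)

if __name__ == "__main__":
    execute(sys.argv[1] if len(sys.argv) > 1 else "workflow.yaml")
```

The appeal of this split is that scheduling, retries, and fault tolerance stay in Kubernetes and the workflow engine, so the executor process itself can remain trivial.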
I agree that the core of SQLFlow may be composed of `sqlflow_compiler`, `sqlflow_executor`, and maybe other binary modules. These tools can be delivered to end users and executed in places we can't touch. However, if we want to collect global information such as usage data, automatic bug reports, and binary upgrade checks, we need another centralized server.
Another question is: in workflow mode, do we call the tool that submits the Kubernetes jobs the `sqlflow_executor`, or the binary that really executes the SQL statement (in the Kubernetes cluster) the `sqlflow_executor`? Is there a missing concept, `sqlflow_submitter`, which manages the submitting and monitors the execution, like below?

- local mode: `cat a.sql | sqlflow_compiler options | sqlflow_executor options`
- workflow mode: `cat a.sql | sqlflow_compiler options | sqlflow_submitter options --not a pipe--> sqlflow_executor options`
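Purely as an illustration of the `sqlflow_submitter` concept raised here, the sketch below submits a compiled workflow and then polls its status with `kubectl`; the `-o name` resource handle and the `.status.phase` field follow Argo's conventions and are assumptions of this sketch, not an existing SQLFlow tool:

```python
# Hypothetical sqlflow_submitter: submit the compiled workflow, then monitor
# it until it reaches a terminal phase. Everything here is illustrative.
import subprocess
import time

def submit(workflow_yaml: str) -> str:
    # -o name makes kubectl print the created resource handle, e.g.
    # "workflow.argoproj.io/sqlflow-abcde", which we reuse for polling.
    out = subprocess.run(
        ["kubectl", "create", "-f", workflow_yaml, "-o", "name"],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

def monitor(resource: str, interval: float = 5.0) -> str:
    while True:
        phase = subprocess.run(
            ["kubectl", "get", resource, "-o", "jsonpath={.status.phase}"],
            check=True, capture_output=True, text=True).stdout.strip()
        if phase in ("Succeeded", "Failed", "Error"):
            return phase
        time.sleep(interval)

if __name__ == "__main__":
    print(monitor(submit("workflow.yaml")))
```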
When I first looked into the workflow mode, I got confused. Even now I'm not clear about the above question.
> I agree that the core of SQLFlow may be composed of `sqlflow_compiler`, `sqlflow_executor`, and maybe other binary modules. These tools can be delivered to end users and executed in places we can't touch. However, if we want to collect global information such as usage data, automatic bug reports, and binary upgrade checks, we need another centralized server.
@lhw362950217 command-line tools can send user behaviors to a statistical server such as Google Analytics as well. And yes, we can build a server as a thin layer, which calls the above two command-line tools or the library beneath them. This server can log user behaviors for statistics.
> When I first looked into the workflow mode, I got confused. Even now I'm not clear about the above question.
@lhw362950217 I am not sure if we need both a Tekton mode and a local mode. Or is it OK that we only have the Tekton mode?

OK, @typhoonzero reminded me that we do need Kubernetes+Tekton as an immortal OS+executor. I agree with @typhoonzero.
> `sqlflow_compiler` and `sqlflow_executor`. No need for a server.
From a compiler architecture's point of view, I agree this idea will help us achieve a clearer architecture. Here's a rough conceptual correspondence:
| Language | Compiler | Target Platform | Runtime |
|---|---|---|---|
| C++ | g++ | Linux/Windows | libstdc++.so/libgcc_s.so |
| Java | javac | JVM | JRE |
| Python | interpreter | Python Virtual Machine | CPython/Jython/PyPy |
| SQLFlow? | sqlflowc | Kubernetes | Tekton/Argo/EDL and all other necessary dependencies |
The idea of adapting SQLFlow to a standard compiler architecture has other benefits: such a system is much easier to incorporate into other systems. For example:

- `alisa`: we can deploy the SQLFlow compiler into `alisa`-like systems the way `alisa` does with `odpscmd`, thus easily getting the necessary credentials.

The only problem is that the dependency of the SQLFlow compiler itself is a mess: it depends on `kubectl`, `JRE`, `python`, etc., and lots of configuration. We have to find a clear way to make the installation, upgrading, and deployment of the compiler easier before diving into development.
P.S. In an initial version of the SQLFlow arXiv paper, I wrote "We designed SQLFlow as a cross compiler with a Kubernetes-native runtime", but there was a question I couldn't figure out the answer to. The question was:

> SQL is not a general-purpose language (it is actually a DSL), and it chose a very different architecture from the compiler architecture of general-purpose languages. SQLFlow is also a DSL, so why does it have to follow a traditional compiler architecture?

I think we'd better get a clear answer to this question.
> `sqlflow_compiler` and `sqlflow_executor`. No need for a server.
I agree. The server is not strictly necessary. But we have mixed up the compiler and the executor in our implementation.
In the traditional compiler architecture, the compiler does the following steps: parsing, semantic analysis, IR generation, optimization/lowering, and target code generation.
> Or, more precisely, if `sqlflow_compiler` generates an Argo YAML file, then `sqlflow_executor` could simply call `kubectl` to run it. Optionally, we can also use Python or another workflow execution engine. The only problem is that the dependency of the SQLFlow compiler itself is a mess: it depends on `kubectl`, `JRE`, `python`, etc., and lots of configuration.
Generating Argo YAML, or using Python or another workflow execution engine, is similar to the lowering process of a compiler. SQLFlow looks like a cross compiler when it emits Argo YAML. Cross compilation does not need to run on the real target platform; that is, the SQLFlow compiler itself need not depend on `kubectl`, `JRE`, or `python`. And the SQLFlow executor should concentrate only on the running process itself, not on IR translation or lowering.
> I agree. The server is not strictly necessary. But we have mixed up the compiler and the executor in our implementation.
@sneaxiy Good points. Let me provide more connecting information.
In my previous comments, I used `sqlflow_executor` as a general concept representing the capability of executing the compilation target program.

In successive discussions, @typhoonzero reminded us that we need an immortal executor, which is preferably Kubernetes+Tekton. It is like after compiling a Go program into a binary, we run the binary in the hope that the Linux kernel doesn't crash -- the Linux kernel is the (hopefully) immortal executor.

When we use Kubernetes+Tekton as the executor, we don't really need a command-line program `sqlflow_executor`; instead, as you said, we need a submitter program, which is `kubectl`.
> Generating Argo YAML, or using Python or another workflow execution engine, is similar to the lowering process of compilers.
Yes, SQLFlow is a cross compiler. It runs on Linux, which is a node OS, and the generated program runs on Kubernetes, which is a cluster OS.
Summarizing the above discussions:
- We need an SQLFlow statement compiler, `sqlflow_compiler`, which can take the simplest form -- a command-line tool that takes an SQLFlow statement and outputs a Tekton workflow in a YAML file.
- We rely on Tekton and Kubernetes because we need an immortal executor. See https://github.com/sql-machine-learning/sqlflow/issues/2319#issuecomment-631899713 for details.
- We can use `kubectl` as the `sqlflow_executor` command-line tool mentioned in https://github.com/sql-machine-learning/sqlflow/issues/2319#issue-622153226.
Users want to use various REPL UIs, including the `sqlflow` REPL command-line tool, Jupyter Notebook, Alibaba DataWorks, and Visual Studio Code. To connect these UIs to SQLFlow, we need to write plugins for them. These plugins need to be able to execute statements or programs.
To split a program into statements, we need the Java gRPC parser server. It is not easy to deploy the server together with the plugins, so it would be easier to have a SQLFlow server container.
The SQLFlow server container contains:

- a `program executor`, which splits a program into statements by calling the Java gRPC parser server,
- `sqlflow_compiler`, and
- `kubectl`.
The calling dependencies are as follows:

```
program executor -> Java gRPC parser server
sqlflow_compiler ---/
kubectl
```
In particular, the Java gRPC parser server might need to load proprietary Java JARs, which cannot be distributed together with `sqlflow_compiler`.
Because the SQLFlow server container must provide a server, the `program executor` should be a gRPC server.
The Python implementation provides a package `sqlflow`, which includes the following functions:

- `sqlflow.train`
- `sqlflow.predict`
- `sqlflow.evaluate`
- `sqlflow.explain`
- `sqlflow.optimize`
- `sqlflow.run`

Each function call corresponds to a SQL statement.
Consider a user program:

```python
import sqlflow
sqlflow.train("SELECT ...", ...)
sqlflow.predict("SELECT ...", ...)
```

We want `sqlflow.train` to work like `sqlflow_compiler`: generate a YAML file and call `kubectl` to execute this YAML.
`sqlflow_compiler` compiles SQL statements to `a.plan`:

```python
import sqlflow.workflow as sqlflow
sqlflow.train("SELECT ...", ...)
sqlflow.predict("SELECT ...", ...)
```

`sqlflow.train` and `sqlflow.predict` generate one or more step YAML snippets and fill in the workflow YAML file.

For each workflow step, it submits a SQL job or an AI job:

```python
import sqlflow.step as sqlflow
sqlflow.query("CREATE TABLE tmp_tbl AS ...")
```

which submits the SQL to the DBMS for execution, or

```python
import sqlflow.step as sqlflow
sqlflow.train("SELECT * FROM tmp_tbl ...")
```

which generates the TensorFlow code and submits it to PAI.
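To make the two API levels concrete, here is a minimal hypothetical sketch; the flat function names, the in-memory `_steps` list, and the DB-API connection argument are assumptions for illustration, not SQLFlow's actual package layout:

```python
# Purely illustrative internals of the two API levels described above.
import yaml  # pip install pyyaml

# --- workflow level: calls only *record* steps -------------------------------
_steps = []

def workflow_train(select, **model_attrs):
    # Append a step whose command re-enters the step-level API inside the
    # step container; nothing is executed here.
    _steps.append(["python", "-c",
                   f"from sqlflow import step; step.train({select!r})"])

def workflow_submit(path="workflow.yaml"):
    # Render the recorded steps into a simplified workflow file. A real
    # implementation would emit a Tekton/Argo resource and hand it to
    # kubectl, as sketched earlier in this thread.
    with open(path, "w") as f:
        yaml.dump({"steps": [{"command": c} for c in _steps]}, f)

# --- step level: calls really execute ----------------------------------------
def step_query(conn, statement):
    # Run one SQL statement on an already-open DB-API connection; the
    # connection argument is an assumption of this sketch.
    cur = conn.cursor()
    cur.execute(statement)  # e.g. "CREATE TABLE tmp_tbl AS ..."
    conn.commit()
    cur.close()
```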
@Yancey1989 maybe it would be better to write:

```python
from sqlflow import workflow
workflow.train("SELECT ...", ...)
workflow.predict("SELECT ...", ...)
```

and

```python
from sqlflow import step
step.train("SELECT * FROM tmp_tbl ...")
```
Agree with @Yancey1989 @typhoonzero. I think we may have two Python API sets.

The first set contains `sqlflow.train`, `sqlflow.predict`, etc. Add one question: can users use this Python API set directly, for example by writing Python programs with this API set in VSCode/DataWorks/Jupyter Notebook? If not, do we need to provide a third Python API set to be used directly by users?

The second set contains:

- `sqlflow.analysis`: generate some analysis SQL and execute it to calculate the statistical results inside the step container;
- `sqlflow.submit_train`: generate the transform code with the analysis result and submit the training job to PAI/Kubernetes;
- `sqlflow.derive_features`: derive the features and the transformation logic according to the table schema (how many columns the table contains, and whether each column is numerical or categorical). For a categorical column, do some automatic analysis (such as calculating the distinct count) to decide whether to use `VOCABULARIZE` or `HASH` to convert the categorical column to an id.

Note that the "compiler" means compiling a SQL program into a workflow (Tekton YAML), and the "executor" is used to execute the workflow.
The compiler consists of the parser, the IR resolver, and the workflow generator. Since the parser and the IR resolver are written in Go, the workflow generator seems better written in Go too.

The executor consists of the workflow engine and a code generator that generates step Python code like `sqlflow.train`.

And we'd like to make the SQLFlow server "stateless", so submitting the workflow YAML and fetching the status should be done in Go. So:
@shendiaomo @Yancey1989 @typhoonzero It seems that we need two levels of Python API. The high-level API talks to the SQLFlow server to generate an execution plan; it is just another client. The low-level API does the real execution of the SQL query or AI job. Should we do the compilation in the low level as before (in the step)? I'd prefer not to; am I right?
A complete rethinking of a tool starts with the requirements of its users.
### User Requirements
SQLFlow users want
### Compiler and Executor Are for Statements, Not Programs
The key is to execute the statement. To do that, we need

- `plan = compile(statement)`, and
- `execute(plan)`.

To enable users to use the compiler and the executor, the simplest form of UI is command-line tools, so users can run the two commands one after another. They could even run the two commands with a pipe.

Indeed, the MySQL compiler also handles statements rather than programs.
### Concurrent Execution of Statements
Yes, we can analyze dependencies between statements in a program and try to run statements in parallel. But this is not the work of the above compiler and executor; it is the work of a `sqlflow_program_executor`. The trivial implementation could be a bash script.
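The comment above suggests a bash script; the sketch below expresses the same trivial program executor in Python instead, assuming hypothetical `sqlflow_compiler` and `sqlflow_executor` binaries on the `PATH` and a naive split on semicolons:

```python
# A trivial, hypothetical sqlflow_program_executor: split a program into
# statements and run compile-then-execute for each one, sequentially.
# Splitting on ";" is naive (it ignores string literals and comments); a
# real implementation would call the parser instead.
import subprocess
import sys

def run_program(path: str) -> None:
    with open(path) as f:
        program = f.read()
    for statement in filter(None, (s.strip() for s in program.split(";"))):
        compiled = subprocess.run(
            ["sqlflow_compiler"], input=statement,
            capture_output=True, text=True, check=True)
        subprocess.run(
            ["sqlflow_executor"], input=compiled.stdout,
            text=True, check=True)

if __name__ == "__main__":
    run_program(sys.argv[1])  # e.g. python program_executor.py a.sql
```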
### SQLFlow Server Is Not a Must-Have
As a DBMS enhancement, SQLFlow needs to serve multiple users. This can be done by providing a server like the MySQL server, but a server is not necessary. Users can run the above pipe of commands on their laptops, or in a container or a VM closer to the production environment. In all such cases, they can use SQLFlow without an SQLFlow server.
### Connect to GUI
We support, or have considered supporting, the following GUI systems:

All of them support running a program and running statements one-by-one.

To make them work with SQLFlow, we need to write some form of plugin for each of them. The plugin can do its work as long as it can run the above commands `sqlflow_compiler` and `sqlflow_executor`. No need for a server.