Yancey1989 opened this issue 4 years ago
Is SQLFlow a compiler or an interpreter? It doesn't make sense for it to be both.
We don't have, and probably will never have, two layers of graphs. We don't have any graph in our system. The top level is a workflow represented by YAML. A workflow is not a graph: an Argo/Tekton workflow can have conditionals, loops, and even function definitions and function calls, whereas a graph cannot have these. The lower level is a Python program, which is not a graph either. A Python program can have all kinds of control flow, but a graph cannot.
It is disappointing to see our team members still sticking to the simplistic idea of a "graph", which TensorFlow's early versions used as a very unprofessional form of IR, especially those who experienced PaddlePaddle, which tried so hard to propose an IR much more powerful than a graph. All along, innovators like Chris Lattner have been introducing professional forms of IR into TensorFlow, but sadly people cannot see those efforts.
After several structure-adjustment PRs (#2481, #2484, #2491, #2500, #2502, #2505), the current package structure has become:
pkg
├── attribute # semantics
├── codegen # codegen
├── database # basic lib
├── executor # step execution
├── ir # intermediate representation
├── log # basic lib
├── model # basic lib
├── modelzooserver # server
├── parser # syntax
├── pipe # basic lib
├── proto # protobuf definitions
├── sql # step execution
├── sqlflowserver # server
├── sqlfs # basic lib
├── step # step execution
├── tar # basic lib
├── test # basic lib
├── verifier # semantics
└── workflow # workflow execution
There are still several problems:
1. Put `attribute` and `verifier` in a `semantics` package, put all basic libraries in a `basic` package, and put `sqlflowserver` and `modelzooserver` in a `server` package.
2. `executor` generates code for step execution and then executes that code. We should decouple the code generation phase from the execution phase and put the decoupled code in `codegen` and `step` respectively. Similarly, because the `sql` package calls `executor` for step execution, the files in `sql` should be put in `step`.

After this stage, the package structure should be:
pkg
├── basic
│ ├── database
│ ├── log
│ ├── model
│ ├── pipe
│ ├── sqlfs
│ ├── tar
│ └── test
├── codegen
│ ├── alisa.go
│ ├── pai.go
│ ├── ...
│ └── couler
├── ir
├── parser
├── proto
├── semantics
│ ├── attribute
│ └── verifier
├── server
│ ├── modelzooserver
│ └── sqlflowserver
└── execution
├── step
│ └── executor.go
└── workflow
3. We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the 2nd pass is in step execution, which uses `step -e` to generate and execute the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the only pass generates the YAML and all the scripts, and the scripts are in a directory to be used as Argo input artifacts (a sketch of this output layout follows).

After this phase, we don't need the `pkg/step` and `cmd/step` anymore.
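A sketch of that one-pass output layout, with hypothetical file names: the compiler writes the workflow YAML and every step script into one directory in a single pass, and Argo mounts the directory as input artifacts.

```python
import os

# Hypothetical sketch of the one-pass compiler's output stage; the
# file names and the emit() helper are illustrations, not SQLFlow APIs.
def emit(out_dir, workflow_yaml, step_scripts):
    """Write workflow.yaml plus one .py script per step into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "workflow.yaml"), "w") as f:
        f.write(workflow_yaml)  # refers to the scripts by relative path
    for name, source in step_scripts.items():
        with open(os.path.join(out_dir, name), "w") as f:
            f.write(source)
```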
> 3. We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the 2nd pass is in step execution, which uses `step -e` to generate and execute the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the only pass generates the YAML and all the scripts, and the scripts are in a directory to be used as Argo input artifacts.
In the current architecture, we always run `step -e {sql_statement}` in each step. This brings the limitation that one SQL statement is mapped to exactly one step. Also, the step binary does the parsing and IR-building work again in the step image, which duplicates work.

In the future, one SQL statement could be translated into several steps, so it would be better if, after translating the SQL program into a workflow, we could see explicitly what each step executes, such as Data Analysis, Data Exploration, or Model Training, instead of executing the general command `step -e`. A sketch of this idea follows.
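A minimal sketch of this multi-step expansion, where the step names, script names, and the `compile_train_statement` helper are all made up for illustration:

```python
# Hypothetical sketch: expand one TRAIN statement into several named
# workflow steps instead of a single generic `step -e` call.
def compile_train_statement(train_ir):
    """Return the list of workflow steps for one TRAIN statement."""
    return [
        {"name": "data-analysis",    "command": ["python", "analysis.py"]},
        {"name": "data-exploration", "command": ["python", "explore.py"]},
        {"name": "model-training",   "command": ["python", "train.py"]},
    ]
```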
> After this phase, we don't need the pkg/step and cmd/step anymore.

So we don't need the `pkg/execution/step` folder?
No, we don't. We only have to move something like `table_writer` to the `basic` package.
I agree that there is a two-layer architecture in the current code base, and that makes SQLFlow unclear.
- In the 1st layer, SQLFlow translates a SQL program into a workflow, which is a YAML file; the Argo controller is the executor that runs this workflow. -- SQLFlow is a compiler.
- In the 2nd layer, each workflow step executes a SQL statement using the SQLFlow step command-line, which translates the SQL statement into a Python script and executes it. -- The SQLFlow step command-line is much like an interpreter.

To make it clearer, I think we can keep the two-layer architecture and make SQLFlow a pure compiler:
- In the 1st layer, SQLFlow generates a workflow; each workflow step includes an entry point program, which is a Python program.
- In the 2nd layer, each workflow step executes this Python script using the Python interpreter. A sketch of such an entry point appears below.
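A minimal sketch of what a compiler-generated step entry point could look like under this proposal; the `sqlflow.runtime` layout follows the structure proposed later in this thread, and the argument names are assumptions:

```python
# train_step.py -- generated by the SQLFlow compiler and executed by
# the plain Python interpreter inside the workflow step container.
from sqlflow.runtime import tensorflow  # proposed runtime package

tensorflow.train(
    datasource="mysql://...",            # placeholder data source
    select="SELECT * FROM train_table",  # the statement's SELECT clause
    model="DNNClassifier",               # model named in the TRAIN clause
)
```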
In a discussion with @Yancey1989, we found that we still have to implement a feature derivation mechanism in Python, like the previous migration prototype, to make the proposed structure workable. The problem is that we have to first generate a `.yaml` in `sqlflowserver` and secondly generate the `.py` scripts in the `step` binary.
Since SQLFlow is a compiler, which doesn't care about execution, it seems that we should have a command-line compiler. What would be a good name for the compiler's binary file?
> As a result, we have to first generate a .yaml in sqlflowserver and secondly generate the .py scripts in the step binary.
I think we can generate a .yaml file and, for each step, call the tensorflow/xgboost/pai code generator to generate a submitter program entrypoint, which is a .py script. The following .yaml snippet is a very simple example:

```yaml
steps:
- name: step-1
  args: ["python", "-c"]
  command: |
    from sqlflow.runtime import tensorflow
    tensorflow.train(....)
```
That `tensorflow.train` calls feature derivation and the verifier, and then trains the TensorFlow model. We can also separate feature derivation and the verifier into separate steps to decouple the workflow step logic. A structural sketch of the coupled form follows.
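A structural sketch of such a coupled `tensorflow.train`, assuming the `feature_derivation` and `verifier` modules proposed for the `runtime` package in this thread (they are not existing APIs); the model-building details are elided:

```python
# Hypothetical sqlflow.runtime.tensorflow.train; the imported modules
# are the proposed runtime packages, not existing APIs.
def train(datasource, select, model, attributes=None):
    from sqlflow.runtime import feature_derivation, verifier

    # 1. Derive feature columns from the schema of the SELECT result.
    columns = feature_derivation.derive(datasource, select)
    # 2. Verify the derived columns against the user-given attributes.
    verifier.verify(datasource, select, columns, attributes)
    # 3. Build and train the TensorFlow model (details elided).
    ...
```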
> Since SQLFlow is a compiler, which doesn't care about execution, it seems that we should have a command-line compiler. What would be a good name for the compiler's binary file?
@wangkuiyi `sqlflow` is good.
> That tensorflow.train calls feature derivation and the verifier, and then trains the TensorFlow model.
I am afraid that if we go this way, we would move a lot of Go code into Python. In this way, the sqlflowserver may only do the following things:
- generate the `tensorflow.train/predict/evaluate/...` calls

All other things would be done in Python. Python code may be less maintainable than Go code.
The updated code structure is as follows, based on https://github.com/sql-machine-learning/sqlflow/issues/2494#issuecomment-647692915, with these changes:
- add a `runtime` Python package.
- remove the `basic` top folder.
- move the Go code into a `go` folder.

|-go
| |--cmd
| | |--sqlflowserver // SQLFlow gRPC server
| | |--modelzooserver // SQLFlow Model Zoo gRPC server
| | `--sqlflow // SQLFlow command-line tool
| |--pkg
| | |--ir
| | |--parser
| | |--log
| | |--model
| | |--pipe
| | |--sqlfs
| | |--tar
| | |--test
| | |--codegen
| | | |--pai
| | | | |--tensorflow
| | | | |--xgboost
| | | | |--kmeans
| | | |--alisa
| | | |--tensorflow
| | | |--couler
| | | `--xgboost
| | |--server // SQLFlow server interface implementation
| | | |--proto
| | | |--run
| | | `--fetch
| | |--modelzoo
| | |--executor
| | | |--argo // Argo is workflow executor
| | | `--python // Python is workflow step executor
|-python
| |--sqlflow.runtime
| | |--pai
| | | `--tensorflow/xgboost/shap
| | |--alisa
| | |--tensorflow
| | |--xgboost
| | |--feature_derivation
| | |--verifier
| | `--db_writer
| `--couler
`-java
The following tasks should be done:
- Move the Go code into the `go` folder.
- … to the `runtime` Python package.
- … to the `runtime` Python package.
- Move `pai/alisaSubmitter` to the `runtime` Python package and only keep Python as the workflow step interpreter.

> Python codes may be less maintainable than Go codes.
@sneaxiy I agree with that. In my opinion, we can keep the Go packages `feature_derivation`/`verifier`/`sqlfs` and export them as a Python API, so that we can call them from the Python `runtime` package. What do you think? One possible shape of such an export is sketched below.
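One way to export Go packages to Python is to compile them into a C shared library (`go build -buildmode=c-shared`) and wrap the exported symbols with ctypes. This is only a sketch under that assumption; the library name, the `DeriveFeatures` symbol, and the JSON contract are all made up:

```python
import ctypes
import json

# Assumption: the Go side marks a function with cgo's //export, e.g.
#   //export DeriveFeatures
# and is built with `go build -buildmode=c-shared -o libsqlflow.so`.
lib = ctypes.CDLL("./libsqlflow.so")
lib.DeriveFeatures.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
lib.DeriveFeatures.restype = ctypes.c_char_p

def derive_features(datasource, select):
    """Call the Go feature derivation code from the Python runtime."""
    out = lib.DeriveFeatures(datasource.encode(), select.encode())
    return json.loads(out.decode())
```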
> @sneaxiy I agree with that. In my opinion, we can keep the Go packages feature_derivation/verifier/sqlfs and export them as a Python API, so that we can call them from the Python runtime package. What do you think?
@Yancey1989 It may be more complex. Let us find ways to make the Python code more maintainable, such as improving code coverage.
TODOs for the SQLFlow compiler refactor:
- Move `feature_derivation` to the `runtime` Python package.
- Split `verifier` into two parts.
- Add `codegen/pai` to generate the PAI submitter program.
- Add `codegen/alisa` to generate the Alisa submitter program.
- Port the `response` Go package to Python.

There are two main problems from the above plans:
- Supply a Python `db-api` to access `alisa` (see the sketch after this list). @Yancey1989
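A PEP 249-style sketch of the `db-api` item above; every name is an assumption about the module's shape, and the alisa client itself is elided:

```python
# Hypothetical sqlflow/db/alisa.py following the PEP 249 interface.
apilevel = "2.0"
threadsafety = 1
paramstyle = "pyformat"

class Cursor:
    def __init__(self, client):
        self._client = client
        self._rows = []

    def execute(self, operation, parameters=None):
        # Submit the SQL to the alisa service and buffer the result rows.
        self._rows = self._client.run_sql(operation, parameters or {})

    def fetchall(self):
        return self._rows

class Connection:
    def __init__(self, client):
        self._client = client

    def cursor(self):
        return Cursor(self._client)

    def close(self):
        self._client = None

def connect(dsn):
    # Parsing the DSN and constructing the alisa client are elided.
    raise NotImplementedError("construct the alisa client from the DSN")
```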
> - Add codegen/pai to generate the PAI submitter program.
During workflow compilation, for the TO RUN statement, we will have the flexibility to generate a different command-line call for the step according to the deployment platform and the execution program. We should upgrade the compiler to generate the appropriate code in the step according to these two (or more) variables. A sketch of that dispatch follows.
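A minimal sketch of that dispatch; the platform names and command lines are placeholders, not SQLFlow's actual values:

```python
# Hypothetical dispatch table from (deployment platform, execution
# program) to the command line emitted for a TO RUN step.
STEP_COMMANDS = {
    ("kubernetes", "python"): ["python", "/opt/sqlflow/run.py"],
    ("kubernetes", "binary"): ["/opt/sqlflow/run"],
    ("pai",        "python"): ["python", "/opt/sqlflow/pai_submit.py"],
}

def step_command(platform, program):
    """Return the command line for one TO RUN step."""
    try:
        return list(STEP_COMMANDS[(platform, program)])
    except KeyError:
        raise ValueError("unsupported combination: %s/%s" % (platform, program))
```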
Per discussions in the video meeting with @typhoonzero and @shendiaomo:
Two Components in SQLFlow
- Compiler
- Interpreter

The SQLFlow compiler generates a two-layer graph, and two kinds of interpreters execute each layer.
The Hoped Code Structure

Incomplete Thinking About the Final System

Insufficiencies of the Current System
- The workflow graph loses much detailed information. We hope SQLFlow generates a more detailed graph; for example, if the graph can describe a group of TensorFlow ops running on CPU/GPU, we can optimize the AI pipeline's throughput.
- The workflow cannot get the best throughput. A streaming graph can get better throughput; for example, we can write a custom TensorFlow op to read the data generated by the SELECT clause in a streaming fashion, instead of creating a temporary table. A sketch of this streaming idea follows.
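A sketch of the streaming idea with `tf.data.Dataset.from_generator`; the `sqlflow.db.connect` helper and the DSN are assumptions for illustration, and any PEP 249 connection would work the same way:

```python
import tensorflow as tf

def rows_from_select(dsn, select):
    # Hypothetical helper: stream rows of the SELECT result directly
    # from a database cursor, with no temporary table in between.
    import sqlflow.db  # assumed PEP 249-style module
    cursor = sqlflow.db.connect(dsn).cursor()
    cursor.execute(select)
    row = cursor.fetchone()
    while row is not None:
        yield row  # one (x, y) row at a time
        row = cursor.fetchone()

dataset = tf.data.Dataset.from_generator(
    lambda: rows_from_select("mysql://...", "SELECT x, y FROM train_table"),
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.float32),  # feature x
        tf.TensorSpec(shape=(), dtype=tf.float32),  # label y
    ),
)
```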
Code Structure

Something that needs to be discussed next: