sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0

Rethinking the code structure #2494

Open Yancey1989 opened 4 years ago

Yancey1989 commented 4 years ago

Per the discussions in the video meeting with @typhoonzero @shendiaomo:

Two Components in SQLFlow

Compiler

  1. Frontend: the parser package parses SQL statement(s) and generates IR(s).
  2. Semantic analysis (runtime), which includes feature derivation, the verifier, the attribute filler and checker, and model reloading.
  3. Optimizer (static), which analyzes the dependencies among SQL statements and generates a parallel execution plan.
  4. Backend: various code generators that produce a YAML (Argo workflow) file, an AI program (TF/XGBoost), or an optimization program (OptFlow). A rough sketch of this pipeline follows the list.
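
To make the four stages concrete, here is a minimal, hypothetical sketch of how they could chain together; the function names below are illustrative stubs, not part of the SQLFlow code base:

# A hypothetical sketch of the four compiler stages; these functions are
# illustrative stubs, not real SQLFlow APIs.
from typing import Dict, List


def parse(sql_program: str) -> List[str]:
    """Frontend: split the SQL program into statements (stand-ins for real IRs)."""
    return [s.strip() for s in sql_program.split(";") if s.strip()]


def analyze(ir: str) -> str:
    """Semantic analysis: feature derivation, verification, attribute checking (stubbed)."""
    return ir


def optimize(irs: List[str]) -> List[List[str]]:
    """Optimizer: group statements into parallel stages by dependency (stubbed: one stage each)."""
    return [[ir] for ir in irs]


def codegen(plan: List[List[str]]) -> Dict[str, str]:
    """Backend: emit one artifact per statement (stand-in for YAML/Python generation)."""
    return {
        "step_%d.py" % i: "# run: %s" % stmt
        for i, stage in enumerate(plan)
        for stmt in stage
    }


if __name__ == "__main__":
    sql = "SELECT * FROM iris.train TO TRAIN DNNClassifier LABEL class INTO my_model"
    print(codegen(optimize([analyze(ir) for ir in parse(sql)])))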

Interpreter

The SQLFlow compiler generates a two-layer graph, and two kinds of interpreters execute the two layers:

  1. First graph (Argo workflow): the Argo controller is the interpreter.
  2. Second graph (AI program): the Python/PAI command-line/EDL command-line is the interpreter.

The Desired Code Structure

/pkg
    /interpreter(executor)
        /graph(Argo)
        /node(python/pai/alisa)
    /compiler
        /parser
        /semantics analysis(runtime)
            /feature_derivation
            /verifier
            /model_reload
            /attribute filler && checker
        /optimizer(static)
            /parallel graph
        /backend(codegen)

Incomplete Thoughts About the Final Design

Shortcomings of the Current System and Some Thoughts

  1. The workflow graph loses much detailed information.

    We hope SQLFlow generates a more detailed graph. For example, if the graph could describe a group of TensorFlow ops running on CPU/GPU, we could optimize the throughput of the AI pipeline.

  2. The workflow cannot achieve the best throughput.

    A streaming graph can achieve better throughput. For example, we could write a custom TensorFlow op that reads the data produced by the SELECT clause as a stream instead of creating a temporary table; see the sketch after this list.
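
As a rough illustration of the streaming idea, the following sketch feeds the rows of a SELECT result into training through tf.data.Dataset.from_generator instead of a temporary table; the DB driver (sqlite3 here) and the schema are placeholders, not what SQLFlow actually uses:

# Hypothetical sketch: stream SELECT results into TensorFlow without a temporary table.
import sqlite3  # stands in for any DB-API driver (MySQL, Hive, MaxCompute, ...)

import tensorflow as tf


def rows_from_select(dsn, select_stmt):
    """Yield rows of the SELECT result one by one instead of materializing them."""
    conn = sqlite3.connect(dsn)
    try:
        for row in conn.execute(select_stmt):
            yield row
    finally:
        conn.close()


def streaming_dataset(dsn, select_stmt, n_columns):
    """Wrap the row generator in a tf.data pipeline; the last column is the label."""
    dataset = tf.data.Dataset.from_generator(
        lambda: rows_from_select(dsn, select_stmt),
        output_signature=tf.TensorSpec(shape=(n_columns,), dtype=tf.float32),
    )
    return dataset.map(lambda row: (row[:-1], row[-1])).batch(32)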

Code Structure

pkg/
    /interpreter
        /argo(graph executor)
        /node(subgraph executor)
            /semantics analysis(runtime? JIT?)
                /feature_derivation
                /verifier
                /model
                /attribute filler && checker
    /graph
    /compiler
        /parser
        /optimizer(static)
            /parallel graph
        /backend(graph)

Things that need to be discussed next:

wangkuiyi commented 4 years ago

Is SQLFlow a compiler or an interpreter? It doesn't make sense for it to be both.

We don't have, and probably never will have, two layers of graphs. We don't have any graph in our system. The top level is a workflow represented by YAML. A workflow is not a graph: an Argo/Tekton workflow can have conditionals, loops, and even function definitions and function calls, whereas a graph cannot have these. The lower level is a Python program, which is not a graph either: a Python program can have all kinds of control flow, but a graph cannot.

It is disappointing to see our team members still sticking to the simplistic idea of a "graph", which early versions of TensorFlow used as a very unprofessional form of IR. This is especially disappointing for those who worked on PaddlePaddle, which tried so hard to propose an IR that is much more powerful than a graph. All along, innovators like Chris Lattner have been introducing professional forms of IR into TensorFlow, but sadly people cannot see those efforts.

shendiaomo commented 4 years ago

The Current Structure

After several structure-adjustment PRs (#2481 #2484 #2491 #2500 #2502 #2505 ), the current package structure has become:

pkg
├── attribute # semantics
├── codegen # codegen
├── database # basic lib
├── executor # step execution
├── ir # intermediate representation
├── log # basic lib
├── model # basic lib
├── modelzooserver # server
├── parser # syntax
├── pipe # basic lib
├── proto # protobuf definitions
├── sql # step execution
├── sqlflowserver # server
├── sqlfs # basic lib
├── step # step execution
├── tar # basic lib
├── test # basic lib
├── verifier # semantics
└── workflow # workflow execution

The Proposed Structure

There are still several problems:

  1. We can restructure the packages according to their functionality, following the standard components of a compiler. For example: put attribute and verifier in a semantics package, put all the basic libraries in a basic package, and put sqlflowserver and modelzooserver in a server package.
  2. The executor package generates code for step execution and then executes that code. We should decouple the code generation phase from the execution phase, and put the decoupled code in codegen and step respectively. Similarly, because the sql package calls executor for step execution, the files in sql should be moved into step. After this stage, the package structure should be:
    pkg
    ├── basic
    │   ├── database
    │   ├── log
    │   ├── model
    │   ├── pipe
    │   ├── sqlfs
    │   ├── tar
    │   └── test
    ├── codegen
    │   ├── alisa.go
    │   ├── pai.go
    │   ├── ...
    │   └── couler
    ├── ir
    ├── parser
    ├── proto
    ├── semantics
    │   ├── attribute
    │   └── verifier
    ├── server
    │   ├── modelzooserver
    │   └── sqlflowserver
    └── execution
        ├── step
        │   └── executor.go
        └── workflow
  3. We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the second pass happens during step execution, where step -e generates and executes the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the single pass generates the YAML and all the scripts, and the scripts live in a directory that is used as Argo input artifacts (see the sketch after this list). After this phase, we don't need pkg/step and cmd/step anymore.
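
A minimal sketch of the one-pass idea, assuming a hypothetical artifact layout (workflow.yaml plus one pre-generated script per step) rather than the compiler's actual output:

# Hypothetical sketch: a single compiler pass writes the workflow YAML and every
# step script into one directory that Argo mounts as input artifacts.
import os


def emit_artifacts(out_dir, step_scripts):
    """Write workflow.yaml plus one pre-generated Python script per step."""
    os.makedirs(out_dir, exist_ok=True)
    steps_yaml = []
    for name, script in step_scripts.items():
        with open(os.path.join(out_dir, name + ".py"), "w") as f:
            f.write(script)
        # Each step runs its pre-generated script instead of `step -e <sql>`.
        # The workflow schema here is simplified for illustration; a real Argo spec differs.
        steps_yaml.append(
            "  - name: %s\n    command: [python, /artifacts/%s.py]" % (name, name)
        )
    with open(os.path.join(out_dir, "workflow.yaml"), "w") as f:
        f.write("steps:\n" + "\n".join(steps_yaml) + "\n")


if __name__ == "__main__":
    emit_artifacts("artifacts", {"step-1": 'print("train model ...")\n'})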
Yancey1989 commented 4 years ago

I agree that there is a two-layer architecture in the current code base, and that makes SQLFlow unclear.

  1. In the 1st layer, SQLFlow translates a SQL program into a workflow, which is a YAML file, and the Argo controller is the executor that executes this workflow. -- SQLFlow is a compiler.
  2. In the 2nd layer, each workflow step executes a SQL statement using the SQLFlow step command-line, which translates the SQL statement into a Python script and executes it. -- The SQLFlow step command-line is much like an interpreter.

To make it clearer, I think we can keep the two-layer architecture while making SQLFlow a pure compiler.

  1. In the 1st layer, SQLFlow generates a workflow; each workflow step includes an entry-point program, which is a Python program.
  2. In the 2nd layer, each workflow step executes this Python script using the Python interpreter.
Yancey1989 commented 4 years ago

After this phase, we don't need the pkg/step and cmd/step anymore.

So we don't need the pkg/execution/step folder?

brightcoder01 commented 4 years ago

3. We have a 2-pass compilation architecture: 1) the first pass generates the workflow YAML and submits it; 2) the second pass happens during step execution, where step -e generates and executes the Python scripts. This architecture makes SQLFlow neither a "pure" compiler nor a "pure" interpreter. We can make SQLFlow a one-pass compiler: the single pass generates the YAML and all the scripts, and the scripts live in a directory that is used as Argo input artifacts.

In the current architecture, we always run step -e {sql_statement} in each step. This brings the limitation that one SQL statement maps to exactly one step, and the step binary redoes the parsing and IR-building work inside the step image, which is duplicated effort. In the future, one SQL statement may be translated into several steps. So it would be better if, after translating the SQL program into a workflow, we could see explicitly what each step executes, such as data analysis, data exploration, or model training, instead of executing the general command step -e.

shendiaomo commented 4 years ago

After this phase, we don't need the pkg/step and cmd/step anymore.

So we don't need the pkg/execution/step folder?

No, we don't. We only have to move something like table_writer to the basic package.

shendiaomo commented 4 years ago

I agree that there is a two-layer architecture in the current code base, and that makes SQLFlow unclear.

  1. In the 1st layer, SQLFlow translates a SQL program into a workflow, which is a YAML file, and the Argo controller is the executor that executes this workflow. -- SQLFlow is a compiler.
  2. In the 2nd layer, each workflow step executes a SQL statement using the SQLFlow step command-line, which translates the SQL statement into a Python script and executes it. -- The SQLFlow step command-line is much like an interpreter.

To make it clearer, I think we can keep the two-layer architecture while making SQLFlow a pure compiler.

  1. In the 1st layer, SQLFlow generates a workflow; each workflow step includes an entry-point program, which is a Python program.
  2. In the 2nd layer, each workflow step executes this Python script using the Python interpreter.

In a discussion with @Yancey1989, we found that we still have to implement a feature derivation mechanism in Python, like the previous migration prototype, to make the proposed structure workable.

The problem is that:

  1. The feature derivation mechanism must run during step execution.
  2. The codegen package in the current architecture depends heavily on feature derivation to generate Python code.

As a result, we have to first generate the .yaml in sqlflowserver and then generate the .py files in the step binary.
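
For context, a minimal sketch of what step-time feature derivation in Python might look like, inferring a rough feature type for each column from a sample of the SELECT result; the function name and heuristics are hypothetical, not the actual runtime API:

# Hypothetical sketch: derive feature columns at step-execution time by
# inspecting a sample of the SELECT result. Not the real SQLFlow runtime API.
import sqlite3  # stands in for any DB-API driver


def derive_features(dsn, select_stmt, label):
    """Map every column except the label to a rough feature type."""
    conn = sqlite3.connect(dsn)
    try:
        cursor = conn.execute(select_stmt + " LIMIT 100")
        names = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()

    features = {}
    for i, name in enumerate(names):
        if name == label:
            continue
        values = [row[i] for row in rows if row[i] is not None]
        if values and all(isinstance(v, (int, float)) for v in values):
            features[name] = "numeric"
        else:
            features[name] = "categorical"
    return features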

wangkuiyi commented 4 years ago

As a result, we have to first generate the .yaml in sqlflowserver and then generate the .py files in the step binary.

Since SQLFlow is a compiler, which doesn't care about execution, it seems that we should have a command-line compiler. What would be a good name for the compiler's binary file?

Yancey1989 commented 4 years ago

As a result, we have to first generate the .yaml in sqlflowserver and then generate the .py files in the step binary.

I think we can generate a .yaml file and, for each step, call the TensorFlow/XGBoost/PAI code generator to generate a submitter entry-point program, which is a .py script. The following .yaml snippet is a very simple example:

steps:
  - name: step-1
    command: ["python", "-c"]
    args:
      - |
        from sqlflow.runtime import tensorflow
        tensorflow.train(...)

That tensorflow.train calls feature derivation and the verifier, and then trains the TensorFlow model.

We can also separate feature derivation and the verifier into separate steps to decouple the workflow step logic.
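
A minimal sketch of what such a runtime train entry point might look like; derive_features and verify are hypothetical placeholders rather than the real sqlflow.runtime API, and the actual training call is elided:

# Hypothetical sketch of a sqlflow.runtime.tensorflow.train entry point.
import tensorflow as tf


def derive_features(dsn, select_stmt):
    """Placeholder for step-time feature derivation (see the earlier sketch)."""
    return [tf.feature_column.numeric_column("sepal_length")]


def verify(dsn, select_stmt, feature_columns):
    """Placeholder for the runtime verifier: check the data schema before training."""


def train(dsn, select_stmt, label, n_classes, model_dir):
    """Derive features, verify the schema, then build and train an estimator."""
    feature_columns = derive_features(dsn, select_stmt)
    verify(dsn, select_stmt, feature_columns)
    estimator = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[16, 16],
        n_classes=n_classes,
        model_dir=model_dir,
    )
    # estimator.train(input_fn=...)  # input_fn would stream the SELECT result
    return estimator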

Yancey1989 commented 4 years ago

Since SQLFlow is a compiler, which doesn't care of the execution, it seems that we should have a command-line compiler. What name is good for the binary file of the compiler?

@wangkuiyi sqlflow is good.

sneaxiy commented 4 years ago

That tensorflow.train calls feature derivation and the verifier, and then trains the TensorFlow model.

I am afraid that if we go this way, we would move a lot of Go code into Python.

In this way, the sqlflowserver may only do the following things:

All other things would be done in Python, for example:

Python code may be less maintainable than Go code.

Yancey1989 commented 4 years ago

The updated code structure is as follows, based on https://github.com/sql-machine-learning/sqlflow/issues/2494#issuecomment-647692915:

  1. Move semantics from Go to the runtime Python package.
  2. Remove the basic top-level folder.
  3. Move the Go code into a go folder.
|-go
|  |--cmd
|  |  |--sqlflowserver          // SQLFlow gRPC server
|  |  |--modelzooserver         // SQLFlow Model Zoo gRPC server
|  |  `--sqlflow                // SQLFlow command-line tool
|  |--pkg
|  |  |--ir
|  |  |--parser
|  |  |--log
|  |  |--model
|  |  |--pipe
|  |  |--sqlfs
|  |  |--tar
|  |  |--test
|  |  |--codegen
|  |  |  |--pai
|  |  |  |  |--tensorflow
|  |  |  |  |--xgboost
|  |  |  |  |--kmeans
|  |  |  |--alisa
|  |  |  |--tensorflow
|  |  |  |--couler
|  |  |  `--xgboost
|  |  |--server                 // SQLFlow server interface implementation
|  |  |  |--proto
|  |  |  |--run
|  |  |  `--fetch
|  |  |--modelzoo
|  |  |--executor
|  |  |  |--argo                // Argo is workflow executor
|  |  |  `--python              // Python is workflow step executor
|-python
|  |--sqlflow.runtime
|  |  |--pai
|  |  |  `--tensorflow/xgboost/shap
|  |  |--alisa
|  |  |--tensorflow
|  |  |--xgboost
|  |  |--feature_derivation
|  |  |--verifier
|  |  `--db_writer
|  `--couler
`-java

The following tasks should be done:

Yancey1989 commented 4 years ago

Python code may be less maintainable than Go code.

@sneaxiy I agree with that. In my opinion, we can keep the Go packages feature_derivation/verifier/sqlfs and export them as a Python API, so that we can call them from the Python runtime package. What do you think?

sneaxiy commented 4 years ago

@sneaxiy I agree with that. In my opinion, we can keep the Go packages feature_derivation/verifier/sqlfs and export them as a Python API, so that we can call them from the Python runtime package. What do you think?

@Yancey1989 That may be more complex. Let us find some ways to make the Python code more maintainable, such as improving code coverage.

Yancey1989 commented 4 years ago

TODOs for SQLFlow compiler refactor:

  1. Move feature_derivation to the runtime Python package.
  2. Separate the verifier into two parts.
    1. Compile time runs the attribute checker.
    2. Runtime verifies the data schema.
  3. Refactor the existing code generators.
    1. Update them based on the new feature derivation and verifier code.
    2. Add codegen/pai to generate the PAI submitter program.
    3. Add codegen/alisa to generate the Alisa submitter program.
  4. Move the workflow step response Go package to Python.

There are two main problems with the above plan:

  1. Python code is harder to maintain than Go. We can do the following to address this (a brief illustration of the static-typing point follows this list):
    1. Adopt the Google Python style guide.
    2. Improve code coverage.
    3. Use static type-checking tools, e.g. pytype, to check Python types statically.
  2. ROI. We would have to move a lot of Go code to Python, which would take about two man-months. Should we do that immediately?
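
As a small illustration of the static type-checking point, annotated code like the following (a hypothetical helper, not real SQLFlow code) lets a tool such as pytype flag type errors before the step runs:

# Hypothetical example: with type annotations, a checker such as pytype can
# catch mistakes statically instead of at step-execution time.
from typing import Dict, List


def select_feature_columns(schema: Dict[str, str], label: str) -> List[str]:
    """Return the names of all columns except the label column."""
    return [name for name in schema if name != label]


# select_feature_columns({"sepal_length": "float"}, label=1) would be flagged
# by the checker, because 1 is not a str.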
weiguoz commented 4 years ago

Supply a Python db-api to access alisa. @Yancey1989

brightcoder01 commented 4 years ago
  • Add codegen/pai to generate the PAI submitter program.

During workflow compilation, for the TO RUN statement, we have the flexibility to generate a different command-line call for the step according to the deployment platform and the execution program. We should upgrade the compiler to generate the appropriate step code according to these two (or more) variables; a sketch of such a dispatch follows.
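
A minimal sketch of that dispatch, with hypothetical platform names and command templates that are not part of the actual compiler:

# Hypothetical sketch: pick the step command for a TO RUN statement based on
# the deployment platform and the execution program.
from typing import List


def step_command(platform: str, program: str) -> List[str]:
    """Return the command line a workflow step should run."""
    if platform == "kubernetes":
        return ["python", "-m", program]   # run the program inside the step image
    if platform == "pai":
        return ["pai-submit", program]     # placeholder; the real PAI submission differs
    raise ValueError("unsupported platform: " + platform)


print(step_command("kubernetes", "my_data_analysis"))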