sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0

[Discussion] SQLFlow statement into steps. #2340

Open brightcoder01 opened 4 years ago

brightcoder01 commented 4 years ago

A SQLFlow statement can be compiled into one or more Couler steps. Here we list what each step does and which image it relies on:
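
For concreteness, here is an illustrative SQLFlow TO TRAIN statement of the kind this decomposition targets, held in a Python string (the table and model names follow the style of the public iris tutorial and are only placeholders, not taken from this thread):

```python
# Illustrative placeholder only: one SQLFlow extended-SQL TO TRAIN statement that
# the compiler would split into the steps listed in the tables below.
train_stmt = """
SELECT * FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 10]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;
"""
```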

TO TRAIN:

| step | image | data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Data exploration (for categorical columns, compute the distinct count) and feature derivation (if the distinct count is small, use VOCABULARIZE to map values to IDs; if it is large, use Hash, etc.) | step image | SQL: DESCRIBE for the table schema, aggregation SQL |
| Data Auto Analysis | step image | SQL |
| Generate transform code and submit training | model image from model zoo | PAI-TF (inside MaxCompute), ElasticDL (read into Kubernetes) |
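
As a rough illustration of this decomposition (not SQLFlow's actual codegen), the four TO TRAIN steps above could be emitted as sequential Couler container steps roughly like the sketch below. The image names, the `run_sql` entrypoint, and the trainer module are hypothetical placeholders; only `couler.run_container` is the real Couler call.

```python
# A minimal sketch, not SQLFlow's actual codegen: one TO TRAIN statement expanded
# into sequential Couler container steps. The image names, the run_sql entrypoint,
# and the trainer module are hypothetical; only couler.run_container is real API.
import couler.argo as couler

# Step 1: SELECT data from the source table into a temp table.
couler.run_container(
    image="sqlflow/step:sql",  # hypothetical step image with a SQL client
    command=["bash", "-c"],
    args=['run_sql "CREATE TABLE tmp_train AS SELECT * FROM iris.train"'],
)

# Step 2: data exploration (distinct counts for categorical columns) and
# feature derivation, via DESCRIBE plus aggregation SQL.
couler.run_container(
    image="sqlflow/step:sql",
    command=["bash", "-c"],
    args=['run_sql "DESCRIBE tmp_train" && '
          'run_sql "SELECT COUNT(DISTINCT class) FROM tmp_train"'],
)

# Step 3: data auto analysis over the temp table.
couler.run_container(
    image="sqlflow/step:sql",
    command=["bash", "-c"],
    args=['run_sql "SELECT AVG(sepal_length), STDDEV(sepal_length) FROM tmp_train"'],
)

# Step 4: generate transform code and submit training with a model image from the
# model zoo (ElasticDL on Kubernetes, or PAI-TF inside MaxCompute).
couler.run_container(
    image="modelzoo/dnnclassifier:v0.1",   # hypothetical model image
    command=["python", "-m", "trainer"],   # hypothetical entrypoint
    args=["--data", "tmp_train", "--label", "class",
          "--into", "sqlflow_models.my_dnn_model"],
)
```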

TO PREDICT:

| step | image | data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Use the transform code from training and submit the prediction job | model image from model zoo | PAI-TF (inside MaxCompute), ElasticDL (read into Kubernetes) |

TO EVALUATE:

| step | image | data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Use the transform code from training and submit the evaluation job | model image from model zoo | PAI-TF (inside MaxCompute), ElasticDL (read into Kubernetes) |

TO RUN:

| step | image | data access |
| --- | --- | --- |
| SELECT data from the source table into a temp table | step image | SQL |
| Run the Python function in Docker directly (Kubernetes) / submit a PyODPS task via Alisa (MaxCompute) | function image from model zoo | Kubernetes (needs access), PyODPS task (inside MaxCompute) |
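
A similarly hedged sketch for the Kubernetes path of TO RUN, where the whole statement becomes a single container step executing the packaged Python function; the image name and module path are hypothetical, and on MaxCompute the step would submit a PyODPS task via Alisa instead.

```python
# Minimal sketch, Kubernetes path only: TO RUN becomes a single container step that
# executes the packaged Python function. The image name and module path are
# hypothetical; on MaxCompute the same step would submit a PyODPS task via Alisa.
import couler.argo as couler

couler.run_container(
    image="modelzoo/my_func:v0.1",             # hypothetical function image
    command=["python", "-m", "my_func.main"],  # hypothetical entrypoint
    args=["--input", "tmp_run", "--output", "result_table"],
)
```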

Discussion points:

| Deployment Platform vs Step (Continuous Improvement) | Google Cloud | Amazon AWS | Alibaba MaxCompute | Travis CI | Vanilla Kubernetes |
| --- | --- | --- | --- | --- | --- |
| Store SELECT FROM into a temp table | step image (BigQuery) | | step image (goalisa) | | |
| Data Exploration + Feature Derivation | step image (BigQuery) | | step image (goalisa) | | |
| Data Analysis | step image (BigQuery) | | step image (goalisa) | | |
| Generate Transform code + Submit Training | model image | | model image | | |
Yancey1989 commented 4 years ago

We should clean up the temp table at the end of each job, so we need at least three steps for each job:

  1. create a temp table
  2. submit a job
  3. clean up the temp table

Another question: how do we ensure the clean-up step is still scheduled if the job fails?
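
One possible answer, assuming the step workflow compiles to Argo: Argo supports exit handlers (onExit) that run regardless of whether the workflow succeeded or failed, and the clean-up step could be registered there. The sketch below assumes Couler exposes such a hook; the exact `couler.set_exit_handler` / `couler.WFStatus` spelling is an assumption, not a verified API, and the image and `run_sql` entrypoint are placeholders.

```python
# Sketch only: register the temp-table clean-up as a workflow exit handler so it is
# scheduled whether the training job succeeded or failed (Argo's onExit mechanism).
# couler.set_exit_handler and couler.WFStatus are an assumed spelling of that hook,
# not a verified API; the image and run_sql entrypoint are placeholders.
import couler.argo as couler

def drop_temp_table():
    couler.run_container(
        image="sqlflow/step:sql",
        command=["bash", "-c"],
        args=['run_sql "DROP TABLE IF EXISTS tmp_train"'],
    )

# Run the clean-up on failure as well as on success.
couler.set_exit_handler(couler.WFStatus.Failed, drop_temp_table)
couler.set_exit_handler(couler.WFStatus.Succeeded, drop_temp_table)
```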

wangkuiyi commented 4 years ago

I am not sure if we could formalize the execution of the TO TRAIN statement into a specific number n of steps.

However, I have an idea that might help us figure this out: let us draw a 2-dimensional table -- one dimension includes various deployments (on Google Cloud, Amazon AWS, Alibaba internal, Travis CI, etc), and the other one includes the proposed steps. In each cell, we need to figure out the Docker image to be executed by a step in the given deployment.

We might add more dimensions, like various DBMSes, AI platforms, model definition APIs (TensorFlow/Keras, PyTorch, XGBoost), and data readers (MaxCompute multi-tunnel reader, local filesystem reader, etc.).

We also need to take the model zoo design into consideration because it seems that at least one of the steps to execute the TO TRAIN statement needs to train a model defined in a zoo.
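
As a throwaway illustration of that 2-dimensional table, it could start life as a plain mapping from (deployment, step) to the Docker image in each cell; every image tag below is made up.

```python
# Illustrative only: the deployments-by-steps matrix as a plain mapping from
# (deployment, step) to the Docker image in that cell. Every tag is made up.
STEP_IMAGES = {
    ("google-cloud", "select-into-temp"): "sqlflow/step-bigquery:latest",
    ("google-cloud", "data-exploration"): "sqlflow/step-bigquery:latest",
    ("alibaba-maxcompute", "select-into-temp"): "sqlflow/step-goalisa:latest",
    ("alibaba-maxcompute", "data-exploration"): "sqlflow/step-goalisa:latest",
    ("vanilla-kubernetes", "train"): "modelzoo/dnnclassifier-elasticdl:v0.1",
    ("alibaba-maxcompute", "train"): "modelzoo/dnnclassifier-paitf:v0.1",
}

def image_for(deployment: str, step: str) -> str:
    """Look up the image for a cell; a KeyError means the cell is still undecided."""
    return STEP_IMAGES[(deployment, step)]
```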

typhoonzero commented 4 years ago

The user can choose different base images to adapt to his model definition, e.g. tf-2.0, tf-1.15, pytorch-1.0, etc. SQLFlow provides a list of base images for users to choose from, and SQLFlow guarantees that each base image works on every execution engine, including local runs, ElasticDL, PAI, etc.

brightcoder01 commented 4 years ago

For the first step of all the statements, SELECT data from the source table into a temp table, the behavior currently differs across DBMSes.

**Question: can we make the behavior unified and write the result of `SELECT FROM` into a temporary storage (table) for all DBMSes?**

weiguoz commented 4 years ago

> For the first step of all the statements, SELECT data from the source table into a temp table, the behavior currently differs across DBMSes.

  • MySQL/Hive: we don't create the temporary table.
  • MaxCompute: we create the temporary table.

Addition: it's not a matter of the DBMS, but of the AI engine.

brightcoder01 commented 4 years ago
  • Should we differentiate the step image by Platforms?
  • Should model / function images from model zoo be platform agnostic?

For the two questions above, the discussion notes:

1. Yes. Use different step images for different deployment platforms (one image per platform): Vanilla Kubernetes, Alibaba internal, Google Cloud, and so on.
2. No. Use a different model zoo service for each deployment platform. For the same Keras model definition, such as DNNClassifier, we will build different model images from different base images for the various deployment platforms. For example, the base image for Vanilla Kubernetes contains elasticdl, and the base image for Alibaba internal contains goalisa.
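
A small sketch of point 2, with made-up names: the same model definition is built into per-platform model images on top of platform-specific base images.

```python
# Illustrative only: one model zoo entry (DNNClassifier) built into a different
# model image per deployment platform, on top of a platform-specific base image.
BASE_IMAGES = {
    "vanilla-kubernetes": "sqlflow/base-elasticdl:latest",  # contains elasticdl
    "alibaba-internal": "sqlflow/base-goalisa:latest",      # contains goalisa
}

def build_spec(model_name: str, platform: str) -> dict:
    """Describe a hypothetical image build for one (model, platform) pair."""
    return {
        "from": BASE_IMAGES[platform],  # the Dockerfile FROM line would use this base
        "tag": f"modelzoo/{model_name.lower()}-{platform}:v0.1",
    }

# e.g. build_spec("DNNClassifier", "vanilla-kubernetes")
#   -> {"from": "sqlflow/base-elasticdl:latest",
#       "tag": "modelzoo/dnnclassifier-vanilla-kubernetes:v0.1"}
```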

workingloong commented 4 years ago

> We should clean up the temp table at the end of each job, so we need at least three steps for each job:
>
>   1. create a temp table
>   2. submit a job
>   3. clean up the temp table
>
> Another question: how do we ensure the clean-up step is still scheduled if the job fails?

We can set the lifecycle for the temp table.
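
For the MaxCompute case, a minimal sketch of what that could look like: right after creating the temp table, set a short lifecycle on it (`ALTER TABLE ... SET LIFECYCLE` is real MaxCompute DDL) so the table is reclaimed automatically even if the explicit clean-up step never runs. The `run_sql` entrypoint, step image, and table names are placeholders.

```python
# Minimal sketch: give the temp table a 1-day lifecycle right after creating it, so
# MaxCompute garbage-collects it even if the clean-up step is never scheduled.
# ALTER TABLE ... SET LIFECYCLE is real MaxCompute DDL; the step image and run_sql
# entrypoint are hypothetical placeholders.
import couler.argo as couler

couler.run_container(
    image="sqlflow/step:sql",
    command=["bash", "-c"],
    args=[
        'run_sql "CREATE TABLE tmp_train AS SELECT * FROM iris.train" && '
        'run_sql "ALTER TABLE tmp_train SET LIFECYCLE 1"'
    ],
)
```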