sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License
733 stars 113 forks

Generate data analysis SQL using code_gen from data transform expression in SQLFlow. #1667

Open brightcoder01 opened 4 years ago

brightcoder01 commented 4 years ago

The following transform functions contain an analysis step. The analysis work should be done first to make the transform logic concrete.

| Name | Feature Column Template | Analyzer |
| --- | --- | --- |
| STANDARDIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x: (x - {mean}) / {std}) | MEAN, STDDEV |
| NORMALIZE(x) | numeric_column({var_name}, normalizer_fn=lambda x: (x - {min}) / ({max} - {min})) | MAX, MIN |
| BUCKETIZE(x, bucket_num=y) | bucketized_column({var_name}, boundaries={percentiles}) | PERCENTILE |
| APPLY_VOCAB(x) | categorical_column_with_vocabulary({var_name}, vocabulary_list={vocabulary_list}) | DISTINCT |
| HASH(x, hash_bucket_size) | categorical_column_with_hash_bucket | COUNT(DISTINCT) |

The SQLFlow syntax for data transform and analysis is discussed in #1664
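To make the template-filling step concrete, here is a minimal sketch of how the analysis results could be substituted into the templates above. The helper and dictionary names are illustrative, not SQLFlow's actual code_gen API:

```python
# Hypothetical sketch of the template substitution step in code_gen;
# the template strings mirror the table above.
TEMPLATES = {
    "STANDARDIZE": "numeric_column({var_name}, normalizer_fn=lambda x: (x - {mean}) / {std})",
    "NORMALIZE": "numeric_column({var_name}, normalizer_fn=lambda x: (x - {min}) / ({max} - {min}))",
    "BUCKETIZE": "bucketized_column({var_name}, boundaries={percentiles})",
}

def render_feature_column(transform, var_name, stats):
    """Fill a feature-column template with stats from the analysis SQL."""
    return TEMPLATES[transform].format(var_name=var_name, **stats)

# Example: stats as they might come back from the analysis query.
code = render_feature_column("STANDARDIZE", "age", {"mean": 38.6, "std": 13.6})
print(code)
```

The analysis SQL results are the only missing pieces; once they are filled in, the generated string is a plain feature-column definition.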

brightcoder01 commented 4 years ago

Let's take the following SQLFlow statement as an example:

SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN (
    NUMERIC(NORMALIZE(capital_gain)),
    NUMERIC(STANDARDIZE(age)),
    EMBEDDING(BUCKETIZE(hours_per_week, bucket_num=5), dim=32),
    EMBEDDING(APPLY_VOCAB(occupation), dim=16),
    EMBEDDING(HASH(workclass), dim=8)
)
LABEL label

The generated analysis SQL statements in MaxCompute are:

Calculate the min and max of capital_gain, the mean and stddev of age, and the recommended hash_bucket_size for workclass:

SELECT 
    MIN(capital_gain) AS _capital_gain_min_, 
    MAX(capital_gain) AS _capital_gain_max_, 
    AVG(age) AS _age_mean_, 
    STDDEV(age) AS _age_stddev_,
    (COUNT(DISTINCT(workclass)) * 3) AS _workclass_hash_bucket_num_
FROM census_income;
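The single-pass statistics query above follows a regular pattern, so code_gen could assemble it from (column, analyzer) pairs. A minimal sketch; the mapping and helper names are hypothetical, not SQLFlow's actual generator:

```python
# Hypothetical generator for the single-pass statistics SQL above.
# Each analyzer maps to one SQL expression template.
ANALYZER_SQL = {
    "MIN": "MIN({col}) AS _{col}_min_",
    "MAX": "MAX({col}) AS _{col}_max_",
    "MEAN": "AVG({col}) AS _{col}_mean_",
    "STDDEV": "STDDEV({col}) AS _{col}_stddev_",
    "HASH_BUCKET": "(COUNT(DISTINCT({col})) * 3) AS _{col}_hash_bucket_num_",
}

def gen_stats_sql(table, analyzers):
    """analyzers: list of (column, analyzer_name) pairs."""
    exprs = [ANALYZER_SQL[a].format(col=c) for c, a in analyzers]
    return "SELECT\n    " + ",\n    ".join(exprs) + "\nFROM {};".format(table)

print(gen_stats_sql("census_income", [
    ("capital_gain", "MIN"), ("capital_gain", "MAX"),
    ("age", "MEAN"), ("age", "STDDEV"),
    ("workclass", "HASH_BUCKET"),
]))
```

Grouping all scalar analyzers into one SELECT keeps the analysis to a single table scan.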

Calculate the bucket boundaries of hours_per_week:

SELECT 
    percentile(hours_per_week, 0.2) AS _hours_per_week_bkt_boundary_1_,
    percentile(hours_per_week, 0.4) AS _hours_per_week_bkt_boundary_2_,
    percentile(hours_per_week, 0.6) AS _hours_per_week_bkt_boundary_3_,
    percentile(hours_per_week, 0.8) AS _hours_per_week_bkt_boundary_4_
FROM census_income;
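The per-boundary expressions above are also regular: bucket_num buckets need bucket_num - 1 interior boundaries at evenly spaced quantiles. A minimal sketch of how code_gen could emit them (the helper name is hypothetical):

```python
def gen_percentile_sql(table, col, bucket_num):
    """Generate one percentile() expression per interior bucket boundary;
    bucket_num buckets need bucket_num - 1 boundaries."""
    exprs = [
        "percentile({col}, {p}) AS _{col}_bkt_boundary_{i}_".format(
            col=col, p=round(i / bucket_num, 2), i=i)
        for i in range(1, bucket_num)
    ]
    return "SELECT\n    " + ",\n    ".join(exprs) + "\nFROM {};".format(table)

print(gen_percentile_sql("census_income", "hours_per_week", 5))
```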

Calculate the vocabulary of occupation:

SELECT DISTINCT(occupation) 
FROM census_income;
brightcoder01 commented 4 years ago

To simplify the analysis SQL that calculates the bucket boundaries of hours_per_week, I also tried the array form of the percentile function described in the MaxCompute documentation. Unexpectedly, it doesn't work:

SELECT
    percentile(hours_per_week, array(0.2, 0.4, 0.6, 0.8))
FROM census_income;

The error message is: Invalid argument type - invalid type ARRAY of argument 2 for function percentile, expect DOUBLE
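Since percentile here apparently accepts only a DOUBLE as its second argument, one fallback is to keep emitting one percentile() call per boundary; another is to compute the boundaries client-side from a sample of the column. A minimal sketch of the client-side option using Python's statistics.quantiles (the sample data is made up for illustration):

```python
import statistics

# If percentile() cannot take an ARRAY, one workaround is to pull a
# sample (or the full column for small tables) and compute the bucket
# boundaries client-side. statistics.quantiles with n=5 returns the
# 4 cut points between 5 equal-sized buckets, matching bucket_num=5.
hours_per_week = [20, 25, 30, 35, 40, 40, 45, 50, 55, 60]  # made-up sample
boundaries = statistics.quantiles(hours_per_week, n=5)
print(boundaries)
```

Whether sampling is acceptable depends on the table size and the accuracy required for the bucket boundaries.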