Open brightcoder01 opened 4 years ago
Let's take the following SQLFlow statement for example:
SELECT *
FROM census_income
TO TRAIN DNNClassifier
WITH model.hidden_units = [10, 20]
COLUMN (
NUMERIC(NORMALIZE(capital_gain)),
NUMERIC(STANDARDIZE(age)),
EMBEDDING(BUCKETIZE(hours_per_week, bucket_num=5), dim=32),
EMBEDDING(APPLY_VOCAB(occupation), dim=16),
EMBEDDING(HASH(workclass), dim=8)
)
LABEL label
The generated analysis SQL in MaxCompute is:
Calculate the min and max of capital_gain, the mean and stddev of age, and the recommended hash_bucket_size for workclass:
SELECT
MIN(capital_gain) AS _capital_gain_min_,
MAX(capital_gain) AS _capital_gain_max_,
AVG(age) AS _age_mean_,
STDDEV(age) AS _age_stddev_,
(COUNT(DISTINCT(workclass)) * 3) AS _workclass_hash_bucket_num_
FROM census_income;
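Once these statistics are available, the NORMALIZE, STANDARDIZE and HASH transforms become concrete. The sketch below is illustrative only. It assumes NORMALIZE means min-max scaling and STANDARDIZE means z-score scaling, which this issue does not spell out. The statistic values are made up, and `zlib.crc32` is just a stand-in for whatever hash function the real transform uses.

```python
import zlib

# Hypothetical analysis results; real values would come from the query above.
stats = {
    "_capital_gain_min_": 0.0,
    "_capital_gain_max_": 100000.0,
    "_age_mean_": 40.0,
    "_age_stddev_": 10.0,
    "_workclass_hash_bucket_num_": 27,
}


def normalize(x, min_value, max_value):
    """Min-max scaling to [0, 1] (assumed meaning of NORMALIZE)."""
    if max_value == min_value:
        return 0.0  # constant column: avoid division by zero
    return (x - min_value) / (max_value - min_value)


def standardize(x, mean, stddev):
    """Z-score scaling (assumed meaning of STANDARDIZE)."""
    if stddev == 0:
        return 0.0  # constant column: avoid division by zero
    return (x - mean) / stddev


def hash_bucket(value, bucket_num):
    """Map a string to a bucket id in [0, bucket_num); crc32 is an illustrative choice."""
    return zlib.crc32(value.encode("utf-8")) % bucket_num
```

For example, `normalize(50000.0, stats["_capital_gain_min_"], stats["_capital_gain_max_"])` scales a capital_gain of 50000 to the midpoint of the observed range.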
Calculate the bucket boundary of hours_per_week
SELECT
percentile(hours_per_week, 0.2) AS _hours_per_week_bkt_boundry_1_,
percentile(hours_per_week, 0.4) AS _hours_per_week_bkt_boundry_2_,
percentile(hours_per_week, 0.6) AS _hours_per_week_bkt_boundry_3_,
percentile(hours_per_week, 0.8) AS _hours_per_week_bkt_boundry_4_
FROM census_income;
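With the four boundaries known, BUCKETIZE reduces to a sorted lookup. This is a minimal sketch with made-up boundary values. Putting a value that equals a boundary into the bucket on its right is an assumption, not something this issue specifies.

```python
import bisect

# Hypothetical boundaries; real values would come from the percentile query above.
boundaries = [25.0, 38.0, 40.0, 50.0]


def bucketize(x, boundaries):
    """Return the bucket index of x given sorted boundaries.

    Four boundaries give five buckets, numbered 0..4, matching bucket_num=5.
    A value equal to a boundary falls into the bucket to its right.
    """
    return bisect.bisect_right(boundaries, x)
```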
Calculate the vocabulary of occupation
SELECT DISTINCT(occupation)
FROM census_income;
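The distinct values returned by this query can then be turned into an id lookup for APPLY_VOCAB. This is a rough sketch. `build_vocab` and `apply_vocab` are hypothetical helper names, the sample values are made up, and mapping unseen values to one extra out-of-vocabulary id is an assumption.

```python
def build_vocab(values):
    """Assign each distinct value an integer id; sorting keeps ids deterministic."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def apply_vocab(value, vocab):
    """Look up the id of value; unseen values share the id len(vocab) (assumed OOV policy)."""
    return vocab.get(value, len(vocab))


# Hypothetical result of SELECT DISTINCT(occupation).
occupation_vocab = build_vocab(["Sales", "Tech-support", "Craft-repair"])
```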
To simplify the analysis SQL that calculates the bucket boundaries of hours_per_week, I also tried passing an array of quantiles to the percentile function, following its instructions in MaxCompute. Unexpectedly, it doesn't work:
SELECT
percentile(hours_per_week, array(0.2, 0.4, 0.6, 0.8))
FROM census_income;
The error message is: Invalid argument type - invalid type ARRAY
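While the array form is rejected, the per-boundary form can still be generated from bucket_num instead of being written by hand. `percentile_sql` below is a hypothetical helper for illustration, not an existing SQLFlow function. It assumes equal-frequency buckets and reuses the alias pattern from the working query above.

```python
def percentile_sql(column, table, bucket_num):
    """Build one percentile() call per boundary for equal-frequency bucketing."""
    select_items = []
    for i in range(1, bucket_num):
        # Round to keep the literal short, e.g. 0.2 instead of a long float.
        quantile = round(i / bucket_num, 4)
        select_items.append(
            f"percentile({column}, {quantile}) AS _{column}_bkt_boundry_{i}_"
        )
    return "SELECT\n" + ",\n".join(select_items) + f"\nFROM {table};"


sql = percentile_sql("hours_per_week", "census_income", 5)
```

With `bucket_num=5` this produces the same four percentile calls as the working query above.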
The following transform functions contain analysis. The analysis work should be done first to make the transform logic concrete.
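Going only by the example in this issue, the analysis each transform depends on can be summarized roughly as follows. The mapping is inferred from the generated SQL shown here rather than taken from SQLFlow itself, and `required_analysis` is a hypothetical helper.

```python
# Statistics each transform needs, as inferred from the example in this issue.
ANALYSIS_BY_TRANSFORM = {
    "NORMALIZE": ["MIN", "MAX"],
    "STANDARDIZE": ["AVG", "STDDEV"],
    "BUCKETIZE": ["percentile"],
    "APPLY_VOCAB": ["DISTINCT"],
    "HASH": ["COUNT(DISTINCT)"],
}


def required_analysis(transforms):
    """Collect the statistics needed by the given transform names, keeping first-seen order."""
    needed = []
    for name in transforms:
        # Transforms without an entry (e.g. NUMERIC, EMBEDDING) need no analysis here.
        for stat in ANALYSIS_BY_TRANSFORM.get(name.upper(), []):
            if stat not in needed:
                needed.append(stat)
    return needed
```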
The SQLFlow syntax for data transform and analysis is discussed in #1664