myui / hivemall

Scalable machine learning library for Apache Hive/Spark/Pig
http://hivemall.incubator.apache.org/

Support feature selection #338

Open · amaya382 opened this issue 8 years ago

amaya382 commented 8 years ago

Feature selection

Feature selection is the process of selecting a subset of influential features from the original feature set. It is an important technique for improving results, shortening training time, and making features human-understandable.

Currently, the following is a tentative interface (I/F).

Candidates for internal selection methods: chi2 and SNR (each detailed below).

[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>

Input

| array X | array Y |
| --- | --- |
| a row of matrix | a row of matrix |

Output

| array<array> dotted |
| --- |
| dot(X.T, Y), shape = (X.#cols, Y.#cols) |
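As a concrete illustration, here is a toy run (a sketch; the two-row table t is assumed for this example only):

-- t holds two rows: (X=[1,0], Y=[1,2]) and (X=[0,1], Y=[3,4]).
-- Each row contributes the outer product X_i * Y_i^T; partial aggregates merge by addition:
--   [[1,2],[0,0]] + [[0,0],[3,4]] = [[1.0,2.0],[3.0,4.0]] = dot(X.T, Y)
SELECT transpose_and_dot(X, Y) AS dotted FROM t;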

[UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>

Input

| array X | array importance list | int k |
| --- | --- | --- |
| array | the larger, the more important | top-k |

Output

| array k-best elements |
| --- |
| top-k elements from X, selected by the indices of the top-k entries of the importance list |
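For example, a sketch with literal arrays (assuming selected elements keep their original order):

SELECT select_k_best(array(0.1, 0.2, 0.3), array(5.0, 1.0, 3.0), 2);
-- the two largest importances (5.0 and 3.0) sit at indices 0 and 2, so the result is [0.1, 0.3]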


chi2

[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>

Input

Both observed and expected have shape = (#classes, #features).

| array observed | array expected |
| --- | --- |
| observed features | expected features, dot(class_prob.T, feature_count) |

Output

| struct<array, array> importance lists |
| --- |
| chi2 values and p-values for each feature, each with shape = (1, #features) |
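For reference, judging from the validation against scikit-learn further below, the statistic computed per feature j is the standard one-way chi-squared test:

\chi^2_j = \sum_{c=1}^{C} \frac{(O_{cj} - E_{cj})^2}{E_{cj}}, \qquad E_{cj} = \mathrm{class\_prob}_c \cdot \mathrm{feature\_count}_j

with the p-value taken from the chi-squared distribution with (#classes - 1) degrees of freedom.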

Example - chi2

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH stats AS (
  SELECT
    -- [UDAF] transpose_and_dot(Y::array<number>, X::array<number>)::array<array<double>>
    transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
    array_sum(X) AS feature_count, -- array<double>, shape = (1, n_features)
    array_avg(Y) AS class_prob -- array<double>, shape = (1, n_classes)
  FROM
    input
),
test AS (
  SELECT
    transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features)
  FROM
    stats
),
chi2 AS (
  SELECT
    -- [UDF] chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>, array<double>>
    chi2(observed, expected) AS chi2s -- struct<array<double>, array<double>>, each shape = (1, n_features)
  FROM
    test CROSS JOIN stats
)
SELECT
  -- [UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>
  select_k_best(X, chi2s.chi2, 2) -- top-2 feature selection based on chi2 score
FROM
  input CROSS JOIN chi2;

SNR

[UDAF] snr(X::array<number>, Y::array<int>)::array<double>

Input

| array X | array Y |
| --- | --- |
| a row of matrix, overall shape = (#samples, #features) | a row of one-hot matrix, overall shape = (#samples, #classes) |

Output

| array importance list |
| --- |
| SNR value of each feature, shape = (1, #features) |
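For reference, a common two-class definition of the signal-to-noise ratio of feature j (the multi-class aggregation is left to the implementation) is:

\mathrm{SNR}_j = \frac{|\mu_{1j} - \mu_{2j}|}{\sigma_{1j} + \sigma_{2j}}

where \mu_{cj} and \sigma_{cj} are the mean and standard deviation of feature j within class c.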

Note

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH snr AS (
  -- [UDAF] snr(features::array<number>, labels::array<int>)::array<double>
  SELECT snr(X, Y) AS snr FROM input -- aggregated SNR as array<double>, shape = (1, #features)
)
SELECT select_k_best(X, snr.snr, ${k}) FROM input CROSS JOIN snr;
myui commented 8 years ago

@amaya382 It depends on the input format of each algorithm. If we can use the same format for every input, it would be better to provide a single feature_selection function w/ an algorithm option.

Also, how to apply distributed processing is another issue. mRMR is processed in parallel by MapReduce in some implementations.
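If the input formats can be unified, such a wrapper might look like this (a hypothetical sketch; feature_selection and its option string do not exist yet):

-- '-algorithm' and '-k' are hypothetical options for this sketch
SELECT feature_selection(X, Y, '-algorithm chi2 -k ${k}') AS X_selected FROM input;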

myui commented 8 years ago

@amaya382 we can do better as follows:

X::relation(array<double>)  # n_examples * n_features matrix
Y::relation(array<int>)     # n_examples * n_classes matrix

observed = dotp(Y.T, X)     # n_classes * n_features matrix
UDAF transpose_and_dot(Y, X)::array<array<double>>

feature_count = X.sum       # n_features col vector
UDAF array_sum(X)::array<double>

class_prob = Y.mean         # n_class col vector
UDAF array_avg(Y)::array<double>

expected = dotp(class_prob.T, feature_count)    # n_class * n_features matrix
UDAF transpose_and_dot(class_prob, feature_count)::array<array<double>>

chi2,pval = chisquare(observed, expected)   # n_class * n_features => n_features
UDAF chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>,array<double>>

chi2    # n_features col vector
pval    # n_features col vector
create table input (
  X array<double>, -- features
  Y array<int> -- binarized label 
);

WITH stats AS (
  select
    -- UDAF transpose_and_dot(Y::array<double>, X::array<double>)::array<array<double>>
    transpose_and_dot(Y, X) as observed, -- array<array<double>> # n_classes * n_features matrix
    array_sum(X) as feature_count, -- n_features col vector # array<double>
    array_avg(Y) as class_prob -- n_class col vector # array<double>
  from
    input
),
test as (
  select
    observed, 
    -- UDAF transpose_and_dot(Y::array<double>, X::array<double>)::array<array<double>>
    transpose_and_dot(class_prob, feature_count) as expected -- array<array<double>> # n_class * n_features matrix
  from
    stats
),
chi2 as (
  select
    -- UDAF chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>,array<double>>
    chi2(observed, expected) as chi2s -- struct<array<double>,array<double>> # each n_features
  from
    test
)

select
  select_k_best(X, T.chi2s.chi2, ${k}) as X, -- feature selection based on chi2 score
  Y
from
  input
  CROSS JOIN chi2 T
;

What you have to develop is just the transpose_and_dot UDAF and the chi2 UDAF. transpose_and_dot is a bit tricky, but it can be implemented incrementally: each row contributes one outer-product term, and partial aggregates merge by element-wise addition.

amaya382 commented 8 years ago

@myui okay, I'll try this strategy

myui commented 8 years ago

@amaya382 could you validate your implementation against another chi2 implementation (e.g., scikit-learn) in the unit and system tests?

amaya382 commented 7 years ago

@myui

chi2 (Iris dataset)

Results from Hivemall via systemtest, which actually executes the query:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| chi2_vals | 10.817820878493995 | 3.5944990176817315 | 116.16984746363957 | 67.24482558215503 |
| p_vals | 0.004476514990225833 | 0.16575416718561453 | 0.0 | 2.55351295663786E-15 |

Results from scikit-learn:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| chi2_vals | 10.81782088 | 3.59449902 | 116.16984746 | 67.24482759 |
| p_vals | 4.47651499e-03 | 1.65754167e-01 | 5.94344354e-26 | 2.50017968e-15 |
amaya382 commented 7 years ago

@myui

SNR (Iris dataset)

Results from Hivemall via systemtest, which actually executes the query (incremental algorithm):

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| aggregated SNR | 3.2700426804423826 | 1.9034599450433007 | 11.359577906085278 | 9.794847727147216 |

Results from Python/NumPy (batch algorithm):

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| aggregated SNR | 3.27004268 | 1.90345995 | 11.35957791 | 9.79484773 |

Also, this has already been tested on EMR and worked properly.