jrossthomson / bigquery-utils

Useful scripts, udfs, views, and other utilities for migration and data warehouse operations in BigQuery.
https://cloud.google.com/bigquery/
Apache License 2.0
0 stars 0 forks source link

UDF: ANOVA #21

Open jrossthomson opened 2 years ago

jrossthomson commented 2 years ago

https://en.wikipedia.org/wiki/Analysis_of_variance

jrossthomson commented 2 years ago

https://www.itl.nist.gov/div898/handbook/ppc/section2/ppc232.htm

jrossthomson commented 2 years ago

Uploaded a testing version of the SQL

https://github.com/jrossthomson/bigquery-utils/blob/master/udfs/community/anovaftest.sqlx

Trying to determine if the answer is correct. I get different answers from jStat than I do from R.

My code gives:

+-------------------+
|      results      |
+-------------------+
| 4.559109578898445 |
+-------------------+

In R-lang:

jrossthomson@drjay:~/bigquery-utils/udfs/community$ R
INFO: This is the new version of R for gLinux. Learn more at http://go/new-r-google
R version 3.6.3 (2020-02-29). Docs: http://go/rlang. Help: http://g/r-users. Bugs: http://go/new-r-bug.
> iris <- read.csv("iris.csv")
> aov.iris <- aov(petal_width ~ species, data=iris)
> summary(aov.iris)
            Df Sum Sq Mean Sq F value Pr(>F)    
species      2  51.72  25.859   532.6 <2e-16 ***
Residuals   97   4.71   0.049                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 

In jStat:

import jStat from "../jstat/dist/jstat.js";

var virginica= [ 1.7,2.0,1.5,1.9,2.1,1.8,1.8,2.5,1.8,1.8,1.8,2.3,1.9,2.2,1.6,1.8,1.8,2.0,1.5,2.4,2.3,1.9,1.9,1.8,2.4,2.4,1.4,2.2,2.1,1.8,2.5,2.3,1.9,2.1,2.0,1.8,1.8,2.3,2.1,2.3,2.1,2.0,2.3,2.0,2.3,2.5,2.3,2.1,2.0,2.2]
var vesicolor=[1.1,1.0,1.0,1.0,1.2,1.3,1.3,1.3,1.5,1.5,1.5,1.5,1.5,1.6,1.3,1.7,1.2,1.4,1.1,1.3,1.3,1.4,1.8,1.0,1.0,1.1,1.0,1.3,1.3,1.5,1.4,1.3,1.6,1.0,1.2,1.4,1.4,1.3,1.5,1.5,1.3,1.5,1.2,1.3,1.3,1.5,1.2,1.4,1.4,1.6]
var setosa = [0.2,0.3,0.1,0.1,0.1,0.1,0.2,0.2,0.2,0.2,0.2,0.2,0.4,0.4,0.4,0.2,0.2,0.5,0.3,0.2,0.4,0.3,0.3,0.3,0.1,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.4,0.1,0.6,0.2,0.2,0.2,0.2,0.2,0.4,0.3,0.3,0.2,0.2,0.2,0.2,0.4]

console.log(jStat.anovafscore(virginica,vesicolor,setosa))

Gives the result:

jrossthomson@drjay:~/bigquery-utils$ node t.mjs 
959.3244057257613
jrossthomson commented 2 years ago

Iris Data Analysis

https://sites.calvin.edu/scofield/courses/s145/handouts/ch8.pdf

https://rpubs.com/CPEL/anova_ttest

imathews commented 2 years ago

@jrossthomson providing an alternative way to write the function below. This still returns the same (probably incorrect) result, but maybe the math in the final statement is a bit easier to debug:

DECLARE data ARRAY<STRUCT<factor STRING, val FLOAT64>>;
set data = (SELECT ARRAY(SELECT AS STRUCT species, petal_width FROM `bigquery-public-data.ml_datasets.iris`));

CREATE TEMP FUNCTION MMM (data ARRAY<STRUCT<factor STRING, val FLOAT64>>) AS ((
    WITH raw_data AS (
        SELECT d.val AS v, d.factor as f
        FROM UNNEST(data) AS d 
    ),
    global_stats AS (
        SELECT VAR_SAMP(v) as var, COUNT(*) as cnt, COUNT(DISTINCT f) as discnt FROM raw_data
    ),
    sample_stats AS (
        SELECT VAR_SAMP(v) as var, NULL as cnt, NULL as discnt FROM raw_data GROUP BY f
    )

    SELECT
        (SELECT var / (cnt-1) FROM global_stats) 
    / (
            (SELECT SUM(var) FROM sample_stats)  
            / (SELECT cnt - discnt FROM global_stats)
        )            
));

SELECT MMM(data) AS results;
jrossthomson commented 2 years ago

Thanks, this looks much more readable.

jrossthomson commented 2 years ago

For you R-lang lovers out there, here is the aov source.

https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/stats/R/aov.R

jrossthomson commented 2 years ago

Okay, this works:

DECLARE data ARRAY<STRUCT<factor STRING, val FLOAT64>>;
set data = (SELECT ARRAY(SELECT AS STRUCT species, petal_width FROM `bigquery-public-data.ml_datasets.iris`));

CREATE TEMP FUNCTION MMM (data ARRAY<STRUCT<factor STRING, val FLOAT64>>) AS ((
    WITH raw_data AS (
        SELECT d.val AS v, d.factor as f
        FROM UNNEST(data) AS d 
    ),
    global_stats AS (
        SELECT AVG(v) as M, COUNT(*) - COUNT(DISTINCT f)  as cnt FROM raw_data
    ),
    exp_var AS (
        SELECT COUNT(v) * POW(AVG(v) - (select M from global_stats), 2) as var, NULL as cnt, NULL as discnt FROM raw_data GROUP BY f
    ),
    unexp_var as (
        SELECT COUNT(v) * VAR_POP(v) as var FROM raw_data GROUP BY f
    )

    SELECT
        ( 
            (SELECT SUM(var) / (COUNT(var)-1) FROM exp_var) 
           /    (SELECT SUM(var) / (select cnt from global_stats) FROM unexp_var) 
        )

));

SELECT MMM(data) AS results;

Gives:

results
959.32440572575945
jrossthomson commented 2 years ago

For R-lang:

petal.width.aov <- aov(formula = Petal.Width ~ Species, data = iris)
> summary(object = petal.width.aov)
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  80.41   40.21     960 <2e-16 ***
Residuals   147   6.16    0.04                   
---
jrossthomson commented 2 years ago

@imathews can you try to run through the testing? I am stumped by:

 Compiling...
Step #5 - "test_udfs":
Step #5 - "test_udfs":   Compilation errors:
Step #5 - "test_udfs":   definitions/community/anovaftest.sqlx: Error: Actions cannot resolve operations which do not produce output.
Step #5 - "test_udfs":     at Session.resolve (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23452:31)
Step #5 - "test_udfs":     at OperationContext.resolve (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21970:39)
Step #5 - "test_udfs":     at OperationContext.self (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21954:21)
Step #5 - "test_udfs":     at sqlContextable (/workspace/tests/dataform_testing_framework/community_deploy/definitions/community/anovaftest.sqlx:33:30)
Step #5 - "test_udfs":     at OperationContext.apply (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21989:20)
Step #5 - "test_udfs":     at Operation.compile (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21944:40)
Step #5 - "test_udfs":     at /workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23592:46
Step #5 - "test_udfs":     at Array.forEach (<anonymous>)
Step #5 - "test_udfs":     at Session.compileGraphChunk (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23590:17)
Step #5 - "test_udfs":     at Session.compile (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23541:30)
jrossthomson commented 2 years ago

Build issues resolved.