Open jrossthomson opened 2 years ago
Uploaded a testing version of the SQL
https://github.com/jrossthomson/bigquery-utils/blob/master/udfs/community/anovaftest.sqlx
Trying to determine if the answer is correct. I get different answers from jStat than I do from R.
My code gives:
+-------------------+
| results |
+-------------------+
| 4.559109578898445 |
+-------------------+
In R-lang:
jrossthomson@drjay:~/bigquery-utils/udfs/community$ R
INFO: This is the new version of R for gLinux. Learn more at http://go/new-r-google
R version 3.6.3 (2020-02-29). Docs: http://go/rlang. Help: http://g/r-users. Bugs: http://go/new-r-bug.
> iris <- read.csv("iris.csv")
> aov.iris <- aov(petal_width ~ species, data=iris)
> summary(aov.iris)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 51.72 25.859 532.6 <2e-16 ***
Residuals 97 4.71 0.049
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
In jStat:
import jStat from "../jstat/dist/jstat.js";
var virginica= [ 1.7,2.0,1.5,1.9,2.1,1.8,1.8,2.5,1.8,1.8,1.8,2.3,1.9,2.2,1.6,1.8,1.8,2.0,1.5,2.4,2.3,1.9,1.9,1.8,2.4,2.4,1.4,2.2,2.1,1.8,2.5,2.3,1.9,2.1,2.0,1.8,1.8,2.3,2.1,2.3,2.1,2.0,2.3,2.0,2.3,2.5,2.3,2.1,2.0,2.2]
var vesicolor=[1.1,1.0,1.0,1.0,1.2,1.3,1.3,1.3,1.5,1.5,1.5,1.5,1.5,1.6,1.3,1.7,1.2,1.4,1.1,1.3,1.3,1.4,1.8,1.0,1.0,1.1,1.0,1.3,1.3,1.5,1.4,1.3,1.6,1.0,1.2,1.4,1.4,1.3,1.5,1.5,1.3,1.5,1.2,1.3,1.3,1.5,1.2,1.4,1.4,1.6]
var setosa = [0.2,0.3,0.1,0.1,0.1,0.1,0.2,0.2,0.2,0.2,0.2,0.2,0.4,0.4,0.4,0.2,0.2,0.5,0.3,0.2,0.4,0.3,0.3,0.3,0.1,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.4,0.1,0.6,0.2,0.2,0.2,0.2,0.2,0.4,0.3,0.3,0.2,0.2,0.2,0.2,0.4]
console.log(jStat.anovafscore(virginica,vesicolor,setosa))
Gives the result:
jrossthomson@drjay:~/bigquery-utils$ node t.mjs
959.3244057257613
@jrossthomson providing an alternative way to write the function below. This still returns the same (probably incorrect) result, but maybe the math in the final statement is a bit easier to debug:
DECLARE data ARRAY<STRUCT<factor STRING, val FLOAT64>>;
set data = (SELECT ARRAY(SELECT AS STRUCT species, petal_width FROM `bigquery-public-data.ml_datasets.iris`));
CREATE TEMP FUNCTION MMM (data ARRAY<STRUCT<factor STRING, val FLOAT64>>) AS ((
WITH raw_data AS (
SELECT d.val AS v, d.factor as f
FROM UNNEST(data) AS d
),
global_stats AS (
SELECT VAR_SAMP(v) as var, COUNT(*) as cnt, COUNT(DISTINCT f) as discnt FROM raw_data
),
sample_stats AS (
SELECT VAR_SAMP(v) as var, NULL as cnt, NULL as discnt FROM raw_data GROUP BY f
)
SELECT
(SELECT var / (cnt-1) FROM global_stats)
/ (
(SELECT SUM(var) FROM sample_stats)
/ (SELECT cnt - discnt FROM global_stats)
)
));
SELECT MMM(data) AS results;
Thanks, this looks much more readable.
For you R-lang lovers out there, here is the aov source.
Okay, this works:
DECLARE data ARRAY<STRUCT<factor STRING, val FLOAT64>>;
set data = (SELECT ARRAY(SELECT AS STRUCT species, petal_width FROM `bigquery-public-data.ml_datasets.iris`));
CREATE TEMP FUNCTION MMM (data ARRAY<STRUCT<factor STRING, val FLOAT64>>) AS ((
WITH raw_data AS (
SELECT d.val AS v, d.factor as f
FROM UNNEST(data) AS d
),
global_stats AS (
SELECT AVG(v) as M, COUNT(*) - COUNT(DISTINCT f) as cnt FROM raw_data
),
exp_var AS (
SELECT COUNT(v) * POW(AVG(v) - (select M from global_stats), 2) as var, NULL as cnt, NULL as discnt FROM raw_data GROUP BY f
),
unexp_var as (
SELECT COUNT(v) * VAR_POP(v) as var FROM raw_data GROUP BY f
)
SELECT
(
(SELECT SUM(var) / (COUNT(var)-1) FROM exp_var)
/ (SELECT SUM(var) / (select cnt from global_stats) FROM unexp_var)
)
));
SELECT MMM(data) AS results;
Gives:
results
959.32440572575945
For R-lang:
petal.width.aov <- aov(formula = Petal.Width ~ Species, data = iris)
> summary(object = petal.width.aov)
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 80.41 40.21 960 <2e-16 ***
Residuals 147 6.16 0.04
---
@imathews can you try to run through the testing? I am stumped by:
Compiling...
Step #5 - "test_udfs":
Step #5 - "test_udfs": Compilation errors:
Step #5 - "test_udfs": definitions/community/anovaftest.sqlx: Error: Actions cannot resolve operations which do not produce output.
Step #5 - "test_udfs": at Session.resolve (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23452:31)
Step #5 - "test_udfs": at OperationContext.resolve (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21970:39)
Step #5 - "test_udfs": at OperationContext.self (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21954:21)
Step #5 - "test_udfs": at sqlContextable (/workspace/tests/dataform_testing_framework/community_deploy/definitions/community/anovaftest.sqlx:33:30)
Step #5 - "test_udfs": at OperationContext.apply (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21989:20)
Step #5 - "test_udfs": at Operation.compile (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:21944:40)
Step #5 - "test_udfs": at /workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23592:46
Step #5 - "test_udfs": at Array.forEach (<anonymous>)
Step #5 - "test_udfs": at Session.compileGraphChunk (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23590:17)
Step #5 - "test_udfs": at Session.compile (/workspace/tests/dataform_testing_framework/community_deploy/node_modules/@dataform/core/bundle.js:23541:30)
Build issues resolved.
https://en.wikipedia.org/wiki/Analysis_of_variance