exalate-issue-sync[bot] opened 1 year ago
Wendy commented: Karthik and I met on Oct 29, 2020 and had a good discussion. A summary of the work topics is as follows:
Wendy commented: Karthik: for all families, I am curious to see for what n an n-fold CV(lambda) will be close enough to the CV(lambda) of equation 3.10.
Wendy commented: November 6, 2020:
Meeting summary with Karthik:
We decided to do the following:
!image-20201106-231310.png|width=576,height=83!
and report the deviation between cv(lambda) from equations 3.10 and 3.19 (see the formulas after this list). Generate the two for many scale (lambda) values, then plot and compare them.
Come up with a scheme to decide which scale(lambda) values to try.
Reverse engineer R and see how it selects the scale for multiple GAM columns and other families.
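For reference, here are the two scores being compared, written in terms of the influence (hat) matrix A. The equation numbers refer to a document attached to this issue (likely GAM_doc.pdf), so the exact correspondence is an assumption; the two standard scores in play are the leave-one-out (ordinary) CV score and the GCV score:

$$\mathcal{V}_o = \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - \hat{f}_i)^2}{(1 - A_{ii})^2}, \qquad \mathcal{V}_g = \frac{n\sum_{i=1}^{n}(y_i - \hat{f}_i)^2}{\left[\,n - \operatorname{tr}(A)\,\right]^2}$$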
Wendy needs to:
A1. provide scripts to generate random GAM data, and
A2. point out where the calculation of A is in the GLM code.
Wendy commented: Completing A1: To generate a synthetic GAM dataset for the Gaussian family, use the following Python file. Store it under the h2o-3/h2o-py/tests/testdir_algos/gam directory:
{noformat}import sys
sys.path.insert(1, "../../../")
import h2o
from tests import pyunit_utils
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator as gam


def test_define_dataset():
    family = 'binomial'  # can be any valid GLM family
    nrow = 100000
    ncol = 1
    realFrac = 1
    intFrac = 0
    enumFrac = 0
    missing_fraction = 0
    factorRange = 50
    numericRange = 10
    targetFactor = 1
    numGamCols = 1
    assert numGamCols <= ncol*realFrac, "Number of gam columns {1} should not exceed the number of real columns " \
                                        "{0}".format(ncol*realFrac, numGamCols)  # gam columns can only be real columns
    gamDataSet = generate_dataset(family, nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
                                  numericRange, targetFactor, numGamCols)
    # h2o.download_csv(gamDataSet, "/Users/.../dataset.csv")  # save dataset
    assert gamDataSet.nrow == nrow, "Dataset number of rows: {0}, expected number of rows: {1}".format(gamDataSet.nrow,
                                                                                                       nrow)
    assert gamDataSet.ncol == (1+ncol), "Dataset number of columns: {0}, expected number of columns: " \
                                        "{1}".format(gamDataSet.ncol, (1+ncol))


def generate_dataset(family, nrow, ncol, realFrac, intFrac, enumFrac, missingFrac, factorRange, numericRange,
                     targetFactor, numGamCols):
    if family == "binomial":
        responseFactor = 2
    elif family == 'gaussian':
        responseFactor = 1
    else:
        responseFactor = targetFactor
    trainData = random_dataset(nrow, ncol, realFrac=realFrac, intFrac=intFrac, enumFrac=enumFrac, factorR=factorRange,
                               integerR=numericRange, responseFactor=responseFactor, misFrac=missingFrac)
    myX = trainData.names
    myY = 'response'
    myX.remove(myY)
    colNames = trainData.names
    colNames.remove("response")
    m = gam(family=family, max_iterations=10, gam_columns=colNames[0:numGamCols])
    m.train(training_frame=trainData, x=myX, y=myY)
    f2 = m.predict(trainData)
    # to see the coefficients, do m.coef()
    finalDataset = trainData[myX]
    finalDataset = finalDataset.cbind(f2[0])  # replace the response with the model prediction
    finalDataset.set_name(col=finalDataset.ncols-1, name='response')
    h2o.remove(trainData)
    return finalDataset


def random_dataset(nrow, ncol, realFrac=0.4, intFrac=0.3, enumFrac=0.3, factorR=10, integerR=100,
                   responseFactor=1, misFrac=0.01, randSeed=None):
    fractions = dict()
    if (ncol == 1) and (realFrac >= 1.0):  # all-real frame
        fractions["real_fraction"] = 1
        fractions["categorical_fraction"] = 0
        fractions["integer_fraction"] = 0
        fractions["time_fraction"] = 0
        fractions["string_fraction"] = 0  # right now we are dropping string columns, so no point in having them
        fractions["binary_fraction"] = 0
        return h2o.create_frame(rows=nrow, cols=ncol, missing_fraction=misFrac, has_response=True,
                                response_factors=responseFactor, integer_range=integerR,
                                seed=randSeed, **fractions)
    real_part = pyunit_utils.random_dataset_real_only(nrow, int(realFrac*ncol), misFrac=misFrac, randSeed=randSeed)
    enumFrac = enumFrac + (1-realFrac)/2
    intFrac = 1-enumFrac
    fractions["real_fraction"] = 0  # the real columns were generated separately above
    fractions["categorical_fraction"] = enumFrac
    fractions["integer_fraction"] = intFrac
    fractions["time_fraction"] = 0
    fractions["string_fraction"] = 0  # right now we are dropping string columns, so no point in having them
    fractions["binary_fraction"] = 0
    df = h2o.create_frame(rows=nrow, cols=(ncol-real_part.ncol), missing_fraction=misFrac, has_response=True,
                          response_factors=responseFactor, integer_range=integerR,
                          seed=randSeed, **fractions)
    return real_part.cbind(df)


if __name__ == "__main__":
    pyunit_utils.standalone_test(test_define_dataset)
else:
    test_define_dataset(){noformat}
Wendy commented: Nov 12, 2020
Meeting summary with Karthik M:
For the coming week, Karthik aims to finish the following:
Other discussion topics included:
Wendy commented: A good free interactive course for understanding GAMs:
https://noamross.github.io/gams-in-r-course/
Wendy commented: Nov 30, 2020
Met with Karthik. Here is the meeting summary:
!image-20201130-230501.png|width=485,height=161!
using the synthetic GAM dataset. Set lambda=0, alpha=0, and scale=0 to find an estimate of the MSE, then try scale parameter values from 0 up to that MSE, with the grid set as follows: let m be the maximum value you are going to try, set min_ratio to 1e-4, and let n be the number of scale parameters to try. Set decrement = Math.pow(min_ratio, 1.0/(n-1)), then use the following loop to determine all the scale parameters:
double[] scaleParam = new double[n];
double decrement = Math.pow(min_ratio, 1.0/(n-1));
scaleParam[0] = m;
for (int index = 1; index < n; index++)
    scaleParam[index] = (m *= decrement);
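A quick Python sketch of the same scheme (m, min_ratio, and n below are placeholder values, not from the meeting) shows the resulting grid is a geometric sequence running from m down to m*min_ratio:
{noformat}import math

m = 100.0          # hypothetical maximum scale value to try
min_ratio = 1e-4
n = 5              # hypothetical number of scale parameters to try

decrement = math.pow(min_ratio, 1.0 / (n - 1))
scale_param = [m]
for _ in range(1, n):
    m *= decrement
    scale_param.append(m)
print(scale_param)  # approximately [100.0, 10.0, 1.0, 0.1, 0.01]; the last value is m * min_ratio{noformat}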
Note that
!image-20201201-043904.png|width=386,height=64!
and in GLM, the following is calculated:
!image-20201201-043931.png|width=345,height=66!
You will need to use what GLM already computes but change it to calculate A instead. See our discussion during chat.
Wendy commented: December 3rd, 2020:
Met with Karthik and here is a summary of our discussion:
!image-20201203-222132.png|width=683,height=93!
to solve what he needs
!image-20201203-222148.png|width=291,height=33!
in order to calculate Vg.
Wendy commented: Let G = (T(X)X + lambda*S);
G*beta = T(X)y # this is the system chol.solve is trying to solve
chol.solve will return beta;
We decided to get A in two parts. First, get B = inverse(T(X)X + lambda*S) * T(X).
Then A = X*B.
B = inv(G)*T(X)
G*B = T(X) # solve for B one column at a time:
G*B1 = column 1 of T(X)
To generalize for the ith column:
G*(B * identity matrix column i) = T(X) * (identity matrix column i)
Part of the X matrix is stored in _dinfo._adaptedFrame; what _dinfo._adaptedFrame lacks is the column of ones (for the intercept).
X = [_dinfo._adaptedFrame col(ones)]
Here is an example of how to add the column of ones:
Frame X = new Frame(_dinfo._adaptedFrame);
X.add("colOnes", Vec.makeOne(_dinfo._adaptedFrame.numRows()));
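A minimal numpy sketch of this two-part scheme (a standalone illustration, not H2O code; X, S, and lam below are made-up stand-ins):
{noformat}import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=(20, 3)), np.ones(20)])  # append the column of ones
S = np.eye(4)    # stand-in penalty matrix
lam = 0.5        # stand-in lambda

G = X.T @ X + lam * S            # G = T(X)X + lambda*S
chol = cho_factor(G)             # Cholesky factorization of G
Xt = X.T
B = np.empty_like(Xt)
for i in range(Xt.shape[1]):     # solve G*B = T(X) one column at a time
    B[:, i] = cho_solve(chol, Xt[:, i])
A = X @ B                        # A = X * inv(G) * T(X), the influence matrix
print(np.trace(A))               # tr(A) is what the GCV formula above needs{noformat}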
Wendy commented: December 14, 2020
Met with Karthik and we discussed how to generate the cross-validation score.
{noformat}mseSum = 0
nrow = data.nrow
for ind in range(0, nrow):
    valid = data[ind, 0:2]
    temp1 = data[0:ind, 0:2]
    temp2 = data[ind+1:nrow, 0:2]
    if temp1.nrow == 0:
        train = temp2
    elif temp2.nrow == 0:
        train = temp1
    else:
        train = temp1.rbind(temp2)
    h2o_model = H2OGeneralizedAdditiveEstimator(family='gaussian', gam_columns=["C1"])
    h2o_model.train(y='response', training_frame=train, validation_frame=valid)
    mseSum = mseSum + h2o_model.mse(valid=True)
    h2o.remove(valid)   # need to remove them so as not to clog
    h2o.remove(temp1)   # up the memory of the machine
    h2o.remove(temp2)
    h2o.remove(train)
# the leave-one-out cross-validation score is mseSum / nrow{noformat}
Wendy commented: December 21, 2020
Meeting summary with Karthik today:
Wendy commented: December 28, 2020
Met with Karthik and we discussed the following:
Karthik Murthy commented: {noformat}import sys
sys.path.insert(1, "../../../")
import h2o
from tests import pyunit_utils
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator as gam
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
class cv_graph_generator:
loss_data = []
# This test will generate synthetic GAM dataset. If given to a GAM model, it should be able to perform well with
# this dataset since the assumptions associated with GAM are used to generate the dataset.
def test_define_dataset(self):
family = 'gaussian' # can be any valid GLM families
nrow = 100
ncol = 1
realFrac = 1
intFrac = 0
enumFrac = 0
missing_fraction = 0
factorRange= 50
numericRange = 10
targetFactor = 1
numGamCols = 1
min_ratio = 1e-1
num_trials = 10
nfolds = nrow
loss = self.generate_dataset(family, nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
numericRange, targetFactor, numGamCols, min_ratio, nfolds, num_trials)
df = pd.DataFrame(loss)
df.to_csv("loss-data.csv")
print(df)
print("Done")
# self.loss_data.append(self.generate_dataset("binomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, targetFactor, numGamCols, scale, nfolds, scale_div))
#
# self.loss_data.append(self.generate_dataset("binomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, targetFactor, numGamCols, scale, 5, scale_div))
#
# self.loss_data.append(self.generate_dataset("multinomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, 5, numGamCols, scale, nfolds, scale_div))
#
# self.loss_data.append(self.generate_dataset("multinomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, 5, numGamCols, scale, 5, scale_div))
def generate_dataset(self, family, nrow, ncol, realFrac, intFrac, enumFrac, missingFrac, factorRange, numericRange,
targetFactor, numGamCols, min_ratio=1e-4, nfolds=0, num_trials=1):
if family=="binomial":
responseFactor = 2
elif family == 'gaussian':
responseFactor = 1
else :
responseFactor = targetFactor
trainData = self.random_dataset(nrow, ncol, realFrac=realFrac, intFrac=intFrac, enumFrac=enumFrac, factorR=factorRange,
integerR=numericRange, responseFactor=responseFactor, misFrac=missingFrac)
myX = trainData.names
myY = 'response'
myX.remove(myY)
colNames = trainData.names
colNames.remove("response")
avg_loss = []
scale = 2947.189508523056 * 1000
scaleParam = []
for i in range(2, num_trials + 2):
dec = min_ratio**(1.0/(i - 1))
scale *= dec
scaleParam.append(scale)
m = gam(family=family, gam_columns = colNames[0:numGamCols], lambda_=0, alpha=0, scale=[scale], nfolds=nfolds, fold_assignment="modulo", seed=1)
m.train(training_frame=trainData, x=myX, y=myY)
# loss = 0
# for j in range(nfolds):
# loss += (m.cross_validation_models()[j].mse() / nfolds)
avg_loss.append((-(2 * (i - 1)), m.mse(xval=True)))
f2 = m.predict(trainData)
# to see coefficient, do m.coef()
finalDataset = trainData[myX]
finalDataset = finalDataset.cbind(f2[0])
finalDataset.set_name(col=finalDataset.ncols-1, name='response')
h2o.download_csv(finalDataset, "dataset.csv")
return avg_loss
def random_dataset(self, nrow, ncol, realFrac = 0.4, intFrac = 0.3, enumFrac = 0.3, factorR = 10, integerR=100,
responseFactor = 1, misFrac=0.01, randSeed=7):
fractions = dict()
if (ncol==1) and (realFrac >= 1.0):
fractions["real_fraction"] = 1 # Right now we are dropping string columns, so no point in having them.
fractions["categorical_fraction"] = 0
fractions["integer_fraction"] = 0
fractions["time_fraction"] = 0
fractions["string_fraction"] = 0 # Right now we are dropping string columns, so no point in having them.
fractions["binary_fraction"] = 0
return h2o.create_frame(rows=nrow, cols=ncol, missing_fraction=misFrac, has_response=True,
response_factors = responseFactor, integer_range=integerR,
seed=randSeed, **fractions)
real_part = pyunit_utils.random_dataset_real_only(nrow, (int)(realFrac*ncol), misFrac=misFrac, randSeed=randSeed)
enumFrac = enumFrac + (1-realFrac)/2
intFrac = 1-enumFrac
fractions["real_fraction"] = 0 # Right now we are dropping string columns, so no point in having them.
fractions["categorical_fraction"] = enumFrac
fractions["integer_fraction"] = intFrac
fractions["time_fraction"] = 0
fractions["string_fraction"] = 0 # Right now we are dropping string columns, so no point in having them.
fractions["binary_fraction"] = 0
df = h2o.create_frame(rows=nrow, cols=(ncol-real_part.ncol), missing_fraction=misFrac, has_response=True,
response_factors=responseFactor, integer_range=integerR,
seed=randSeed, **fractions)
return real_part.cbind(df)
def generate_graphs():
    generator = cv_graph_generator()
    generator.test_define_dataset()
    for dataset in generator.loss_data:
        print(dataset)
    print("done")
if name == "main": h2o.init(ip='192.168.1.4', port=54321, strict_version_check=False) pyunit_utils.standalone_test(generate_graphs()) else: h2o.init(ip='192.168.1.4', port=54321, strict_version_check=False) generate_graphs() {noformat}
Wendy commented: Jan 4, 2021 Meeting summary with Karthik
require(gamair); data(engine); attach(engine)
plot(size,wear,xlab="Engine capacity",ylab="Wear index")
The rest of the code can be found on pages 165 to 170 of the document I sent you before. It is chapter 4 of Simon Wood's book.
All the formulas describing the X matrix assume that we have added a column of ones for the intercept coefficient. However, in model._output._dinfo._adaptedFrame there is no column of ones, and that is fine when you use chol.solve to get the coefficients. In the last stage, where you compute X * B, you will need to add the column of ones as the last column of X.
You are going to solve for B one column at a time. In our discussion, each column of B is an array, and the final B will be converted to a frame. There are multiple ways to generate a frame; one is to call new Frame on an array of vectors. Here is an example:
{noformat}Vec[] res = new Vec[ncoly];
// generate keys for all the vectors
Key<Vec>[] keys = ...;  // this line was truncated in the original comment
for (int y = 0; y < ncoly; y++) {
    res[y] = Vec.makeVec(res_array[y], keys[y]);
}{noformat}
You can get a frame by calling {{new Frame(col_names, res)}}, where col_names is a string array.
Wendy commented: The scale parameter controls overfitting if too many knots are specified. Our dataset was generated with a small number of knots, and therefore there is no wiggliness. I have generated a new dataset with more wiggliness; here it is:
[^gam_1Col_40perKnots_2000Rows.csv]
Wendy commented: Try this code and you will see the MSE curve actually is a bowl:
{noformat}import sys
sys.path.insert(1, "../../../")
import h2o
from tests import pyunit_utils
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator as gam
def generate_graphs():
input = h2o.import_file("/Users/wendycwong/temp/gam_1Col_40perKnots_2000Rows.csv")
frames = input.split_frame(ratios=[0.05]) # can change the ratio to higher, I was running out of memory
train = frames[0]
response = 'response'
scale_parameter = [0, 0.0001, 0.001, 0.01, 0.1, 1, 10]
num_knots = [int(0.1*train.nrow), int(train.nrow*0.2), int(0.3*train.nrow), int(0.4*train.nrow), int(train.nrow*0.5)]
xval_mse_total = []
val_mse_total = []
for numKnots in num_knots:
xval_mse = []
val_mse = []
for scale in scale_parameter:
gam_model = gam(family = "gaussian", alpha = 0, Lambda = 0, gam_columns = ["C1"], scale = [scale],
nfolds = train.nrow, fold_assignment="modulo", num_knots=[numKnots])
gam_model.train(x=[], y=response, training_frame = train, validation_frame=frames[1])
xval_mse.append(gam_model.mse(xval=True))
val_mse.append(gam_model.mse(valid=True))
xval_mse_total.append(xval_mse)
val_mse_total.append(val_mse)
print(xval_mse_total)
print(val_mse_total)
if name == "main": h2o.init(ip = "192.168.86.41", port = 54321, strict_version_check=False) pyunit_utils.standalone_test(generate_graphs()) else: h2o.init(ip = "192.168.86.41", port = 54321, strict_version_check=False) generate_graphs(){noformat}
Wendy commented: For cases with a fixed number of knots, if we do not see a bowl, it probably means that no overfitting is occurring, and we can just set the scale close to 0.
Wendy commented: Karthik:
Attached is a good document describing overfitting:
https://www.eecis.udel.edu/~arce/files/Courses/StatLearning/Overfitting%20and%20Regularization.pdf
We can discuss this on Monday.
Wendy
Wendy commented: Jan 11, 2021
Had a meeting with Karthik and we discussed the following:
Wendy commented: Jan 18, 2021
Summarized discussion with Karthik here:
For this week, Karthik will focus on these two tasks:
Karthik Murthy commented: Here is the test dataset I used when generating cross-validation scores:
[^good-dataset.csv]
Wendy commented: Jan 25, 2021:
Summarized discussion with Karthik:
Wendy commented: Feb 11, 2021:
Met with Karthik and here is the summary:
Thanks, Wendy
JIRA Issue Migration Info
Jira Issue: PUBDEV-7865
Assignee: Karthik Murthy
Reporter: Wendy
State: Open
Fix Version: N/A
Attachments: Available (Count: 9)
Development PRs: Available
Linked PRs from JIRA
https://github.com/h2oai/h2o-3/pull/5241
https://github.com/h2oai/h2o-3/pull/5332
Attachments From Jira
Attachment Name: gam_1Col_40perKnots_2000Rows.csv Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/gam_1Col_40perKnots_2000Rows.csv
Attachment Name: GAM_doc.pdf Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/GAM_doc.pdf
Attachment Name: good-dataset.csv Attached By: Karthik Murthy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/good-dataset.csv
Attachment Name: image-20201106-231310.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201106-231310.png
Attachment Name: image-20201130-230501.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201130-230501.png
Attachment Name: image-20201201-043904.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201201-043904.png
Attachment Name: image-20201201-043931.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201201-043931.png
Attachment Name: image-20201203-222132.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201203-222132.png
Attachment Name: image-20201203-222148.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201203-222148.png
Wendy commented: The Simon Wood book can be found in this JIRA: https://h2oai.atlassian.net/browse/PUBDEV-6781?jql=text%20~%20%22GAM%22 Read section 4.2.3.
I emailed you the other book; read section 3.4.